Message-ID: <4EAA9E07.3060704@redhat.com>
Date: Fri, 28 Oct 2011 09:20:23 -0300
From: Cleber Rosa
References: <1316443033-6489-1-git-send-email-freddy77@gmail.com> <4EA95BFF.6070807@redhat.com> <20111027135731.GA21052@stefanha-thinkpad.localdomain> <4EA96776.6020807@redhat.com> <4EA96B82.6070507@redhat.com> <4EAA9310.2030705@redhat.com>
In-Reply-To: <4EAA9310.2030705@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2] block: avoid SIGUSR2
Reply-To: cleber@redhat.com
To: Kevin Wolf
Cc: Lucas Meneghel Rodrigues, aliguori@us.ibm.com, Stefan Hajnoczi, qemu-devel@nongnu.org, Frediano Ziglio, Paolo Bonzini

On 10/28/2011 08:33 AM, Kevin Wolf wrote:
> On 27.10.2011 16:32, Kevin Wolf wrote:
>> On 27.10.2011 16:15, Kevin Wolf wrote:
>>> On 27.10.2011 15:57, Stefan Hajnoczi wrote:
>>>> On Thu, Oct 27, 2011 at 03:26:23PM +0200, Kevin Wolf wrote:
>>>>> On 19.09.2011 16:37, Frediano Ziglio wrote:
>>>>>> Now that iothread is always compiled sending a signal seems only an
>>>>>> additional step. This patch also avoid writing to two pipe (one from signal
>>>>>> and one in qemu_service_io).
>>>>>>
>>>>>> Work with kvm enabled or disabled. strace output is more readable (less syscalls).
>>>>>>
>>>>>> Signed-off-by: Frediano Ziglio
>>>>> Something in this change has bad effects, in the sense that it seems to
>>>>> break bdrv_read_em.
>>>> How does it break bdrv_read_em? Are you seeing QEMU hung with 100% CPU
>>>> utilization or deadlocked?
>>> Sorry, I should have been more detailed here.
>>>
>>> No, it's nothing obvious, it must be some subtle side effect. The result
>>> of bdrv_read_em itself seems to be correct (return value and checksum of
>>> the read buffer).
>>>
>>> However instead of booting into the DOS setup I only get an error
>>> message "Kein System oder Laufwerksfehler" (don't know how it reads in
>>> English DOS versions), which seems to be produced by the boot sector.
>>>
>>> I excluded all of the minor changes, so I'm sure that it's caused by the
>>> switch from kill() to a direct call of the function that writes into the
>>> pipe.
>>>
>>>> One interesting thing is that qemu_aio_wait() does not release the QEMU
>>>> mutex, so we cannot write to a pipe with the mutex held and then spin
>>>> waiting for the iothread to do work for us.
>>>>
>>>> Exactly how kill and qemu_notify_event() were different I'm not sure
>>>> right now but it could be a factor.
>>> This would cause a hang, right? Then it isn't what I'm seeing.
>> While trying out some more things, I added some fprintfs to
>> posix_aio_process_queue() and suddenly it also fails with the kill()
>> version. So what has changed might really just be the timing, and it
>> could be a race somewhere that has always (?) existed.
> Replying to myself again... It looks like there is a problem with
> reentrancy in fdctrl_transfer_handler. I think this would have been
> guarded by the AsyncContexts before, but we don't have them any more.
>
> qemu-system-x86_64: /root/upstream/qemu/hw/fdc.c:1253:
> fdctrl_transfer_handler: Assertion `reentrancy == 0' failed.
>
> Program received signal SIGABRT, Aborted.
>
> (gdb) bt
> #0  0x0000003ccd2329a5 in raise () from /lib64/libc.so.6
> #1  0x0000003ccd234185 in abort () from /lib64/libc.so.6
> #2  0x0000003ccd22b935 in __assert_fail () from /lib64/libc.so.6
> #3  0x000000000046ff09 in fdctrl_transfer_handler (opaque=<value optimized out>,
>     nchan=<value optimized out>, dma_pos=<value optimized out>,
>     dma_len=<value optimized out>) at /root/upstream/qemu/hw/fdc.c:1253
> #4  0x000000000046702c in channel_run () at /root/upstream/qemu/hw/dma.c:348
> #5  DMA_run () at /root/upstream/qemu/hw/dma.c:378
> #6  0x000000000040b0e1 in qemu_bh_poll () at async.c:70
> #7  0x000000000040aa19 in qemu_aio_wait () at aio.c:147
> #8  0x000000000041c355 in bdrv_read_em (bs=0x131fd80, sector_num=19,
>     buf=<value optimized out>, nb_sectors=1) at block.c:2896
> #9  0x000000000041b3d2 in bdrv_read (bs=0x131fd80, sector_num=19,
>     buf=0x1785a00 "IO SYS!", nb_sectors=1) at block.c:1062
> #10 0x000000000041b3d2 in bdrv_read (bs=0x131f430, sector_num=19,
>     buf=0x1785a00 "IO SYS!", nb_sectors=1) at block.c:1062
> #11 0x000000000046fbb8 in do_fdctrl_transfer_handler (opaque=0x1785788,
>     nchan=2, dma_pos=<value optimized out>, dma_len=512)
>     at /root/upstream/qemu/hw/fdc.c:1178
> #12 0x000000000046fecf in fdctrl_transfer_handler (opaque=<value optimized out>,
>     nchan=<value optimized out>, dma_pos=<value optimized out>,
>     dma_len=<value optimized out>) at /root/upstream/qemu/hw/fdc.c:1255
> #13 0x000000000046702c in channel_run () at /root/upstream/qemu/hw/dma.c:348
> #14 DMA_run () at /root/upstream/qemu/hw/dma.c:378
> #15 0x000000000046e456 in fdctrl_start_transfer (fdctrl=0x1785788,
>     direction=1) at /root/upstream/qemu/hw/fdc.c:1107
> #16 0x0000000000558a41 in kvm_handle_io (env=0x1323ff0)
>     at /root/upstream/qemu/kvm-all.c:834
> #17 kvm_cpu_exec (env=0x1323ff0) at /root/upstream/qemu/kvm-all.c:976
> #18 0x000000000053686a in qemu_kvm_cpu_thread_fn (arg=0x1323ff0)
>     at /root/upstream/qemu/cpus.c:661
> #19 0x0000003ccda077e1 in start_thread () from /lib64/libpthread.so.0
> #20 0x0000003ccd2e151d in clone () from /lib64/libc.so.6
>
> I'm afraid that we can only avoid things like this reliably if we
> convert all devices to be direct users of
> AIO/coroutines. The current
> block layer infrastructure doesn't emulate the behaviour of bdrv_read
> accurately as bottom halves can be run in the nested main loop.
>
> For floppy, the following seems to be a quick fix (Lucas, Cleber, does
> this solve your problems?), though it's not very satisfying. And I'm not
> quite sure yet why it doesn't always happen with kill() in
> posix-aio-compat.c.
>
> diff --git a/hw/dma.c b/hw/dma.c
> index 8a7302a..1d3b6f1 100644
> --- a/hw/dma.c
> +++ b/hw/dma.c
> @@ -358,6 +358,13 @@ static void DMA_run (void)
>      struct dma_cont *d;
>      int icont, ichan;
>      int rearm = 0;
> +    static int running = 0;
> +
> +    if (running) {
> +        goto out;
> +    } else {
> +        running = 0;
> +    }
>
>      d = dma_controllers;
>
> @@ -374,6 +381,8 @@ static void DMA_run (void)
>          }
>      }
>
> +out:
> +    running = 0;
>      if (rearm)
>          qemu_bh_schedule_idle(dma_bh);
>  }
>
> Kevin

Kevin,

In my quick test (compiling qemu.git master + your dma patch, and
running a FreeDOS floppy image) the patch does not make any visible
difference. The boot is still stuck after printing "FreeDOS" at the
console.

PS: We will trigger a full-blown test, with a Windows installation using
a floppy, but the results with the FreeDOS floppy have so far been very
consistent with those of the full-blown test.
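
For reference, here is a minimal standalone sketch of the reentrancy
pattern in the backtrace and of the kind of guard the quoted hunk is
aiming for. It is not QEMU code: dma_run(), bh_poll(), sync_read(),
transfer_handler() and simulate() are hypothetical stand-ins for
DMA_run(), qemu_bh_poll(), bdrv_read_em()/qemu_aio_wait() and
fdctrl_transfer_handler(). As quoted, the hunk assigns 0 to "running" in
both branches; the sketch assumes the flag is meant to be set to 1 on
entry and cleared only by the outermost call.

```c
#include <assert.h>

/* All names below are hypothetical stand-ins, not QEMU's real API. */

static int depth;       /* current nesting level of the transfer handler */
static int max_depth;   /* deepest nesting level observed in one run */
static int pending_bh;  /* 1 while a "bottom half" is scheduled */
static int use_guard;   /* 1 enables the reentrancy guard in dma_run() */

static void dma_run(void);

/* Stand-in for the bottom-half poll: runs the scheduled callback once. */
static void bh_poll(void)
{
    if (pending_bh) {
        pending_bh = 0;
        dma_run();
    }
}

/* Stand-in for a synchronous read emulated with a nested event loop. */
static void sync_read(void)
{
    bh_poll();
}

/* Stand-in for the device transfer handler, which must not be re-entered. */
static void transfer_handler(void)
{
    depth++;
    if (depth > max_depth) {
        max_depth = depth;
    }
    sync_read();    /* may poll bottom halves and call dma_run() again */
    depth--;
}

/* Stand-in for the DMA runner, with an optional reentrancy guard. */
static void dma_run(void)
{
    static int running = 0;

    if (use_guard) {
        if (running) {
            return;     /* nested call: leave the outer call's flag alone */
        }
        running = 1;    /* assumption: the flag is set on entry */
    }

    transfer_handler();

    if (use_guard) {
        running = 0;    /* cleared only by the outermost call */
    }
}

/* Runs one simulated transfer and returns the deepest handler nesting. */
static int simulate(int guard)
{
    use_guard = guard;
    depth = 0;
    max_depth = 0;
    pending_bh = 1;     /* a bottom half is already queued */
    dma_run();
    assert(depth == 0); /* every handler entry has been left again */
    return max_depth;
}
```

In this sketch simulate(0) returns 2 (the handler is re-entered through
the nested poll, as in frames #3 to #12 of the backtrace), while
simulate(1) returns 1. The nested call simply returns here; whether the
real code would also need to re-schedule the bottom half in that case is
left open.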