From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:49525) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZCBx8-0001KQ-0v for qemu-devel@nongnu.org; Mon, 06 Jul 2015 15:21:22 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZCBx5-0002tr-8X for qemu-devel@nongnu.org; Mon, 06 Jul 2015 15:21:17 -0400 Message-ID: <559AD528.5030903@redhat.com> Date: Mon, 06 Jul 2015 15:21:12 -0400 From: John Snow MIME-Version: 1.0 References: <559AA1BB.8020709@redhat.com> <559AABAF.9020208@redhat.com> In-Reply-To: <559AABAF.9020208@redhat.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] qtest hang in /x86_64/ahci/io/ncq/simple (ppc64 host) List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Paolo Bonzini , Peter Maydell , QEMU Developers , Stefan Hajnoczi , "qemu-ppc@nongnu.org" On 07/06/2015 12:24 PM, Paolo Bonzini wrote: >=20 >=20 > On 06/07/2015 17:41, John Snow wrote: >> I'll take a look. More than possible there's a race in the wait >> conditional that I have just never seen before. >> >> I tweaked it a little recently to wait on both NCQ and traditional >> completion, but it doesn't have a timeout or anything. I'll go ahead a= nd >> fix that while I'm here. >> >> Should be something simple, hopefully. The one Peter Maydell reported is indeed something simple. Sent out a pat= ch. >=20 > My hang in test_halted_ncq is different: ... Great :) >=20 > (gdb) up > #1 0x00007f3aaadb4476 in read (__nbytes=3D1, __buf=3D0x7ffc0652abbf, > __fd=3D) at /usr/include/bits/unistd.h:44 > 44 return __read_alias (__fd, __buf, __nbytes); > (gdb) > #2 qtest_qmp_receive (s=3Ds@entry=3D0x7f3aac017360) at > /home/work/upstream/qemu/tests/libqtest.c:371 > 371 len =3D read(s->qmp_fd, &c, 1); > (gdb) > #3 0x00007f3aaadb4c48 in qtest_qmp_eventwait (s=3D0x7f3aac017360, > event=3Devent@entry=3D0x7f3aaadbdc68 "STOP") at > /home/work/upstream/qemu/tests/libqtest.c:477 > 477 response =3D qtest_qmp_receive(s); > (gdb) > #4 0x00007f3aaadb3514 in qmp_eventwait (event=3D0x7f3aaadbdc68 "STOP")= at > /home/work/upstream/qemu/tests/libqtest.h:498 > 498 return qtest_qmp_eventwait(global_qtest, event); > (gdb) > #5 ahci_guest_io_halt (ahci=3Dahci@entry=3D0x7f3aac014820, > port=3Dport@entry=3D5 '\005', ide_cmd=3Dide_cmd@entry=3D97 'a', > buffer=3Dbuffer@entry=3D1097728, bufsize=3Dbufsize@entry=3D4096, > sector=3Dsector@entry=3D0) > at /home/work/upstream/qemu/tests/libqos/ahci.c:604 > 604 qmp_eventwait("STOP"); > (gdb) > #6 0x00007f3aaada9e09 in ahci_halted_io_test (cmd_read=3D, > cmd_write=3D) at > /home/work/upstream/qemu/tests/ahci-test.c:1244 > 1244 cmd =3D ahci_guest_io_halt(ahci, port, cmd_write, >=20 >=20 > In QEMU, current_run_state is RUN_STATE_RUNNING, so the halt didn't > happen. It's 100% reproducible with "make check-qtest-{i386,x86_64} V=3D= 1 > -j12". This is the final part of the qtest log: >=20 > [R +0.066539] readl 0xe0000390 > [S +0.066553] OK 0x0000000000000000 > [R +0.066582] memset 0x10b000 0x100 0x00 > [S +0.066597] OK > [R +0.067627] write 0x10c000 0x1000 [snip] > [S +0.067748] OK > [R +0.067782] readl 0xe00003b8 > [S +0.067796] OK 0x0000000000000000 > [R +0.067824] read 0x10a020 0x20 > [S +0.067843] OK > 0x0000000000000000000000000000000000000000000000000000000000000000 > [R +0.067874] write 0x10a020 0x20 > 0x0000000000000000000000000000000000000000000000000000000000000000 > [S +0.067897] OK > [R +0.067926] write 0x10a020 0x20 > 0x450401000000000000e010000000000000000000000000000000000000000000 > [S +0.067940] OK > [R +0.067964] write 0x10e000 0x14 0x27806108000000400000000008000000000= 00000 > [S +0.067980] OK > [R +0.068004] write 0x10e080 0x10 0x00c010000000000000000000ff0f0080 > [S +0.068018] OK > [R +0.068042] writel 0xe00003b4 0x2 > [S +0.068056] OK > [R +0.068079] writel 0xe00003b8 0x2 > [S +0.069500] OK >=20 > This is the /tmp/qtest-blkdebug.INHp7s file: >=20 > [inject-error] > event =3D "write_aio" > errno =3D "5" > state =3D "1" > immediately =3D "off" > once =3D "on" > [set-state] > event =3D "write_aio" > new_state =3D "2" >=20 > There is no trace of the write payload in the .qcow2 file. Smells like the injected error prevented that successfully, but... >=20 > Paolo >=20 Can't reproduce: Fedora 21, 4.0.5, Today's QEMU HEAD (7edd8e56) ... So it's a race condition of some sort. Obvious questions: You've tried a distclean? What was your configure invocation? My attempt: make distclean ../../configure --target-list=3D"x86_64-softmmu i386-softmmu" --enable-de= bug make -j9 make check-qtest-{i386,x86_64} V=3D1 -j12 Completes just fine for me. That aside, I can't figure out why the machine HALT *and* the write wouldn't occur. Can you stick a bunch of printfs inside the NCQ error handling function and see if there's something goofing up in there, or if it's even making it there? diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c index bb6a92f..f1af6f9 100644 --- a/hw/ide/ahci.c +++ b/hw/ide/ahci.c @@ -921,6 +921,7 @@ static void ncq_cb(void *opaque, int ret) IDEState *ide_state =3D &ncq_tfs->drive->port.ifs[0]; if (ret =3D=3D -ECANCELED) { + fprintf(stderr, "ncq_cb: -ECANCELED\n"); return; } @@ -928,12 +929,17 @@ static void ncq_cb(void *opaque, int ret) bool is_read =3D ncq_tfs->cmd =3D=3D READ_FPDMA_QUEUED; BlockErrorAction action =3D blk_get_error_action(ide_state->blk, is_read, -ret); + fprintf(stderr, "ncq_cb: ret (%d) is < 0; action is %s\n", + ret, BlockErrorAction_lookup[action]); if (action =3D=3D BLOCK_ERROR_ACTION_STOP) { ncq_tfs->halt =3D true; ide_state->bus->error_status =3D IDE_RETRY_HBA; + fprintf(stderr, "ncq_cb: BLOCK_ERROR_ACTION_STOP\n"); } else if (action =3D=3D BLOCK_ERROR_ACTION_REPORT) { ncq_err(ncq_tfs); + fprintf(stderr, "ncq_cb: BLOCK_ERROR_ACTION_REPORT\n"); } + fprintf(stderr, "ncq_cb: -> blk_error_action\n"); blk_error_action(ide_state->blk, action, is_read, -ret); } else { ide_state->status =3D READY_STAT | SEEK_STAT; My guess is I have goofed something up above there in the error handling section of ncq_cb, but "worksforme," so I am not sure what's going on. If no prints show up during testing, maybe turn on AHCI_DEBUG as well and see if any prints are showing up in the success case. --js