From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1JhDWI-0003hn-UH
	for qemu-devel@nongnu.org; Wed, 02 Apr 2008 20:41:35 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1JhDWI-0003hF-1X
	for qemu-devel@nongnu.org; Wed, 02 Apr 2008 20:41:34 -0400
Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1JhDWH-0003h5-N2
	for qemu-devel@nongnu.org; Wed, 02 Apr 2008 20:41:33 -0400
Received: from fftw.vpsland.com ([69.61.62.151] helo=fftw.org)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60) (envelope-from <athena@fftw.org>) id 1JhDWH-000815-Ed
	for qemu-devel@nongnu.org; Wed, 02 Apr 2008 20:41:33 -0400
Received: from pool-96-237-13-71.bstnma.east.verizon.net ([96.237.13.71]
	helo=thinkpad)
	by fftw.org with esmtpsa (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.68) (envelope-from <athena@fftw.org>) id 1JhDWF-0000Dx-1C
	for qemu-devel@nongnu.org; Wed, 02 Apr 2008 20:41:31 -0400
Received: from athena by thinkpad with local (Exim 4.69)
	(envelope-from <athena@fftw.org>) id 1JhDW8-0001Mv-9i
	for qemu-devel@nongnu.org; Wed, 02 Apr 2008 20:41:24 -0400
From: Matteo Frigo <athena@fftw.org>
Date: Wed, 02 Apr 2008 20:41:24 -0400
Message-ID: <87k5jfixcb.fsf@fftw.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Subject: [Qemu-devel] QEMU/KVM SCSI lock up
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: qemu-devel@nongnu.org

--=-=-=

kvm-64 hangs under heavy disk I/O with SCSI disks.  To reproduce,
create a fresh qcow2 disk, boot Linux, and execute

  dd if=/dev/sdX of=/dev/null bs=1M

on the fresh disk.  See also https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1895893&group_id=180599

I have attached a patch that appears to fix the problem.  The bug
seems to be as follows.  scsi_read_data() performs these steps in
this order:

    bdrv_aio_read()
    r->sector += n;
    r->sector_count -= n;

For reasons that I do not fully understand, bdrv_aio_read() does
not always return immediately; instead it can end up calling
scsi_read_data() recursively (presumably through the completion
callback, scsi_read_complete).  Since ``r->sector += n;'' has not
been executed yet, the re-entrant call triggers a second read of the
same sector, which breaks the producer-consumer lockstep.  The fix is
to update the request state before issuing the read:

    r->sector += n;
    r->sector_count -= n;
    bdrv_aio_read()

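A similar fix applies to scsi_write_data().  In both functions the
attached patch passes ``r->sector - n'' to the block-layer call, so
the I/O still starts at the original sector.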
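
To illustrate the ordering problem outside of QEMU, here is a small
toy model.  It is only a sketch under assumptions: struct req,
fake_aio_read(), read_data(), run_demo() and the sync_left counter are
hypothetical names invented for this example and are not QEMU code.
fake_aio_read() stands in for an asynchronous read whose completion
handler runs before the call returns, which is the behaviour described
above.

```c
#include <assert.h>

#define MAX_LOG 8

/* Hypothetical stand-in for the SCSI request state (not QEMU's struct). */
struct req {
    int sector;        /* next sector to transfer */
    int sector_count;  /* sectors still to transfer */
    int fixed;         /* 0 = original ordering, 1 = patched ordering */
    int sync_left;     /* how many reads complete before returning */
    int log[MAX_LOG];  /* start sector of every read issued */
    int nlog;
};

static void read_data(struct req *r);

/* Toy "asynchronous" read.  While sync_left > 0 it runs the completion
 * path, which re-enters read_data(), before returning to its caller. */
static void fake_aio_read(struct req *r, int sector)
{
    assert(r->nlog < MAX_LOG);
    r->log[r->nlog++] = sector;
    if (r->sync_left > 0) {
        r->sync_left--;
        read_data(r);  /* re-entrant call from the completion path */
    }
}

static void read_data(struct req *r)
{
    int n = 1;  /* one sector per call keeps the model small */

    if (r->sector_count == 0)
        return;

    if (r->fixed) {
        /* Patched ordering: update the state, then issue the read. */
        r->sector += n;
        r->sector_count -= n;
        fake_aio_read(r, r->sector - n);
    } else {
        /* Original ordering: the re-entrant call sees stale state. */
        fake_aio_read(r, r->sector);
        r->sector += n;
        r->sector_count -= n;
    }
}

/* Run one request of three sectors where the first read completes
 * synchronously, and return the final state for inspection. */
struct req run_demo(int fixed)
{
    struct req r = {0};
    r.sector_count = 3;
    r.fixed = fixed;
    r.sync_left = 1;
    read_data(&r);
    return r;
}
```

With the original ordering, run_demo(0) logs two reads that both start
at sector 0.  With the patched ordering, run_demo(1) logs sector 0 and
then sector 1, because the nested call already sees the advanced
position.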

Thanks for developing kvm; it is truly an amazing piece of software.

Regards,
Matteo Frigo


--=-=-=
Content-Type: application/octet-stream
Content-Disposition: attachment; filename=scsi-patch

diff -aur kvm-64.old/qemu/hw/scsi-disk.c kvm-64.new/qemu/hw/scsi-disk.c
--- kvm-64.old/qemu/hw/scsi-disk.c	2008-03-26 08:49:35.000000000 -0400
+++ kvm-64.new/qemu/hw/scsi-disk.c	2008-03-30 08:37:25.000000000 -0400
@@ -196,12 +196,12 @@
         n = SCSI_DMA_BUF_SIZE / 512;
 
     r->buf_len = n * 512;
-    r->aiocb = bdrv_aio_read(s->bdrv, r->sector, r->dma_buf, n,
+    r->sector += n;
+    r->sector_count -= n;
+    r->aiocb = bdrv_aio_read(s->bdrv, r->sector - n, r->dma_buf, n,
                              scsi_read_complete, r);
     if (r->aiocb == NULL)
         scsi_command_complete(r, SENSE_HARDWARE_ERROR);
-    r->sector += n;
-    r->sector_count -= n;
 }
 
 static void scsi_write_complete(void * opaque, int ret)
@@ -248,12 +248,12 @@
         BADF("Data transfer already in progress\n");
     n = r->buf_len / 512;
     if (n) {
-        r->aiocb = bdrv_aio_write(s->bdrv, r->sector, r->dma_buf, n,
+        r->sector += n;
+        r->sector_count -= n;
+        r->aiocb = bdrv_aio_write(s->bdrv, r->sector - n, r->dma_buf, n,
                                   scsi_write_complete, r);
         if (r->aiocb == NULL)
             scsi_command_complete(r, SENSE_HARDWARE_ERROR);
-        r->sector += n;
-        r->sector_count -= n;
     } else {
         /* Invoke completion routine to fetch data from host.  */
         scsi_write_complete(r, 0);

--=-=-=--