[Qemu-devel] QEMU/KVM SCSI lock up

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] QEMU/KVM SCSI lock up
@ 2008-04-03  0:41 Matteo Frigo
  2008-04-03  8:38 ` Avi Kivity
  0 siblings, 1 reply; 5+ messages in thread
From: Matteo Frigo @ 2008-04-03  0:41 UTC (permalink / raw)
  To: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1016 bytes --]

kvm-64 hangs under heavy disk I/O with scsi disks.  To reproduce,
create a fresh qcow2 disk, boot linux, and execute

  dd if=/dev/sdX of=/dev/null bs=1M

on the fresh disk.  See also https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1895893&group_id=180599

I have attached a patch that appears to fix the problem.  The bug
seems to be the following.  scsi_read_data() does the following

    bdrv_aio_read()
    r->sector += n;
    r->sector_count -= n;

For reasons that I do not fully understand, bdrv_aio_read() does
not return immediately, but instead it calls scsi_read_data()
recursively.  Since ``r->sector += n;'' has not been executed
yet, the re-entrant call triggers a read of the same sector, which
breaks the producer-consumer lockstep.  The fix is to swap the operations
as follows:

    r->sector += n;
    r->sector_count -= n;
    bdrv_aio_read()

A similar fix applies to scsi_write_data().

Thanks for developing kvm, it is truly an amazing piece of software.

Regards,
Matteo Frigo


[-- Attachment #2: scsi-patch --]
[-- Type: application/octet-stream, Size: 1375 bytes --]

diff -aur kvm-64.old/qemu/hw/scsi-disk.c kvm-64.new/qemu/hw/scsi-disk.c
--- kvm-64.old/qemu/hw/scsi-disk.c	2008-03-26 08:49:35.000000000 -0400
+++ kvm-64.new/qemu/hw/scsi-disk.c	2008-03-30 08:37:25.000000000 -0400
@@ -196,12 +196,12 @@
         n = SCSI_DMA_BUF_SIZE / 512;
 
     r->buf_len = n * 512;
-    r->aiocb = bdrv_aio_read(s->bdrv, r->sector, r->dma_buf, n,
+    r->sector += n;
+    r->sector_count -= n;
+    r->aiocb = bdrv_aio_read(s->bdrv, r->sector - n, r->dma_buf, n,
                              scsi_read_complete, r);
     if (r->aiocb == NULL)
         scsi_command_complete(r, SENSE_HARDWARE_ERROR);
-    r->sector += n;
-    r->sector_count -= n;
 }
 
 static void scsi_write_complete(void * opaque, int ret)
@@ -248,12 +248,12 @@
         BADF("Data transfer already in progress\n");
     n = r->buf_len / 512;
     if (n) {
-        r->aiocb = bdrv_aio_write(s->bdrv, r->sector, r->dma_buf, n,
+        r->sector += n;
+        r->sector_count -= n;
+        r->aiocb = bdrv_aio_write(s->bdrv, r->sector - n, r->dma_buf, n,
                                   scsi_write_complete, r);
         if (r->aiocb == NULL)
             scsi_command_complete(r, SENSE_HARDWARE_ERROR);
-        r->sector += n;
-        r->sector_count -= n;
     } else {
         /* Invoke completion routine to fetch data from host.  */
         scsi_write_complete(r, 0);

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Qemu-devel] QEMU/KVM SCSI lock up
  2008-04-03  0:41 [Qemu-devel] QEMU/KVM SCSI lock up Matteo Frigo
@ 2008-04-03  8:38 ` Avi Kivity
  2008-04-03 11:18   ` Matteo Frigo
  2008-04-03 11:31   ` Jamie Lokier
  0 siblings, 2 replies; 5+ messages in thread
From: Avi Kivity @ 2008-04-03  8:38 UTC (permalink / raw)
  To: Matteo Frigo; +Cc: qemu-devel

Matteo Frigo wrote:
> kvm-64 hangs under heavy disk I/O with scsi disks.  To reproduce,
> create a fresh qcow2 disk, boot linux, and execute
>
>   dd if=/dev/sdX of=/dev/null bs=1M
>
> on the fresh disk.  See also https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1895893&group_id=180599
>
> I have attached a patch that appears to fix the problem.  The bug
> seems to be the following.  scsi_read_data() does the following
>
>     bdrv_aio_read()
>     r->sector += n;
>     r->sector_count -= n;
>
> For reasons that I do not fully understand, bdrv_aio_read() does
> not return immediately, but instead it calls scsi_read_data()
> recursively.

What happens (I think) is that bdrv_aio_read() completes immediately, 
calls the completion callback, which starts a read for the next batch of 
sectors.


>   Since ``r->sector += n;'' has not been executed
> yet, the re-entrant call triggers a read of the same sector, which
> breaks the producer-consumer lockstep.  The fix is to swap the operations
> as follows:
>
>     r->sector += n;
>     r->sector_count -= n;
>     bdrv_aio_read()
>
> A similar fix applies to scsi_write_data().
>
>   

Will that not issue the read for the wrong sector?

I think the correct fix is to move r->sector and r->sector_count 
adjustment into scsi_read_complete() and scsi_write_complete().

Long term we want to replace the recursion by queuing.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Qemu-devel] QEMU/KVM SCSI lock up
  2008-04-03  8:38 ` Avi Kivity
@ 2008-04-03 11:18   ` Matteo Frigo
  2008-04-03 11:49     ` Avi Kivity
  2008-04-03 11:31   ` Jamie Lokier
  1 sibling, 1 reply; 5+ messages in thread
From: Matteo Frigo @ 2008-04-03 11:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

Avi Kivity <avi@qumranet.com> writes:

> Will that not issue the read for the wrong sector?

The patch that I submitted issues the read for the correct sector:
It reads (r->sector - n) instead of r->sector.  I apologize
if my explanation of the patch was confusing on this point.

> I think the correct fix is to move r->sector and r->sector_count
> adjustment into scsi_read_complete() and scsi_write_complete().

This will work as long as you can guarantee that scsi_read_complete()
is called before the recursive call to scsi_read_data().

Regards,
Matteo Frigo

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Qemu-devel] QEMU/KVM SCSI lock up
  2008-04-03 11:18   ` Matteo Frigo
@ 2008-04-03 11:49     ` Avi Kivity
  0 siblings, 0 replies; 5+ messages in thread
From: Avi Kivity @ 2008-04-03 11:49 UTC (permalink / raw)
  To: Matteo Frigo; +Cc: qemu-devel

Matteo Frigo wrote:
> Avi Kivity <avi@qumranet.com> writes:
>
>   
>> Will that not issue the read for the wrong sector?
>>     
>
> The patch that I submitted issues the read for the correct sector:
> It reads (r->sector - n) instead of r->sector.  I apologize
> if my explanation of the patch was confusing on this point.
>
>   

I misread the patch, sorry.

>> I think the correct fix is to move r->sector and r->sector_count
>> adjustment into scsi_read_complete() and scsi_write_complete().
>>     
>
> This will work as long as you can guarantee that scsi_read_complete()
> is called before the recursive call to scsi_read_data().

Yes, scsi_read_complete() calls s->completion, which triggers the 
recursion, by my reading.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Qemu-devel] QEMU/KVM SCSI lock up
  2008-04-03  8:38 ` Avi Kivity
  2008-04-03 11:18   ` Matteo Frigo
@ 2008-04-03 11:31   ` Jamie Lokier
  1 sibling, 0 replies; 5+ messages in thread
From: Jamie Lokier @ 2008-04-03 11:31 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> >    bdrv_aio_read()
> >    r->sector += n;
> >    r->sector_count -= n;
> >
> >For reasons that I do not fully understand, bdrv_aio_read() does
> >not return immediately, but instead it calls scsi_read_data()
> >recursively.
> 
> What happens (I think) is that bdrv_aio_read() completes immediately, 
> calls the completion callback, which starts a read for the next batch of 
> sectors.
> 
> Long term we want to replace the recursion by queuing.

Ah yes.  Reminds me of a bug in a program of mine with asynchronous
sockets, which sometimes completes immediately.  My async connection
function took a callback argument.  The callback was called
immediately if the connect returned success immediately, otherwise was
queued.

Then I wrote some code like this:

     // some stuff
     my_obj->connector = async_connection (address, my_obj_connect_cb, my_obj);
     // some more stuff

     void my_obj_connect_cb (void * arg)
     {
        MyObj * my_obj = arg;
        do_stuff_with (my_obj->connector);
     }

That worked for several months flawlessly, the code was deployed on
500 units around the country, then one day an async connect happened
to complete fast enough that the callback was called recursively, and
a core dump followed.

I learned a few lessons about async callbacks:

    1. Queue them, even when the async operation completes immediately.
       It's really not worth the "optimisation" of immediate callback,
       especially if the code for queuing is already there.

       Even when immediate callback is documented, mistakes are too
       easy to make when calling it, because the async call pattern
       _looks_ safe, and may work most of the time or always with some
       configurations.

    2. If you really must have recursion, e.g. perhaps the synchronous
       path is common and a hot path, beware that this is prone to
       mistakes in usage.  Check that it's worth the optimisation.
       Then document with words like "warning" the need for special
       care.

    3. If there is immediate callback possible, sometimes it's useful
       API to have a flag argument to enable that, so that always
       queueing is an option, and so the programmer has to think
       about consequences if they set the flag.  E.g.:

          async_id = async_func (args, callback, cb_arg, CB_ALWAYS_QUEUED);
          async_id = async_func (args, callback, cb_arg, CB_MAYBE_IMMEDIATE);

    3. The code pattern "async_id = async_func (args, callback,
       callback_arg)" looks very natural, but it doesn't work when
       there is immediate callback.

       Callback is called before the id can be stored anywhere.  If
       the async_id is still meaningful at the point of callback, the
       API is fundamentally broken.

Just a few notes on async callback API pitfalls I've had...

-- Jamie

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-04-03 11:49 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-04-03  0:41 [Qemu-devel] QEMU/KVM SCSI lock up Matteo Frigo
2008-04-03  8:38 ` Avi Kivity
2008-04-03 11:18   ` Matteo Frigo
2008-04-03 11:49     ` Avi Kivity
2008-04-03 11:31   ` Jamie Lokier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).