qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance
@ 2008-10-03 22:05 Ryan Harper
  2008-10-03 22:05 ` [Qemu-devel] [PATCH 1/4] lsi_queue_command: add dma direction parameter Ryan Harper
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Ryan Harper @ 2008-10-03 22:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Ryan Harper, kvm

With the aio refactoring completed we now have an aio backend that supports
submitting large amounts of io to the host system.  Curiously, write
performance through the scsi device is significantly slower than read
performance.  On our reference setup[1], we see about 10MB/s for writes and
40MB/s for reads.  Digging a bit deeper into the problem revealed that the
Linux driver was queueing up to 16 requests to the device.  Enabling debugging
in the LSI device and scsi-disk layers showed that for reads the LSI device
would queue up to 16 commands, but writes were never queued.

In lsi_do_command() the emulation submits the scsi command to the scsi layer
to decode the packet, and the returned value n determines whether the device
issues a read or a write request: negative values represent writes, positive
values reads.

    n = s->current_dev->send_command(s->current_dev, s->current_tag, buf,
                                     s->current_lun);
    if (n > 0) {
        lsi_set_phase(s, PHASE_DI);
        s->current_dev->read_data(s->current_dev, s->current_tag);
    } else if (n < 0) {
        lsi_set_phase(s, PHASE_DO);
        s->current_dev->write_data(s->current_dev, s->current_tag);
    }

Subsequently, if the command has not already completed *AND* it was a read
operation, the emulation queues the command and continues to process incoming
requests.  The first thought is to simply change the logic here to support
queueing writes as well.  This is the right idea, but a couple of other
issues prevent it from working.
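
Sketched roughly, the queueing logic we want in lsi_do_command() has the
following shape (this is only a sketch; patch 2 below has the actual change):

    if (!s->command_complete && n != 0) {
        /* read (n > 0) or write (n < 0) that did not complete
         * immediately: disconnect and queue it, recording the dma
         * direction so the deferred command can resume correctly */
        lsi_add_msg_byte(s, 2); /* SAVE DATA POINTER */
        lsi_add_msg_byte(s, 4); /* DISCONNECT */
        lsi_set_phase(s, PHASE_MI);
        s->msg_action = 1;
        lsi_queue_command(s, n < 0 /* out */);
    }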

In lsi_queue_command(), as we're queueing the command, we record whether the
operation is a read or a write and store that in p->out:

/* Add a command to the queue.  */
static void lsi_queue_command(LSIState *s)
{
    lsi_queue *p;

    DPRINTF("Queueing tag=0x%x\n", s->current_tag);
    if (s->queue_len == s->active_commands) {
        s->queue_len++;
        s->queue = qemu_realloc(s->queue, s->queue_len * sizeof(lsi_queue));
    }
    p = &s->queue[s->active_commands++];
    p->tag = s->current_tag;
    p->pending = 0;
    p->out = (s->sstat1 & PHASE_MASK) == PHASE_DO;
}

If we change the state prior to queueing the command (via lsi_set_phase()),
then when we process the deferred command, we end up issuing a dma in the
wrong direction (read versus write).  This isn't an issue when only queueing
reads, because the logic in lsi_reselect that sets the direction:

 s->msg_action = p->out ? 2 : 3;

defaults to reads (3 = DATA IN) if it isn't a write operation.  However, once
we start queueing write operations, this needs to be corrected.  The first
patch in the series adds a dma direction parameter to lsi_queue_command, since
the caller is in a position to know what type of operation it is queueing.

Now that we can queue reads and writes properly, we have to look at how the
scsi layer handles read and write operations.  In scsi-disk.c, the main
functions that submit io to the aio subsystem via bdrv_aio_read/write are
scsi_read_data/scsi_write_data.  Given the radical difference in read versus
write performance, it was curious to see the code paths for reads and writes
differ so greatly when ultimately they should be very similar, with just a
few different values to check.  In general, we've decoded the scsi request at
this point, we know how much to read or write, and can submit.
scsi_write_data, however, goes through some extra hoops to calculate
r->buf_len when it can be derived from the request itself.  I observed the
following sequence:

  lsi_do_command()
     scsi_send_command()       /* returns -16384 -- write operation */
     scsi_write_data()         /* n == 0 since r->buf_len is 0 */
     scsi_write_complete(r, 0) /* sets buf_len to
                                  min(r->sector_count * 512,
                                      SCSI_DMA_BUF_SIZE), calls the driver
                                  completion function */
     lsi_command_complete()    /* we're not done with the transfer,
                                  we're not waiting, and this isn't
                                  a different tag, so we set
                                  current_dma_len and mark
                                  command_complete = 1 */
     /* back in lsi_do_command() we check if s->command_complete is set,
        it is, so we do no queueing of writes; we change the lsi
        state to PHASE_DI and wait for the command to complete. */

This sequence effectively serializes all scsi write operations.  Comparing it
to a read operation reveals that the scsi_read_data() path neither delays
calculating the request's buf_len nor calls the driver completion function
before it has submitted the request to the aio subsystem.
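
For comparison, the read-side submission, condensed here from scsi_read_data()
in hw/scsi-disk.c (see the diff context in patch 4 below; minor details may
differ), derives buf_len directly from the request and submits immediately:

    n = r->sector_count;
    if (n > SCSI_DMA_BUF_SIZE / 512)
        n = SCSI_DMA_BUF_SIZE / 512;
    r->buf_len = n * 512;
    /* submit the aio read straight away; no completion callback
     * runs before the request is in flight */
    r->aiocb = bdrv_aio_read(s->bdrv, r->sector, r->dma_buf, n,
                             scsi_read_complete, r);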

So, the fix should be straightforward: change the lsi device to allow queueing
of write operations, and change the scsi disk write path to match the read
path, and we should be Golden (TM).

Patches 1-3 implement this and deliver a good boost to write performance.  In
observing the results, we noted that increasing the size of the requests
(16k versus 64k) didn't increase throughput even though we aren't bound by
cpu consumption.

We also observed that for both reads and writes to the scsi device, we break
up requests that are larger than the default scsi dma buffer size.  Instead of
breaking a larger request into many smaller ones, we can reallocate the buffer
and submit the request as-is.  This reduces the number of ios we submit, and
the change resulted in a huge boost in throughput and a drop in cpu
consumption.

Below are the results comparing current qemu bits, the first 3 patches (all
the work needed for queueing write commands), and the 4th patch, which
reallocates the scsi buffer on demand for larger requests; the tables also
contrast the improved scsi emulation with virtio blk.  As noted above, the
setup is the same as for the aio subsystem testing.  These patches apply
cleanly to both qemu SVN and kvm's current copy of qemu, tested on x86_64.


baremetal baseline:
---------------------------+-------+-------+--------------+------------+
Test scenarios             | bandw | % CPU | ave submit   | ave compl  |
type, block size, iface    | MB/s  | usage | latency usec | latency ms |
---------------------------+-------+-------+--------------+------------+
write, 16k, lvm            | 127.7 |  12   |  11.66       |    9.48    |
read , 16k, lvm            | 171.1 |  15   |  10.69       |    7.09    |
write, 64k, lvm            | 178.4 |   5   |  13.65       |   27.15    |
read , 64k, lvm            | 195.7 |   4   |  12.73       |   24.76    |
---------------------------+-------+-------+--------------+------------+

qemu:
---------------------------+-------+-------+--------------+------------+
Test scenarios             | bandw | % CPU | ave submit   | ave compl  |
type, block size, iface    | MB/s  | usage | latency usec | latency ms |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, no patch |  12.8 |  43   | 188.79       |   84.45    |
read , 16k, scsi, no patch |  41.0 | 100   | 317.39       |   29.09    |
write, 64k, scsi, no patch |  26.6 |  31   | 280.25       |  181.75    |
read , 64k, scsi, no patch | 101.0 | 100   | 542.41       |   46.08    |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-3 |  36.8 |  80   | 219.28       |   32.67    |
read , 16k, scsi, patch1-3 |  40.5 | 100   | 314.80       |   29.51    |
write, 64k, scsi, patch1-3 |  42.7 |  38   | 248.50       |  113.24    |
read , 64k, scsi, patch1-3 | 130.0 | 100   | 405.32       |   35.87    |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-4 |  44.7 | 100   | 284.44       |   26.79    |
read , 16k, scsi, patch1-4 |  40.9 | 100   | 321.81       |   29.24    |
write, 64k, scsi, patch1-4 | 135.0 | 100   | 381.70       |   34.59    |
read , 64k, scsi, patch1-4 | 113.0 | 100   | 488.01       |   41.34    |
---------------------------+-------+-------+--------------+------------+


kvm:
---------------------------+-------+-------+--------------+------------+
Test scenarios             | bandw | % CPU | ave submit   | ave compl  |
type, block size, iface    | MB/s  | usage | latency usec | latency ms |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, no patch |  12.0 |   6   |   5.32       |   97.84    |
read , 16k, scsi, no patch | 104.0 |  28   |  21.99       |   11.34    |
write, 64k, scsi, no patch |  28.0 |   6   |   8.50       |  171.70    |
read , 64k, scsi, no patch | 127.0 |  26   |   9.81       |   37.22    |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-3 |  43.0 |   6   |  12.86       |   28.14    |
read , 16k, scsi, patch1-3 | 103.0 |  30   |  20.64       |   11.43    |
write, 64k, scsi, patch1-3 |  39.5 |   6   |  10.53       |   27.81    |
read , 64k, scsi, patch1-3 | 125.1 |  25   |   9.51       |   37.78    |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-4 | 130.0 |  47   |  12.06       |    9.07    |
read , 16k, scsi, patch1-4 | 155.0 |  50   |  66.99       |    7.57    |
write, 64k, scsi, patch1-4 | 155.0 |  26   |  10.38       |   30.54    |
read , 64k, scsi, patch1-4 | 188.0 |  35   |  10.79       |   25.18    |
---------------------------+-------+-------+--------------+------------+
write, 16k, virtio         | 136.0 |  43   |   9.84       |    8.72    |
read , 16k, virtio         | 182.0 |  44   |   8.79       |    6.49    |
write, 64k, virtio         | 173.0 |  53   |  19.21       |   27.35    |
read , 64k, virtio         | 190.0 |  64   |  17.56       |   24.94    |
---------------------------+-------+-------+--------------+------------+

1. http://lists.gnu.org/archive/html/qemu-devel/2008-09/msg01115.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 1/4] lsi_queue_command: add dma direction parameter
  2008-10-03 22:05 [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance Ryan Harper
@ 2008-10-03 22:05 ` Ryan Harper
  2008-10-03 22:05 ` [Qemu-devel] [PATCH 2/4] Refactor lsi_do_command to queue read and write ops Ryan Harper
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Ryan Harper @ 2008-10-03 22:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Ryan Harper, kvm

The patch changes lsi_queue_command to take a parameter indicating whether the
queued command is a read or a write.  This is needed because the lsi device
may change phase before we've recorded the type of operation.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>

diff --git a/hw/lsi53c895a.c b/hw/lsi53c895a.c
index 53a2add..5c161a1 100644
--- a/hw/lsi53c895a.c
+++ b/hw/lsi53c895a.c
@@ -494,7 +494,7 @@ static void lsi_do_dma(LSIState *s, int out)
 
 
 /* Add a command to the queue.  */
-static void lsi_queue_command(LSIState *s)
+static void lsi_queue_command(LSIState *s, int out)
 {
     lsi_queue *p;
 
@@ -506,7 +506,9 @@ static void lsi_queue_command(LSIState *s)
     p = &s->queue[s->active_commands++];
     p->tag = s->current_tag;
     p->pending = 0;
-    p->out = (s->sstat1 & PHASE_MASK) == PHASE_DO;
+    /* the device may change state before we can queue the command;
+     * the caller knows whether the command is input or output */
+    p->out = out;
 }
 
 /* Queue a byte for a MSG IN phase.  */
@@ -654,7 +656,7 @@ static void lsi_do_command(LSIState *s)
             /* wait data */
             lsi_set_phase(s, PHASE_MI);
             s->msg_action = 1;
-            lsi_queue_command(s);
+            lsi_queue_command(s, 0);
         } else {
             /* wait command complete */
             lsi_set_phase(s, PHASE_DI);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 2/4] Refactor lsi_do_command to queue read and write ops
  2008-10-03 22:05 [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance Ryan Harper
  2008-10-03 22:05 ` [Qemu-devel] [PATCH 1/4] lsi_queue_command: add dma direction parameter Ryan Harper
@ 2008-10-03 22:05 ` Ryan Harper
  2008-10-03 22:05 ` [Qemu-devel] [PATCH 3/4] Refactor scsi-disk layer for queue'ing writes Ryan Harper
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Ryan Harper @ 2008-10-03 22:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Ryan Harper, kvm

The patch refactors lsi_do_command to queue both read and write operations;
previously the code only queued read operations.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>


diff --git a/hw/lsi53c895a.c b/hw/lsi53c895a.c
index 5c161a1..5c0535c 100644
--- a/hw/lsi53c895a.c
+++ b/hw/lsi53c895a.c
@@ -649,18 +649,23 @@ static void lsi_do_command(LSIState *s)
     }
 
     if (!s->command_complete) {
-        if (n) {
-            /* Command did not complete immediately so disconnect.  */
-            lsi_add_msg_byte(s, 2); /* SAVE DATA POINTER */
-            lsi_add_msg_byte(s, 4); /* DISCONNECT */
-            /* wait data */
-            lsi_set_phase(s, PHASE_MI);
-            s->msg_action = 1;
-            lsi_queue_command(s, 0);
-        } else {
-            /* wait command complete */
+        if (n == 0) {
+            /* not a read or write command, just wait on completion */
             lsi_set_phase(s, PHASE_DI);
+            return;
         }
+            
+        /* read or write command that did not complete immediately so
+         * inject a disconnect message, and queue the command */
+        lsi_add_msg_byte(s, 2); /* SAVE DATA POINTER */
+        lsi_add_msg_byte(s, 4); /* DISCONNECT */
+        /* wait data */
+        lsi_set_phase(s, PHASE_MI);
+        s->msg_action = 1;
+        if (n > 0) /* read operation */
+            lsi_queue_command(s, 0);
+        else       /* write operation */
+            lsi_queue_command(s, 1);
     }
 }
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 3/4] Refactor scsi-disk layer for queue'ing writes
  2008-10-03 22:05 [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance Ryan Harper
  2008-10-03 22:05 ` [Qemu-devel] [PATCH 1/4] lsi_queue_command: add dma direction parameter Ryan Harper
  2008-10-03 22:05 ` [Qemu-devel] [PATCH 2/4] Refactor lsi_do_command to queue read and write ops Ryan Harper
@ 2008-10-03 22:05 ` Ryan Harper
  2008-10-03 22:05 ` [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed Ryan Harper
  2008-10-05 23:08 ` [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance Ryan Harper
  4 siblings, 0 replies; 18+ messages in thread
From: Ryan Harper @ 2008-10-03 22:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Ryan Harper, kvm

Refactor the scsi write path to match the simpler, faster scsi read path.
There is no need to call into the write completion function to calculate the
request's buf length.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>

diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index 16b3215..f2b4814 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -208,25 +208,14 @@ static void scsi_write_complete(void * opaque, int ret)
 {
     SCSIRequest *r = (SCSIRequest *)opaque;
     SCSIDeviceState *s = r->dev;
-    uint32_t len;
 
     if (ret) {
         fprintf(stderr, "scsi-disc: IO write error\n");
         exit(1);
     }
-
     r->aiocb = NULL;
-    if (r->sector_count == 0) {
-        scsi_command_complete(r, SENSE_NO_SENSE);
-    } else {
-        len = r->sector_count * 512;
-        if (len > SCSI_DMA_BUF_SIZE) {
-            len = SCSI_DMA_BUF_SIZE;
-        }
-        r->buf_len = len;
-        DPRINTF("Write complete tag=0x%x more=%d\n", r->tag, len);
-        s->completion(s->opaque, SCSI_REASON_DATA, r->tag, len);
-    }
+    DPRINTF("Write complete tag=0x%x more=%d\n", r->tag, r->buf_len);
+    s->completion(s->opaque, SCSI_REASON_DATA, r->tag, r->buf_len);
 }
 
 /* Write data to a scsi device.  Returns nonzero on failure.
@@ -235,7 +224,7 @@ static int scsi_write_data(SCSIDevice *d, uint32_t tag)
 {
     SCSIDeviceState *s = d->state;
     SCSIRequest *r;
-    uint32_t n;
+    uint32_t n, len;
 
     DPRINTF("Write data tag=0x%x\n", tag);
     r = scsi_find_request(s, tag);
@@ -244,8 +233,22 @@ static int scsi_write_data(SCSIDevice *d, uint32_t tag)
         scsi_command_complete(r, SENSE_HARDWARE_ERROR);
         return 1;
     }
+    DPRINTF("Write sector_count=%d\n", r->sector_count);
+    if (r->sector_count == 0) {
+        scsi_command_complete(r, SENSE_NO_SENSE);
+        return 0;
+    }
     if (r->aiocb)
         BADF("Data transfer already in progress\n");
+
+    /* we know the len of the request and with the max size of the scsi buffer,
+     * we can calculate buf_len needed */
+    len = r->sector_count * 512;
+    if (len > SCSI_DMA_BUF_SIZE)
+        len = SCSI_DMA_BUF_SIZE;
+    r->buf_len = len;
+
+    /* number of sectors to submit */
     n = r->buf_len / 512;
     if (n) {
         r->aiocb = bdrv_aio_write(s->bdrv, r->sector, r->dma_buf, n,

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-03 22:05 [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance Ryan Harper
                   ` (2 preceding siblings ...)
  2008-10-03 22:05 ` [Qemu-devel] [PATCH 3/4] Refactor scsi-disk layer for queue'ing writes Ryan Harper
@ 2008-10-03 22:05 ` Ryan Harper
  2008-10-03 23:17   ` Paul Brook
  2008-10-05 23:08 ` [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance Ryan Harper
  4 siblings, 1 reply; 18+ messages in thread
From: Ryan Harper @ 2008-10-03 22:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Ryan Harper, kvm

The default buffer size breaks up larger read/write requests unnecessarily.
When we encounter requests larger than the default dma buffer, reallocate the
buffer to support the request.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>


diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index f2b4814..a94f10a 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -192,10 +192,14 @@ static void scsi_read_data(SCSIDevice *d, uint32_t tag)
     }
 
     n = r->sector_count;
-    if (n > SCSI_DMA_BUF_SIZE / 512)
-        n = SCSI_DMA_BUF_SIZE / 512;
-
     r->buf_len = n * 512;
+
+    /* if the request is larger than the default, allocate a larger buffer */
+    if (r->buf_len > SCSI_DMA_BUF_SIZE) {
+        qemu_free(r->dma_buf);
+        r->dma_buf = qemu_memalign(512, r->buf_len);
+    }
+
     r->aiocb = bdrv_aio_read(s->bdrv, r->sector, r->dma_buf, n,
                              scsi_read_complete, r);
     if (r->aiocb == NULL)
@@ -224,7 +228,7 @@ static int scsi_write_data(SCSIDevice *d, uint32_t tag)
 {
     SCSIDeviceState *s = d->state;
     SCSIRequest *r;
-    uint32_t n, len;
+    uint32_t n;
 
     DPRINTF("Write data tag=0x%x\n", tag);
     r = scsi_find_request(s, tag);
@@ -243,10 +247,13 @@ static int scsi_write_data(SCSIDevice *d, uint32_t tag)
 
     /* we know the len of the request and with the max size of the scsi buffer,
      * we can calculate buf_len needed */
-    len = r->sector_count * 512;
-    if (len > SCSI_DMA_BUF_SIZE)
-        len = SCSI_DMA_BUF_SIZE;
-    r->buf_len = len;
+    r->buf_len = r->sector_count * 512;
+
+    /* if the request is larger than the default, allocate a larger buffer */
+    if (r->buf_len > SCSI_DMA_BUF_SIZE) {
+        qemu_free(r->dma_buf);
+        r->dma_buf = qemu_memalign(512, r->buf_len);
+    }
 
     /* number of sectors to submit */
     n = r->buf_len / 512;

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-03 22:05 ` [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed Ryan Harper
@ 2008-10-03 23:17   ` Paul Brook
  2008-10-03 23:35     ` Anthony Liguori
  0 siblings, 1 reply; 18+ messages in thread
From: Paul Brook @ 2008-10-03 23:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Ryan Harper, kvm

On Friday 03 October 2008, Ryan Harper wrote:
> The default buffer size breaks up larger read/write requests unnecessarily.
> When we encounter requests larger than the default dma buffer, reallocate
> the buffer to support the request.

Allocating unboundedly large host buffers based on guest input seems like a 
bad idea.

Paul

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-03 23:17   ` Paul Brook
@ 2008-10-03 23:35     ` Anthony Liguori
  2008-10-04  0:00       ` Paul Brook
  2008-10-04 10:00       ` Avi Kivity
  0 siblings, 2 replies; 18+ messages in thread
From: Anthony Liguori @ 2008-10-03 23:35 UTC (permalink / raw)
  To: Paul Brook; +Cc: Anthony Liguori, Ryan Harper, qemu-devel, kvm

Paul Brook wrote:
> On Friday 03 October 2008, Ryan Harper wrote:
>   
>> The default buffer size breaks up larger read/write requests unnecessarily.
>> When we encounter requests larger than the default dma buffer, reallocate
>> the buffer to support the request.
>>     
>
> Allocating unboundedly large host buffers based on guest input seems like a 
> bad idea.
>   

Perhaps they could be at least bound to phys_ram_size.

In general, I don't think there's a correct size to bound them that's 
less than phys_ram_size.  The guest may be issuing really big IO requests.

Regards,

Anthony Liguori

> Paul

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-03 23:35     ` Anthony Liguori
@ 2008-10-04  0:00       ` Paul Brook
  2008-10-04 10:00       ` Avi Kivity
  1 sibling, 0 replies; 18+ messages in thread
From: Paul Brook @ 2008-10-04  0:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Ryan Harper, kvm

On Saturday 04 October 2008, Anthony Liguori wrote:
> Paul Brook wrote:
> > On Friday 03 October 2008, Ryan Harper wrote:
> >> The default buffer size breaks up larger read/write requests
> >> unnecessarily. When we encounter requests larger than the default dma
> >> buffer, reallocate the buffer to support the request.
> >
> > Allocating unboundedly large host buffers based on guest input seems like
> > a bad idea.
>
> Perhaps they could be at least bound to phys_ram_size.

That's still way too large.  It means that the maximum host footprint of qemu 
is many times the size of the guest RAM. There's a good chance that the host 
machine doesn't even have enough virtual address space to satisfy this 
request.

I expect that the only situation where you can avoid breaking up large 
transfers is when you have zero-copy IO.  Previous zero-copy/vectored IO 
patches suffered from a similar problem: it is not acceptable to allocate huge 
chunks of host ram when you fall back to normal IO.

> In general, I don't think there's a correct size to bound them that's
> less than phys_ram_size.  The guest may be issuing really big IO requests.

Qemu is perfectly capable of handling large IO requests by splitting them into 
multiple smaller requests. Enlarging the size of this buffer is just a 
secondary performance optimisation.

Admittedly we don't currently limit the number of simultaneous commands a 
guest can submit, but that's relatively easy to fix.

Paul

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-03 23:35     ` Anthony Liguori
  2008-10-04  0:00       ` Paul Brook
@ 2008-10-04 10:00       ` Avi Kivity
       [not found]         ` <20081004135749.pphehrhuw9w4gwsc@imap.linux.ibm.com>
  1 sibling, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2008-10-04 10:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Ryan Harper, Paul Brook, kvm

Anthony Liguori wrote:
> Paul Brook wrote:
>> On Friday 03 October 2008, Ryan Harper wrote:
>>  
>>> The default buffer size breaks up larger read/write requests 
>>> unnecessarily.
>>> When we encounter requests larger than the default dma buffer, 
>>> reallocate
>>> the buffer to support the request.
>>>     
>>
>> Allocating unboundedly large host buffers based on guest input seems 
>> like a bad idea.
>>   
>
> Perhaps they could be at least bound to phys_ram_size.
>

So the guest could double the memory load on the host with just one 
request?!

> In general, I don't think there's a correct size to bound them that's 
> less than phys_ram_size.  The guest may be issuing really big IO 
> requests.

The correct fix is not to buffer at all but use scatter-gather.  Until 
this is done buffering has to be bounded.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
       [not found]         ` <20081004135749.pphehrhuw9w4gwsc@imap.linux.ibm.com>
@ 2008-10-04 21:47           ` Ryan Harper
  2008-10-04 22:22             ` Anthony Liguori
                               ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Ryan Harper @ 2008-10-04 21:47 UTC (permalink / raw)
  To: aliguori
  Cc: Anthony Liguori, kvm, qemu-devel, Ryan Harper, Paul Brook,
	Avi Kivity

* aliguori@linux.ibm.com <aliguori@linux.ibm.com> [2008-10-04 12:58]:
> Quoting Avi Kivity <avi@redhat.com>:
> 
> >Anthony Liguori wrote:
> >>In general, I don't think there's a correct size to bound them   
> >>that's less than phys_ram_size.  The guest may be issuing really   
> >>big IO requests.
> >
> >The correct fix is not to buffer at all but use scatter-gather.  Until
> >this is done buffering has to be bounded.
> 
> Instead of capping each request at a certain relatively small size and  
> capping the number of requests, I think if we had a cap that was the  
> total amount of outstanding IO, that would give us good performance  
> while not allowing the guest to do anything crazy.
> 
> For instance, a 4MB pool would allow decent sized requests without  
> letting the guest allocate an absurd amount of memory.
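
For concreteness, such a pool could be as simple as the following sketch; the
names are hypothetical and nothing like this exists in these patches:

    #include <stddef.h>   /* size_t */

    /* hypothetical global cap on outstanding scsi dma buffer memory */
    #define SCSI_DMA_POOL_MAX  (4 * 1024 * 1024)
    static size_t scsi_dma_pool_used;

    static int scsi_dma_pool_reserve(size_t len)
    {
        if (scsi_dma_pool_used + len > SCSI_DMA_POOL_MAX)
            return 0;   /* over the cap: caller must fall back to the
                           small default buffer and split the io */
        scsi_dma_pool_used += len;
        return 1;
    }

    static void scsi_dma_pool_release(size_t len)
    {
        scsi_dma_pool_used -= len;
    }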

I'd rather avoid any additional accounting overhead of a pool.  If 4MB
is a reasonable limit, let's make that the new max.  I can do some
testing to see where we drop off on performance improvements.  We'd
have a default buffer size (smaller than the previous 64k, and now 128k,
buf size) that is used when we allocate scsi requests; scanning through
send_command() provides a good idea of other scsi command buf usage; and
on reads and writes, keep the capping logic we've had all along, but
bump the max size up to something like 4MB -- or whatever test results
show as being ideal.

In all, it seems silly to worry about this sort of thing since the
entire process could be contained with process ulimits if this is really
a concern.  Are we any more concerned that, by splitting the requests
into many smaller ones, we're wasting cpu, pegging the processor at
100% in some cases?
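
For reference, containing the process that way needs nothing qemu-specific;
a minimal sketch using standard POSIX setrlimit (the mechanism behind
"ulimit -v"), not code from this series:

    #include <sys/resource.h>

    /* cap the process's virtual address space so runaway buffer
     * allocations fail instead of growing without bound */
    static int limit_address_space(rlim_t bytes)
    {
        struct rlimit rl = { .rlim_cur = bytes, .rlim_max = bytes };
        return setrlimit(RLIMIT_AS, &rl);
    }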

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-04 21:47           ` Ryan Harper
@ 2008-10-04 22:22             ` Anthony Liguori
  2008-10-05  5:23               ` Avi Kivity
  2008-10-05 23:06               ` Ryan Harper
  2008-10-04 23:00             ` Paul Brook
  2008-10-05  5:29             ` Avi Kivity
  2 siblings, 2 replies; 18+ messages in thread
From: Anthony Liguori @ 2008-10-04 22:22 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Paul Brook, Anthony Liguori, Avi Kivity, kvm, qemu-devel

Ryan Harper wrote:
> I'd rather avoid any additional accounting overhead of a pool.  If 4MB
> is a reasonable limit, lets make that the new max.  I can do some
> testing to see where we drop off on performance improvements.  We'd
> have a default buffer size (smaller than the previous 64, and now 128k
> buf size) that is used when we allocate scsi requests; scanning through
> send_command() provides a good idea of other scsi command buf usage; and
> on reads and writes, keep the capping logic we've had all along, but
> bump the max size up to something like 4MB -- or whatever tests results
> show as being ideal. 
>
> In all, it seems silly to worry about this sort of thing since the
> entire process could be contained with process ulimits if this is really
> a concern.  Are we any more concerned that by splitting the requests
> into many smaller requests that we're wasting cpu, pegging the
> processor to 100% in some cases?
>   

There are two concerns with allowing the guest to allocate arbitrary 
amounts of memory.  The first is that QEMU is not written in such a way 
as to be robust in the face of out-of-memory conditions, so if we run out of 
VA space, an important malloc could fail and we'd fall over.

The second concern is that if a guest can allocate arbitrary amounts of 
memory, it could generate a swap storm.  Unfortunately, AFAIK, Linux is 
not yet to a point where it can deal with swap fairness.  Hopefully this 
is a limitation that the IO controller folks are taking into account.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-04 21:47           ` Ryan Harper
  2008-10-04 22:22             ` Anthony Liguori
@ 2008-10-04 23:00             ` Paul Brook
  2008-10-05  5:29             ` Avi Kivity
  2 siblings, 0 replies; 18+ messages in thread
From: Paul Brook @ 2008-10-04 23:00 UTC (permalink / raw)
  To: Ryan Harper; +Cc: qemu-devel, Anthony Liguori, aliguori, kvm, Avi Kivity

On Saturday 04 October 2008, Ryan Harper wrote:
> In all, it seems silly to worry about this sort of thing since the
> entire process could be contained with process ulimits if this is really
> a concern.  Are we any more concerned that by splitting the requests
> into many smaller requests that we're wasting cpu, pegging the
> processor to 100% in some cases?

Using small requests may be a bit inefficient, but it still works and allows 
the guest to make progress.

Allocating very large quantities of memory is very likely to kill the VM one 
way or another. This is not acceptable, especially when the guest hasn't even 
done anything wrong. There are legitimate circumstances where the size of the 
outstanding IO requests may be comparable to the guest ram size.

Paul

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-04 22:22             ` Anthony Liguori
@ 2008-10-05  5:23               ` Avi Kivity
  2008-10-05 23:06               ` Ryan Harper
  1 sibling, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2008-10-05  5:23 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Anthony Liguori, Ryan Harper, qemu-devel, kvm, Paul Brook

Anthony Liguori wrote:
>
> The second concern is that if a guest can allocate arbitrary amounts 
> of memory, it could generate a swap storm.  Unfortunately, AFAIK, 
> Linux is not yet to a point where it can deal with swap fairness.  
> Hopefully this is a limitation that the IO controller folks are taking 
> into account.

Even if Linux were 100% fair (or rather 100% controlled -- we don't want 
fairness here) in dealing with memory overcommit, we don't want swapping 
just because a guest issued large requests.  Breaking up large requests 
will reduce efficiency (I believe negligibly if the limit is 1MB or 
above); swapping will be much worse.

And again, there's no real reason to allocate memory here, we should be 
doing I/O directly to the guest's scatter/gather buffers.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-04 21:47           ` Ryan Harper
  2008-10-04 22:22             ` Anthony Liguori
  2008-10-04 23:00             ` Paul Brook
@ 2008-10-05  5:29             ` Avi Kivity
  2 siblings, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2008-10-05  5:29 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Paul Brook, Anthony Liguori, aliguori, kvm, qemu-devel

Ryan Harper wrote:
> I'd rather avoid any additional accounting overhead of a pool.

The accounting overhead is noise compared to copying hundreds of 
megabytes per second.


> If 4MB
> is a reasonable limit, lets make that the new max.  

The real max is the dma buffer size multiplied by the number of 
concurrent requests.  With a queue depth of 64, the buffers become 4 MB 
* 64 = 256 MB.  That can double the size of a small guest, and using 
just one disk, too.

> I can do some
> testing to see where we drop off on performance improvements.  We'd
> have a default buffer size (smaller than the previous 64, and now 128k
> buf size) that is used when we allocate scsi requests; scanning through
> send_command() provides a good idea of other scsi command buf usage; and
> on reads and writes, keep the capping logic we've had all along, but
> bump the max size up to something like 4MB -- or whatever tests results
> show as being ideal. 
>   

We know what the ideal is: dropping the scatter/gather buffer completely.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-04 22:22             ` Anthony Liguori
  2008-10-05  5:23               ` Avi Kivity
@ 2008-10-05 23:06               ` Ryan Harper
  2008-10-06  7:27                 ` Avi Kivity
  1 sibling, 1 reply; 18+ messages in thread
From: Ryan Harper @ 2008-10-05 23:06 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Anthony Liguori, kvm, qemu-devel, Ryan Harper, Paul Brook,
	Avi Kivity

* Anthony Liguori <anthony@codemonkey.ws> [2008-10-04 17:28]:
> Ryan Harper wrote:
> >I'd rather avoid any additional accounting overhead of a pool.  If 4MB
> >is a reasonable limit, lets make that the new max.  I can do some
> >testing to see where we drop off on performance improvements.  We'd
> >have a default buffer size (smaller than the previous 64, and now 128k
> >buf size) that is used when we allocate scsi requests; scanning through
> >send_command() provides a good idea of other scsi command buf usage; and
> >on reads and writes, keep the capping logic we've had all along, but
> >bump the max size up to something like 4MB -- or whatever tests results
> >show as being ideal. 
> >
> >In all, it seems silly to worry about this sort of thing since the
> >entire process could be contained with process ulimits if this is really
> >a concern.  Are we any more concerned that by splitting the requests
> >into many smaller requests that we're wasting cpu, pegging the
> >processor to 100% in some cases?
> >  
> 
> There are two concerns with allowing the guest to alloc arbitrary 
> amounts of memory.  The first is that QEMU is not written in such a way 
> to be robust in the face of out-of-memory conditions so if we run out of 
> VA space, an important malloc could fail and we'd fall over.

That is an understandable concern and I don't want to make matters
worse, even if the instability already exists in the code as-is.  I think
I'd like to see this fail in practice before I'm really concerned.  For
64-bit builds, is the VA space an issue?

> 
> The second concern is that if a guest can allocate arbitrary amounts of 
> memory, it could generate a swap storm.  Unfortunately, AFAIK, Linux is 
> not yet to a point where it can deal with swap fairness.  Hopefully this 
> is a limitation that the IO controller folks are taking into account.

Sure, but as I mentioned, the amount of memory it can allocate can
surely be controlled by the host system's per-process ulimit, no?


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance
  2008-10-03 22:05 [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance Ryan Harper
                   ` (3 preceding siblings ...)
  2008-10-03 22:05 ` [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed Ryan Harper
@ 2008-10-05 23:08 ` Ryan Harper
  2008-10-13 16:15   ` Ryan Harper
  4 siblings, 1 reply; 18+ messages in thread
From: Ryan Harper @ 2008-10-05 23:08 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Ryan Harper, kvm


Any comments/review for patches 1-3?  It seems that patch 4 is something
that we need to work out before committing, but I'd like to see patches
1-3 go in since they provide a performance boost in and of themselves.

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] Reallocate dma buffers in read/write path if needed
  2008-10-05 23:06               ` Ryan Harper
@ 2008-10-06  7:27                 ` Avi Kivity
  0 siblings, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2008-10-06  7:27 UTC (permalink / raw)
  To: Ryan Harper; +Cc: qemu-devel, Anthony Liguori, Paul Brook, kvm

Ryan Harper wrote:

 

>> There are two concerns with allowing the guest to alloc arbitrary 
>> amounts of memory.  The first is that QEMU is not written in such a way 
>> to be robust in the face of out-of-memory conditions so if we run out of 
>> VA space, an important malloc could fail and we'd fall over.
>>     
>
> That is an understandable concern and I don't want to make matters
> worse, even if the instability already exists in the code as-is.  I think
> I'd like to see this fail in practice before I'm really concerned.  For
> 64-bit builds, is the VA space an issue?
>
>   

VA space is virtually (groan) unlimited on x86_64.  But note that for 
every kilobyte of virtual address space you allocate, you consume half a 
byte of nonpageable kernel memory.

>> The second concern is that if a guest can allocate arbitrary amounts of 
>> memory, it could generate a swap storm.  Unfortunately, AFAIK, Linux is 
>> not yet to a point where it can deal with swap fairness.  Hopefully this 
>> is a limitation that the IO controller folks are taking into account.
>>     
>
> Sure, but as I mentioned, the amount of memory it can allocate can
> surely be controlled by the host system per-process ulimit no?
>   

So if the guest issues some large I/Os in parallel, we kill it?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance
  2008-10-05 23:08 ` [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance Ryan Harper
@ 2008-10-13 16:15   ` Ryan Harper
  0 siblings, 0 replies; 18+ messages in thread
From: Ryan Harper @ 2008-10-13 16:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, kvm

* Ryan Harper <ryanh@us.ibm.com> [2008-10-05 18:08]:
> 
> Any comments/review for patches 1-3?  It seems that patch 4 is something
> that we need to work out before committing, but I'd like to see patches
> 1-3 go in since they provide a performance boost in and of themselves.

Any movement here on getting 1-3 in tree?

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 18+ messages in thread
