From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ryan Harper
Date: Fri, 3 Oct 2008 17:05:27 -0500
Message-Id: <1223071531-31817-1-git-send-email-ryanh@us.ibm.com>
Subject: [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: qemu-devel@nongnu.org
Cc: Anthony Liguori, Ryan Harper, kvm@vger.kernel.org

With the aio refactoring completed we now have an aio backend that supports
submitting large amounts of io into the host system.  Curiously, write
performance through the scsi device is significantly slower than read
performance.  On our reference setup[1] we see about 10MB/s for writes and
40MB/s for reads.

Digging a bit deeper into the problem revealed that the linux driver was
queueing up to 16 requests into the device.  Enabling debugging in the LSI
device and scsi-disk layers showed that for reads the LSI device would queue
up to 16 commands, but writes were never queued.

In lsi_do_command() the emulation submits the scsi command to the scsi layer
to decode the packet, and the result determines whether the device issues a
read or a write request via the value of n: negative values represent
writes, positive values reads.

    n = s->current_dev->send_command(s->current_dev, s->current_tag, buf,
                                     s->current_lun);
    if (n > 0) {
        lsi_set_phase(s, PHASE_DI);
        s->current_dev->read_data(s->current_dev, s->current_tag);
    } else if (n < 0) {
        lsi_set_phase(s, PHASE_DO);
        s->current_dev->write_data(s->current_dev, s->current_tag);
    }

Subsequently, if the command has not already completed *AND* it was a read
operation, the emulation embarks upon queueing the command and continues to
process incoming requests.  The first thought is to simply change the logic
here to support queueing writes as well.  This is the right idea, but a
couple of other issues prevent this from working.

In lsi_queue_command(), as we're queueing the command we record whether this
operation is a read or a write and store that in p->out:

    /* Add a command to the queue.  */
    static void lsi_queue_command(LSIState *s)
    {
        lsi_queue *p;

        DPRINTF("Queueing tag=0x%x\n", s->current_tag);
        if (s->queue_len == s->active_commands) {
            s->queue_len++;
            s->queue = qemu_realloc(s->queue,
                                    s->queue_len * sizeof(lsi_queue));
        }
        p = &s->queue[s->active_commands++];
        p->tag = s->current_tag;
        p->pending = 0;
        p->out = (s->sstat1 & PHASE_MASK) == PHASE_DO;
    }

If we change the state prior to queueing the command (via lsi_set_phase()),
then when we process the deferred command we end up issuing a dma in the
wrong direction (read versus write).  This isn't an issue when only queueing
reads, since the logic in lsi_reselect() which sets the direction:

    s->msg_action = p->out ? 2 : 3;

defaults to reads (3 = DATA IN) if it isn't a write operation.  However,
once we start queueing write operations this needs to be corrected.  The
first patch in the series adds a dma direction parameter to
lsi_queue_command() since the caller is in a position to know what type of
operation it is queueing.
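
To make the shape of that change concrete, here is a small self-contained
sketch (an illustration only, not the actual patch: the stand-in types, the
queue_command() name, and the int out parameter are assumptions) of queueing
code that takes the dma direction from its caller instead of reading it back
out of the phase bits:

    /* Sketch only: simplified stand-ins for the LSIState code.  The caller
     * knows the direction from send_command()'s return value, so it passes
     * it in explicitly and no phase change is needed before queueing. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        unsigned int tag;
        int pending;
        int out;                 /* 1 = write (DATA OUT), 0 = read (DATA IN) */
    } queued_cmd;

    typedef struct {
        queued_cmd *queue;
        int queue_len;
        int active_commands;
        unsigned int current_tag;
    } lsi_sketch;

    static void queue_command(lsi_sketch *s, int out)
    {
        queued_cmd *p;

        if (s->queue_len == s->active_commands) {
            s->queue_len++;
            s->queue = realloc(s->queue, s->queue_len * sizeof(queued_cmd));
        }
        p = &s->queue[s->active_commands++];
        p->tag = s->current_tag;
        p->pending = 0;
        p->out = out;            /* recorded from the caller, not from sstat1 */
    }

    int main(void)
    {
        lsi_sketch s = { NULL, 0, 0, 0x3 };
        int n = -16384;          /* a decoded write request */

        queue_command(&s, n < 0);    /* n < 0 means a write operation */
        printf("queued tag=0x%x out=%d\n", s.queue[0].tag, s.queue[0].out);
        free(s.queue);
        return 0;
    }

The point is simply that once the direction rides along with the queued
entry, the phase no longer has to be changed before the command is deferred.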

Now that we can queue reads and writes properly, we have to look at how the
scsi layer is handling the read and write operations.  In scsi-disk.c, the
main functions that submit io to the aio subsystem via bdrv_aio_read/write
are scsi_read_data() and scsi_write_data().  Considering the radical
difference in read versus write performance, it was curious to see the code
paths for reads and writes differ so greatly when, ultimately, the code
should be very similar with just a few different values to check.  In
general, we've decoded the scsi request at this point, we know how much to
read/write, etc., and can submit.  scsi_write_data(), however, goes through
some extra hoops to calculate r->buf_len when it can be derived from the
request itself.  I observed the following sequence:

    lsi_do_command()
    scsi_send_command()        /* returns -16384 -- a write operation */
    scsi_write_data()          /* n == 0 since r->buf_len is 0 */
    scsi_write_complete(r, 0)  /* sets buf_len to min(r->sector_count * 512,
                                  SCSI_DMA_BUF_LEN) and calls the driver
                                  completion function */
    lsi_command_complete()     /* we're not done with the transfer, we're not
                                  waiting, and this isn't a different tag, so
                                  we set current_dma_len and mark
                                  command_complete = 1 */
    /* back in lsi_do_command() we see that s->command_complete is set, so we
       do no queueing of writes, change the lsi state to PHASE_DI, and wait
       for the command to complete. */

This sequence effectively serializes all scsi write operations.  Comparing it
to a read operation reveals that the scsi_read_data() path doesn't delay
calculating the request's buf_len, nor does it call the driver completion
function before it has submitted the request into the aio subsystem.

So the fix should be straightforward: change the lsi device to allow
queueing of write operations, change the scsi disk write path to match the
read path, and we should be Golden (TM).  Patches 1-3 implement this and
deliver a good boost to write performance.

In observing the results it was noted that increasing the size of the
requests (16k versus 64k) didn't increase throughput even though we aren't
bound by cpu consumption.  We observed that for both reads and writes to the
scsi device we are breaking up requests that are larger than the default
scsi dma buffer size.  Instead of breaking a larger request up into many
smaller ones, we can re-allocate the buffer and submit the request as-is.
This reduces the number of ios we submit.  This change resulted in a huge
boost in throughput and a lowering of cpu consumption.
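
For reference, here is a small self-contained sketch of that buffer-size
idea (illustrative only, not the actual patch 4: the names, the
SCSI_DMA_BUF_LEN value, and the request size below are assumptions): rather
than clamping each transfer to the fixed dma buffer length, the buffer is
grown to the request size on demand so the whole request goes out as a
single io.

    /* Sketch: compare the old clamp-and-split behaviour with growing the
     * per-request buffer on demand.  SCSI_DMA_BUF_LEN's value here is an
     * assumption for illustration. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define SCSI_DMA_BUF_LEN (64 * 1024)

    typedef struct {
        uint8_t *dma_buf;
        size_t   dma_buf_size;   /* currently allocated size */
        uint64_t sector_count;   /* sectors in this request */
    } scsi_req_sketch;

    /* Old behaviour: each transfer is capped at the fixed buffer size, so a
     * large request becomes several smaller aio requests. */
    static size_t clamped_len(const scsi_req_sketch *r)
    {
        uint64_t len = r->sector_count * 512;
        return len < SCSI_DMA_BUF_LEN ? len : SCSI_DMA_BUF_LEN;
    }

    /* Sketched new behaviour: grow the buffer to fit the whole request and
     * submit it as one io. */
    static size_t whole_request_len(scsi_req_sketch *r)
    {
        size_t len = r->sector_count * 512;
        if (len > r->dma_buf_size) {
            r->dma_buf = realloc(r->dma_buf, len);
            r->dma_buf_size = len;
        }
        return len;
    }

    int main(void)
    {
        scsi_req_sketch r = { NULL, 0, 256 };   /* a 128k request */

        printf("old: %zu bytes per io\n", clamped_len(&r));
        printf("new: %zu bytes in one io\n", whole_request_len(&r));
        free(r.dma_buf);
        return 0;
    }

The trade-off is a larger allocation per in-flight request in exchange for
fewer, bigger aio submissions.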

Below are the results comparing current qemu bits, the first 3 patches (all
the work needed for queueing write commands), and the 4th patch which
reallocates the scsi buffer on-demand for larger requests.  They also show
how the improved scsi emulation performs under qemu and kvm, contrasted with
virtio blk.  As I noted above, the setup is the same as with the aio
subsystem testing.  These patches apply cleanly to both qemu SVN and kvm's
current copy of qemu, and were tested on x86_64.

baremetal baseline:
---------------------------+-------+-------+--------------+------------+
Test scenarios             | bandw | % CPU | ave submit   | ave compl  |
type, block size, iface    | MB/s  | usage | latency usec | latency ms |
---------------------------+-------+-------+--------------+------------+
write, 16k, lvm            | 127.7 |    12 |        11.66 |       9.48 |
read , 16k, lvm            | 171.1 |    15 |        10.69 |       7.09 |
write, 64k, lvm            | 178.4 |     5 |        13.65 |      27.15 |
read , 64k, lvm            | 195.7 |     4 |        12.73 |      24.76 |
---------------------------+-------+-------+--------------+------------+

qemu:
---------------------------+-------+-------+--------------+------------+
Test scenarios             | bandw | % CPU | ave submit   | ave compl  |
type, block size, iface    | MB/s  | usage | latency usec | latency ms |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, no patch |  12.8 |    43 |       188.79 |      84.45 |
read , 16k, scsi, no patch |  41.0 |   100 |       317.39 |      29.09 |
write, 64k, scsi, no patch |  26.6 |    31 |       280.25 |     181.75 |
read , 64k, scsi, no patch | 101.0 |   100 |       542.41 |      46.08 |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-3 |  36.8 |    80 |       219.28 |      32.67 |
read , 16k, scsi, patch1-3 |  40.5 |   100 |       314.80 |      29.51 |
write, 64k, scsi, patch1-3 |  42.7 |    38 |       248.50 |     113.24 |
read , 64k, scsi, patch1-3 | 130.0 |   100 |       405.32 |      35.87 |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-4 |  44.7 |   100 |       284.44 |      26.79 |
read , 16k, scsi, patch1-4 |  40.9 |   100 |       321.81 |      29.24 |
write, 64k, scsi, patch1-4 | 135.0 |   100 |       381.70 |      34.59 |
read , 64k, scsi, patch1-4 | 113.0 |   100 |       488.01 |      41.34 |
---------------------------+-------+-------+--------------+------------+

kvm:
---------------------------+-------+-------+--------------+------------+
Test scenarios             | bandw | % CPU | ave submit   | ave compl  |
type, block size, iface    | MB/s  | usage | latency usec | latency ms |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, no patch |  12.0 |     6 |         5.32 |      97.84 |
read , 16k, scsi, no patch | 104.0 |    28 |        21.99 |      11.34 |
write, 64k, scsi, no patch |  28.0 |     6 |         8.50 |     171.70 |
read , 64k, scsi, no patch | 127.0 |    26 |         9.81 |      37.22 |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-3 |  43.0 |     6 |        12.86 |      28.14 |
read , 16k, scsi, patch1-3 | 103.0 |    30 |        20.64 |      11.43 |
write, 64k, scsi, patch1-3 |  39.5 |     6 |        10.53 |      27.81 |
read , 64k, scsi, patch1-3 | 125.1 |    25 |         9.51 |      37.78 |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-4 | 130.0 |    47 |        12.06 |       9.07 |
read , 16k, scsi, patch1-4 | 155.0 |    50 |        66.99 |       7.57 |
write, 64k, scsi, patch1-4 | 155.0 |    26 |        10.38 |      30.54 |
read , 64k, scsi, patch1-4 | 188.0 |    35 |        10.79 |      25.18 |
---------------------------+-------+-------+--------------+------------+
write, 16k, virtio         | 136.0 |    43 |         9.84 |       8.72 |
read , 16k, virtio         | 182.0 |    44 |         8.79 |       6.49 |
write, 64k, virtio         | 173.0 |    53 |        19.21 |      27.35 |
read , 64k, virtio         | 190.0 |    64 |        17.56 |      24.94 |
---------------------------+-------+-------+--------------+------------+

1. http://lists.gnu.org/archive/html/qemu-devel/2008-09/msg01115.html