From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ryan Harper
Date: Fri, 3 Oct 2008 17:05:27 -0500
Message-Id: <1223071531-31817-1-git-send-email-ryanh@us.ibm.com>
Subject: [Qemu-devel] [PATCH 0/4] Improve emulated scsi write performance
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: qemu-devel@nongnu.org
Cc: Anthony Liguori, Ryan Harper, kvm@vger.kernel.org

With the aio refactoring completed we now have an aio backend that supports
submitting large amounts of io into the host system.  Curiously, write
performance through the scsi device is significantly slower than read
performance.  On our reference setup[1] we see about 10MB/s for writes and
40MB/s for reads.

Digging a bit deeper into the problem revealed that the linux driver was
queueing up to 16 requests into the device.  Enabling debugging in the LSI
device and scsi-disk layers showed that for reads the LSI device would queue
up to 16 commands, but writes were never queued.

In lsi_do_command() the emulation submits the scsi command to the scsi layer
to decode the packet, and the result determines whether the device issues a
read or a write request via the value of n: negative values represent
writes, positive values reads.

    n = s->current_dev->send_command(s->current_dev, s->current_tag, buf,
                                     s->current_lun);
    if (n > 0) {
        lsi_set_phase(s, PHASE_DI);
        s->current_dev->read_data(s->current_dev, s->current_tag);
    } else if (n < 0) {
        lsi_set_phase(s, PHASE_DO);
        s->current_dev->write_data(s->current_dev, s->current_tag);
    }

Subsequently, if the command has not already completed *AND* it was a read
operation, the emulation embarks upon queueing the command and continues to
process incoming requests.  The first thought is to simply change the logic
here to support queueing writes as well.  This is the right idea, but a
couple of other issues prevent this from working.

In lsi_queue_command(), as we're queueing the command we record whether this
operation is a read or a write and store that in p->out:

    /* Add a command to the queue.  */
    static void lsi_queue_command(LSIState *s)
    {
        lsi_queue *p;

        DPRINTF("Queueing tag=0x%x\n", s->current_tag);
        if (s->queue_len == s->active_commands) {
            s->queue_len++;
            s->queue = qemu_realloc(s->queue,
                                    s->queue_len * sizeof(lsi_queue));
        }
        p = &s->queue[s->active_commands++];
        p->tag = s->current_tag;
        p->pending = 0;
        p->out = (s->sstat1 & PHASE_MASK) == PHASE_DO;
    }

If we change the state prior to queueing the command (via lsi_set_phase()),
then when we process the deferred command we end up issuing a dma in the
wrong direction (read versus write).  This isn't an issue when only queueing
reads, since the logic in lsi_reselect() which sets the direction:

    s->msg_action = p->out ? 2 : 3;

defaults to reads (3 = DATA IN) if it isn't a write operation.  However,
once we start queueing write operations this needs to be corrected.  The
first patch in the series adds a dma direction parameter to
lsi_queue_command() since the caller is in a position to know what type of
operation it is queueing.
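
To make the shape of that change concrete, here is a small self-contained
sketch (an illustration only, not the actual patch: the stand-in types, the
queue_command() name, and the int out parameter are assumptions) of queueing
code that takes the dma direction from its caller instead of reading it back
out of the phase bits:

    /* Sketch only: simplified stand-ins for the LSIState code.  The caller
     * knows the direction from send_command()'s return value, so it passes
     * it in explicitly and no phase change is needed before queueing. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        unsigned int tag;
        int pending;
        int out;                 /* 1 = write (DATA OUT), 0 = read (DATA IN) */
    } queued_cmd;

    typedef struct {
        queued_cmd *queue;
        int queue_len;
        int active_commands;
        unsigned int current_tag;
    } lsi_sketch;

    static void queue_command(lsi_sketch *s, int out)
    {
        queued_cmd *p;

        if (s->queue_len == s->active_commands) {
            s->queue_len++;
            s->queue = realloc(s->queue, s->queue_len * sizeof(queued_cmd));
        }
        p = &s->queue[s->active_commands++];
        p->tag = s->current_tag;
        p->pending = 0;
        p->out = out;            /* recorded from the caller, not from sstat1 */
    }

    int main(void)
    {
        lsi_sketch s = { NULL, 0, 0, 0x3 };
        int n = -16384;          /* a decoded write request */

        queue_command(&s, n < 0);    /* n < 0 means a write operation */
        printf("queued tag=0x%x out=%d\n", s.queue[0].tag, s.queue[0].out);
        free(s.queue);
        return 0;
    }

The point is simply that once the direction rides along with the queued
entry, the phase no longer has to be changed before the command is deferred.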

Now that we can queue reads and writes properly, we have to look at how the
scsi layer is handling the read and write operations.  In scsi-disk.c, the
main functions that submit io to the aio subsystem via bdrv_aio_read/write
are scsi_read_data() and scsi_write_data().  Considering the radical
difference in read versus write performance, it was curious to see the code
paths for reads and writes differ so greatly when, ultimately, the code
should be very similar with just a few different values to check.  In
general, we've decoded the scsi request at this point, we know how much to
read/write, etc., and can submit.  scsi_write_data(), however, goes through
some extra hoops to calculate r->buf_len when it can be derived from the
request itself.  I observed the following sequence:

    lsi_do_command()
    scsi_send_command()        /* returns -16384 -- a write operation */
    scsi_write_data()          /* n == 0 since r->buf_len is 0 */
    scsi_write_complete(r, 0)  /* sets buf_len to min(r->sector_count * 512,
                                  SCSI_DMA_BUF_LEN) and calls the driver
                                  completion function */
    lsi_command_complete()     /* we're not done with the transfer, we're not
                                  waiting, and this isn't a different tag, so
                                  we set current_dma_len and mark
                                  command_complete = 1 */
    /* back in lsi_do_command() we see that s->command_complete is set, so we
       do no queueing of writes, change the lsi state to PHASE_DI, and wait
       for the command to complete. */

This sequence effectively serializes all scsi write operations.  Comparing it
to a read operation reveals that the scsi_read_data() path doesn't delay
calculating the request's buf_len, nor does it call the driver completion
function before it has submitted the request into the aio subsystem.

So the fix should be straightforward: change the lsi device to allow
queueing of write operations, change the scsi disk write path to match the
read path, and we should be Golden (TM).  Patches 1-3 implement this and
deliver a good boost to write performance.

In observing the results it was noted that increasing the size of the
requests (16k versus 64k) didn't increase throughput even though we aren't
bound by cpu consumption.  We observed that for both reads and writes to the
scsi device we are breaking up requests that are larger than the default
scsi dma buffer size.  Instead of breaking a larger request up into many
smaller ones, we can re-allocate the buffer and submit the request as-is.
This reduces the number of ios we submit.  This change resulted in a huge
boost in throughput and a lowering of cpu consumption.
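
For reference, here is a small self-contained sketch of that buffer-size
idea (illustrative only, not the actual patch 4: the names, the
SCSI_DMA_BUF_LEN value, and the request size below are assumptions): rather
than clamping each transfer to the fixed dma buffer length, the buffer is
grown to the request size on demand so the whole request goes out as a
single io.

    /* Sketch: compare the old clamp-and-split behaviour with growing the
     * per-request buffer on demand.  SCSI_DMA_BUF_LEN's value here is an
     * assumption for illustration. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define SCSI_DMA_BUF_LEN (64 * 1024)

    typedef struct {
        uint8_t *dma_buf;
        size_t   dma_buf_size;   /* currently allocated size */
        uint64_t sector_count;   /* sectors in this request */
    } scsi_req_sketch;

    /* Old behaviour: each transfer is capped at the fixed buffer size, so a
     * large request becomes several smaller aio requests. */
    static size_t clamped_len(const scsi_req_sketch *r)
    {
        uint64_t len = r->sector_count * 512;
        return len < SCSI_DMA_BUF_LEN ? len : SCSI_DMA_BUF_LEN;
    }

    /* Sketched new behaviour: grow the buffer to fit the whole request and
     * submit it as one io. */
    static size_t whole_request_len(scsi_req_sketch *r)
    {
        size_t len = r->sector_count * 512;
        if (len > r->dma_buf_size) {
            r->dma_buf = realloc(r->dma_buf, len);
            r->dma_buf_size = len;
        }
        return len;
    }

    int main(void)
    {
        scsi_req_sketch r = { NULL, 0, 256 };   /* a 128k request */

        printf("old: %zu bytes per io\n", clamped_len(&r));
        printf("new: %zu bytes in one io\n", whole_request_len(&r));
        free(r.dma_buf);
        return 0;
    }

The trade-off is a larger allocation per in-flight request in exchange for
fewer, bigger aio submissions.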

Below are the results comparing current qemu bits, the first 3 patches (all
the work needed for queueing write commands), and the 4th patch which
reallocates the scsi buffer on-demand for larger requests.  They also show
how the improved scsi emulation performs under qemu and kvm, contrasted with
virtio blk.  As I noted above, the setup is the same as with the aio
subsystem testing.  These patches apply cleanly to both qemu SVN and kvm's
current copy of qemu, and were tested on x86_64.

baremetal baseline:
---------------------------+-------+-------+--------------+------------+
Test scenarios             | bandw | % CPU | ave submit   | ave compl  |
type, block size, iface    | MB/s  | usage | latency usec | latency ms |
---------------------------+-------+-------+--------------+------------+
write, 16k, lvm            | 127.7 |    12 |        11.66 |       9.48 |
read , 16k, lvm            | 171.1 |    15 |        10.69 |       7.09 |
write, 64k, lvm            | 178.4 |     5 |        13.65 |      27.15 |
read , 64k, lvm            | 195.7 |     4 |        12.73 |      24.76 |
---------------------------+-------+-------+--------------+------------+

qemu:
---------------------------+-------+-------+--------------+------------+
Test scenarios             | bandw | % CPU | ave submit   | ave compl  |
type, block size, iface    | MB/s  | usage | latency usec | latency ms |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, no patch |  12.8 |    43 |       188.79 |      84.45 |
read , 16k, scsi, no patch |  41.0 |   100 |       317.39 |      29.09 |
write, 64k, scsi, no patch |  26.6 |    31 |       280.25 |     181.75 |
read , 64k, scsi, no patch | 101.0 |   100 |       542.41 |      46.08 |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-3 |  36.8 |    80 |       219.28 |      32.67 |
read , 16k, scsi, patch1-3 |  40.5 |   100 |       314.80 |      29.51 |
write, 64k, scsi, patch1-3 |  42.7 |    38 |       248.50 |     113.24 |
read , 64k, scsi, patch1-3 | 130.0 |   100 |       405.32 |      35.87 |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-4 |  44.7 |   100 |       284.44 |      26.79 |
read , 16k, scsi, patch1-4 |  40.9 |   100 |       321.81 |      29.24 |
write, 64k, scsi, patch1-4 | 135.0 |   100 |       381.70 |      34.59 |
read , 64k, scsi, patch1-4 | 113.0 |   100 |       488.01 |      41.34 |
---------------------------+-------+-------+--------------+------------+

kvm:
---------------------------+-------+-------+--------------+------------+
Test scenarios             | bandw | % CPU | ave submit   | ave compl  |
type, block size, iface    | MB/s  | usage | latency usec | latency ms |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, no patch |  12.0 |     6 |         5.32 |      97.84 |
read , 16k, scsi, no patch | 104.0 |    28 |        21.99 |      11.34 |
write, 64k, scsi, no patch |  28.0 |     6 |         8.50 |     171.70 |
read , 64k, scsi, no patch | 127.0 |    26 |         9.81 |      37.22 |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-3 |  43.0 |     6 |        12.86 |      28.14 |
read , 16k, scsi, patch1-3 | 103.0 |    30 |        20.64 |      11.43 |
write, 64k, scsi, patch1-3 |  39.5 |     6 |        10.53 |      27.81 |
read , 64k, scsi, patch1-3 | 125.1 |    25 |         9.51 |      37.78 |
---------------------------+-------+-------+--------------+------------+
write, 16k, scsi, patch1-4 | 130.0 |    47 |        12.06 |       9.07 |
read , 16k, scsi, patch1-4 | 155.0 |    50 |        66.99 |       7.57 |
write, 64k, scsi, patch1-4 | 155.0 |    26 |        10.38 |      30.54 |
read , 64k, scsi, patch1-4 | 188.0 |    35 |        10.79 |      25.18 |
---------------------------+-------+-------+--------------+------------+
write, 16k, virtio         | 136.0 |    43 |         9.84 |       8.72 |
read , 16k, virtio         | 182.0 |    44 |         8.79 |       6.49 |
write, 64k, virtio         | 173.0 |    53 |        19.21 |      27.35 |
read , 64k, virtio         | 190.0 |    64 |        17.56 |      24.94 |
---------------------------+-------+-------+--------------+------------+

1. http://lists.gnu.org/archive/html/qemu-devel/2008-09/msg01115.html