From: Anthony Liguori <aliguori@us.ibm.com>
To: Ryan Harper <ryanh@us.ibm.com>
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org
Subject: Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations
Date: Mon, 22 Sep 2008 22:32:49 -0500
Message-ID: <48D86361.4000405@us.ibm.com>
In-Reply-To: <1222125454-21744-1-git-send-email-ryanh@us.ibm.com>
Can you run the same performance tests with the following patches (using
sync=on instead of cache=off)?
You'll need my aio_init fix too. I suspect this will perform as well as
your patch set. That's not to say your patch set isn't useful, but I would
like to improve performance for the case where we go through the page
cache.
Regards,
Anthony Liguori
Ryan Harper wrote:
> The patchset adds additional AIO driver abstraction to the block raw driver to
> support multiple aio implementations for each device. The first patch pulls
> the posix aio implementation out of the block-raw device using a generic call
> to the newly created AIO Driver structure. The posix aio implementation that
> was contained in block-raw-posix.c has been refactored into aio-posix.c. The
> next patch adds a linux aio implementation for raw devices being opened
> O_DIRECT via cache=off drive option. We only use linux aio when cache=off as
> linux aio falls back to synchronous ops if not opened with O_DIRECT flag.
>
> Additional work has been done on top of QEMU for KVM and virtio-blk devices.
> While virtio-blk is not yet upstream in QEMU, the AIO changes here provide a
> tremendous performance improvement (from 7.6% to 100% of native for randwrite,
> and from 3.9% to 101.4% of native for seq write) for virtio
> devices with cache=off.
>
> Storage subsystem:
> IBM EXP300 - 14 Disk Fiber Expansion, 17G - 15K RPMS
> Host: AMD Barcelona, 2 socket, 8G RAM
> HBA: QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA (rev 02)
>
> Benchmark[1]:
> fio --name=guestrun --filename=/dev/mapper/volumes-fibre \
> --rw=randwrite --bs=16k --ioengine=libaio --direct=1 \
> --norandommap --runtime=120 --time_based --numjobs=1 \
> --group_reporting --thread --size=25g --write_lat_log \
> --write_bw_log --iodepth=74
>
> Qemu parameters:
> -m 1024 \
> -drive file=/images/npt2-guest-virtio.qcow2,if=ide,boot=on,snapshot=off \
> -drive file=/dev/mapper/volumes-fibre,if=virtio,cache=(on|off) \
> -drive file=/dev/mapper/volumes-npt2--dom1,if=virtio,cache=off \
> -net nic,macaddr=00:FF:FF:00:00:01,model=rtl8139 -net tap -vnc :123 \
> -monitor stdio
>
> Guest io scheduler: noop
>
> Results:
> These results are with the patch series applied to KVM (plus a small KVM only
> change -- KVM patches forthcoming).
>
>
> 16k randwrite 1 thread, 74 iodepth  | MB/s  | avg sub lat (us) | avg comp lat (ms)
> ------------------------------------+-------+------------------+------------------
> baremetal (O_DIRECT, aka cache=off) |  61.2 |            13.07 |             19.59
> kvm: cache=off posix-aio w/o patch  |   4.7 |          3467.44 |            254.08
> kvm: cache=off linux-aio            |  61.1 |            75.35 |             19.57
> kvm: cache=on posix-aio w/o patch   | 127.0 |           115.78 |              9.19
> kvm: cache=on posix-aio w/ patch    | 126.0 |            67.35 |              9.30
>
>
> 16k write 1 thread, 74 iodepth      | MB/s  | avg sub lat (us) | avg comp lat (ms)
> ------------------------------------+-------+------------------+------------------
> baremetal (O_DIRECT, aka cache=off) | 128.1 |            10.90 |              9.45
> kvm: cache=off posix-aio w/o patch  |   5.1 |          3152.00 |            231.06
> kvm: cache=off linux-aio            | 130.0 |            83.83 |              8.99
> kvm: cache=on posix-aio w/o patch   | 184.0 |            80.46 |              6.35
> kvm: cache=on posix-aio w/ patch    | 165.0 |            70.90 |              7.09
>
>
> 1. http://brick.kernel.dk/snaps/fio-1.21.tar.bz2
>
[-- Attachment #2: sync.patch --]
[-- Type: text/x-patch, Size: 2779 bytes --]
Index: vl.c
===================================================================
--- vl.c (revision 5300)
+++ vl.c (working copy)
@@ -5439,12 +5439,13 @@
int max_devs;
int index;
int cache;
+ int sync;
int bdrv_flags;
char *str = arg->opt;
static const char * const params[] = { "bus", "unit", "if", "index",
"cyls", "heads", "secs", "trans",
"media", "snapshot", "file",
- "cache", "format", NULL };
+ "cache", "format", "sync", NULL };
if (check_params(buf, sizeof(buf), params, str) < 0) {
fprintf(stderr, "qemu: unknown parameter '%s' in '%s'\n",
@@ -5459,6 +5460,7 @@
translation = BIOS_ATA_TRANSLATION_AUTO;
index = -1;
cache = 1;
+ sync = 0;
if (!strcmp(machine->name, "realview") ||
!strcmp(machine->name, "SS-5") ||
@@ -5612,6 +5614,17 @@
}
}
+ if (get_param_value(buf, sizeof(buf), "sync", str)) {
+ if (!strcmp(buf, "off"))
+ sync = 0;
+ else if (!strcmp(buf, "on"))
+ sync = 1;
+ else {
+ fprintf(stderr, "qemu: invalid sync option\n");
+ return -1;
+ }
+ }
+
if (get_param_value(buf, sizeof(buf), "format", str)) {
if (strcmp(buf, "?") == 0) {
fprintf(stderr, "qemu: Supported formats:");
@@ -5728,6 +5741,8 @@
bdrv_flags |= BDRV_O_SNAPSHOT;
if (!cache)
bdrv_flags |= BDRV_O_DIRECT;
+ if (sync)
+ bdrv_flags |= BDRV_O_SYNC;
if (bdrv_open2(bdrv, file, bdrv_flags, drv) < 0 || qemu_key_check(bdrv, file)) {
fprintf(stderr, "qemu: could not open disk image %s\n",
file);
Index: block-raw-posix.c
===================================================================
--- block-raw-posix.c (revision 5304)
+++ block-raw-posix.c (working copy)
@@ -127,6 +127,8 @@
if (flags & BDRV_O_DIRECT)
open_flags |= O_DIRECT;
#endif
+ if (flags & BDRV_O_SYNC)
+ open_flags |= O_SYNC;
s->type = FTYPE_FILE;
@@ -937,6 +939,8 @@
if (flags & BDRV_O_DIRECT)
open_flags |= O_DIRECT;
#endif
+ if (flags & BDRV_O_SYNC)
+ open_flags |= O_SYNC;
s->type = FTYPE_FILE;
#if defined(__linux__)
Index: block.h
===================================================================
--- block.h (revision 5300)
+++ block.h (working copy)
@@ -48,6 +48,7 @@
it (default for
bdrv_file_open()) */
#define BDRV_O_DIRECT 0x0020
+#define BDRV_O_SYNC 0x0040
void bdrv_info(void);
void bdrv_info_stats(void);
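
For readers skimming sync.patch: the net effect is that sync=on sets
BDRV_O_SYNC, which the raw driver turns into O_SYNC at open(2) time, next to
the existing cache=off -> BDRV_O_DIRECT -> O_DIRECT path. A minimal
standalone sketch of that mapping follows. The helper name raw_open_flags()
and the O_RDWR starting value are assumptions made for illustration and are
not part of the patch; the two BDRV_O_* values are copied from the block.h
hunk.

```c
#define _GNU_SOURCE              /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <fcntl.h>

/* Flag values as they appear in the block.h hunk. */
#define BDRV_O_DIRECT 0x0020
#define BDRV_O_SYNC   0x0040

/*
 * Hypothetical helper (not in the patch): translate the two BDRV_O_* bits
 * discussed here into open(2) flags. O_RDWR as the base is an assumption.
 */
static int raw_open_flags(int bdrv_flags)
{
    int open_flags = O_RDWR;

    /* This sketch only understands the two bits below. */
    assert((bdrv_flags & ~(BDRV_O_DIRECT | BDRV_O_SYNC)) == 0);

#ifdef O_DIRECT
    if (bdrv_flags & BDRV_O_DIRECT)
        open_flags |= O_DIRECT;  /* cache=off: bypass the page cache */
#endif
    if (bdrv_flags & BDRV_O_SYNC)
        open_flags |= O_SYNC;    /* sync=on: writes return once on stable storage */

    return open_flags;
}
```

With this shape, sync=on keeps I/O going through the page cache while making
each write synchronous, which is the case the message above asks to have
benchmarked.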
[-- Attachment #3: posix-aio-dup.patch --]
[-- Type: text/x-patch, Size: 5012 bytes --]
Index: block-raw-posix.c
===================================================================
--- block-raw-posix.c (revision 5304)
+++ block-raw-posix.c (working copy)
@@ -84,10 +84,20 @@
reopen it to see if the disk has been changed */
#define FD_OPEN_TIMEOUT 1000
+#define RAW_FD_POOL_SIZE 4
+
+typedef struct RawFdPoolEntry
+{
+ int fd;
+ int inuse;
+} RawFdPoolEntry;
+
typedef struct BDRVRawState {
int fd;
int type;
unsigned int lseek_err_cnt;
+ int fd0_inuse;
+ RawFdPoolEntry fd_pool[RAW_FD_POOL_SIZE];
#if defined(__linux__)
/* linux floppy specific */
int fd_open_flags;
@@ -109,6 +119,7 @@
{
BDRVRawState *s = bs->opaque;
int fd, open_flags, ret;
+ int i;
posix_aio_init();
@@ -138,6 +149,11 @@
return ret;
}
s->fd = fd;
+ for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+ s->fd_pool[i].fd = -1;
+ s->fd_pool[i].inuse = 0;
+ }
+ s->fd0_inuse = 0;
#if defined(O_DIRECT)
s->aligned_buf = NULL;
if (flags & BDRV_O_DIRECT) {
@@ -436,6 +452,7 @@
typedef struct RawAIOCB {
BlockDriverAIOCB common;
+ int fd;
struct aiocb aiocb;
struct RawAIOCB *next;
int ret;
@@ -447,6 +464,52 @@
RawAIOCB *first_aio;
} PosixAioState;
+static int raw_fd_pool_get(BDRVRawState *s)
+{
+ if (s->fd0_inuse) {
+ int i;
+
+ for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+ if (s->fd_pool[i].fd == -1) {
+ s->fd_pool[i].fd = dup(s->fd);
+ if (s->fd_pool[i].fd == -1)
+ continue;
+ s->fd_pool[i].inuse = 0;
+ }
+
+ if (!s->fd_pool[i].inuse) {
+ s->fd_pool[i].inuse++;
+ return s->fd_pool[i].fd;
+ }
+ }
+ }
+ s->fd0_inuse++;
+
+ return s->fd;
+}
+
+static void raw_fd_pool_put(RawAIOCB *acb)
+{
+ BDRVRawState *s = acb->common.bs->opaque;
+ int fd = acb->fd;
+ int i;
+
+ if (s->fd == fd) {
+ s->fd0_inuse--;
+ return;
+ }
+
+ for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+ if (s->fd_pool[i].fd == fd) {
+ s->fd_pool[i].inuse--;
+ if (s->fd_pool[i].inuse == 0) {
+ close(s->fd_pool[i].fd);
+ s->fd_pool[i].fd = -1;
+ }
+ }
+ }
+}
+
static void posix_aio_read(void *opaque)
{
PosixAioState *s = opaque;
@@ -487,6 +550,7 @@
if (ret == ECANCELED) {
/* remove the request */
*pacb = acb->next;
+ raw_fd_pool_put(acb);
qemu_aio_release(acb);
} else if (ret != EINPROGRESS) {
/* end of aio */
@@ -503,6 +567,7 @@
*pacb = acb->next;
/* call the callback */
acb->common.cb(acb->common.opaque, ret);
+ raw_fd_pool_put(acb);
qemu_aio_release(acb);
break;
} else {
@@ -575,7 +640,8 @@
acb = qemu_aio_get(bs, cb, opaque);
if (!acb)
return NULL;
- acb->aiocb.aio_fildes = s->fd;
+ acb->fd = raw_fd_pool_get(s);
+ acb->aiocb.aio_fildes = acb->fd;
acb->aiocb.aio_sigevent.sigev_signo = SIGUSR2;
acb->aiocb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
acb->aiocb.aio_buf = buf;
@@ -682,6 +748,7 @@
break;
} else if (*pacb == acb) {
*pacb = acb->next;
+ raw_fd_pool_put(acb);
qemu_aio_release(acb);
break;
}
@@ -698,6 +765,7 @@
static void raw_close(BlockDriverState *bs)
{
BDRVRawState *s = bs->opaque;
+ int i;
if (s->fd >= 0) {
close(s->fd);
s->fd = -1;
@@ -706,6 +774,10 @@
qemu_free(s->aligned_buf);
#endif
}
+ for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+ if (s->fd_pool[i].fd != -1)
+ close(s->fd_pool[i].fd);
+ }
}
static int raw_truncate(BlockDriverState *bs, int64_t offset)
@@ -973,6 +1045,18 @@
}
#if defined(__linux__)
+static void raw_invalidate_fd_pool(BDRVRawState *s)
+{
+ int i;
+ for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+ if (s->fd_pool[i].fd != -1) {
+ close(s->fd_pool[i].fd);
+ s->fd_pool[i].fd = -1;
+ s->fd_pool[i].inuse = 0;
+ }
+ }
+ s->fd0_inuse = 0;
+}
/* Note: we do not have a reliable method to detect if the floppy is
present. The current method is to try to open the floppy at every
@@ -989,6 +1073,7 @@
(qemu_get_clock(rt_clock) - s->fd_open_time) >= FD_OPEN_TIMEOUT) {
close(s->fd);
s->fd = -1;
+ raw_invalidate_fd_pool(s);
#ifdef DEBUG_FLOPPY
printf("Floppy closed\n");
#endif
@@ -1089,6 +1174,7 @@
if (s->fd >= 0) {
close(s->fd);
s->fd = -1;
+ raw_invalidate_fd_pool(s);
}
fd = open(bs->filename, s->fd_open_flags | O_NONBLOCK);
if (fd >= 0) {
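
The descriptor-pool logic in posix-aio-dup.patch can be read in isolation as
the sketch below. The apparent intent (my reading, the message does not spell
it out) is to hand each in-flight AIO request its own dup()ed descriptor so
that concurrent requests are not all queued behind a single fd. The type and
function names here (FdPool, fd_pool_init, fd_pool_get, fd_pool_put) are
hypothetical and chosen for the example; only the pool size of 4 and the
get/put behaviour are taken from the patch.

```c
#include <assert.h>
#include <unistd.h>

#define POOL_SIZE 4               /* mirrors RAW_FD_POOL_SIZE in the patch */

typedef struct FdPool {
    int fd0;                      /* primary descriptor */
    int fd0_inuse;                /* requests currently using fd0 */
    int fds[POOL_SIZE];           /* lazily dup()ed descriptors, -1 if empty */
    int inuse[POOL_SIZE];         /* 1 while a request holds the slot */
} FdPool;

static void fd_pool_init(FdPool *p, int fd)
{
    p->fd0 = fd;
    p->fd0_inuse = 0;
    for (int i = 0; i < POOL_SIZE; i++) {
        p->fds[i] = -1;
        p->inuse[i] = 0;
    }
}

/* Return fd0 if it is idle, else a free pooled dup, else share fd0. */
static int fd_pool_get(FdPool *p)
{
    if (p->fd0_inuse) {
        for (int i = 0; i < POOL_SIZE; i++) {
            if (p->fds[i] == -1) {
                p->fds[i] = dup(p->fd0);
                if (p->fds[i] == -1)
                    continue;     /* dup failed, try the next slot */
                p->inuse[i] = 0;
            }
            if (!p->inuse[i]) {
                p->inuse[i] = 1;
                return p->fds[i];
            }
        }
    }
    p->fd0_inuse++;
    return p->fd0;
}

/* Release a descriptor obtained from fd_pool_get(); pooled dups are closed. */
static void fd_pool_put(FdPool *p, int fd)
{
    if (fd == p->fd0) {
        assert(p->fd0_inuse > 0);
        p->fd0_inuse--;
        return;
    }
    for (int i = 0; i < POOL_SIZE; i++) {
        if (p->fds[i] == fd) {
            assert(p->inuse[i]);
            p->inuse[i] = 0;
            close(fd);
            p->fds[i] = -1;
            return;
        }
    }
}
```

A caller would pair every fd_pool_get() with exactly one fd_pool_put() on
completion or cancellation, which is what the raw_fd_pool_put() calls added
next to each qemu_aio_release() in the patch do. Once all four slots are
busy, further requests fall back to sharing fd0.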