From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1Jr8aZ-0007Na-0R for qemu-devel@nongnu.org;
	Wed, 30 Apr 2008 05:26:59 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1Jr8aX-0007Mk-5V for qemu-devel@nongnu.org;
	Wed, 30 Apr 2008 05:26:57 -0400
Received: from [199.232.76.173] (port=48792 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Jr8aW-0007ML-Ln
	for qemu-devel@nongnu.org; Wed, 30 Apr 2008 05:26:56 -0400
Received: from cantor.suse.de ([195.135.220.2] helo=mx1.suse.de)
	by monty-python.gnu.org with esmtps (TLS-1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60) (envelope-from ) id 1Jr8aV-00043a-Km
	for qemu-devel@nongnu.org; Wed, 30 Apr 2008 05:26:56 -0400
Message-ID: <48183A34.7030608@suse.de>
Date: Wed, 30 Apr 2008 11:21:56 +0200
From: Kevin Wolf
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [PATCH] Align file accesses with cache=off (O_DIRECT)
References: <4807514B.9040607@suse.de> <4815EE89.70407@suse.de>
	<1209459667.4328.7.camel@frecb07144> <48173591.9010609@suse.de>
	<1209484105.4248.27.camel@frecb07144> <48174B0B.5070904@suse.de>
	<1209487719.4248.43.camel@frecb07144>
In-Reply-To: <1209487719.4248.43.camel@frecb07144>
Content-Type: multipart/mixed; boundary="------------080200000808080404050303"
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: Laurent Vivier
Cc: qemu-devel@nongnu.org

This is a multi-part message in MIME format.
--------------080200000808080404050303
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

Laurent Vivier wrote:
>> Hm, yes. We could call raw_pread in raw_aio_read when O_DIRECT is used
>> and the request is not properly aligned. Is this what you meant?
>
> No, it was just a (stupid) comment. I think we must not convert
> asynchronous I/O to synchronous I/O.

Why not?
If I'm not mistaken (but if you think it's wrong, probably I am), the only
possible drawback should be performance. And we're not talking about
normal guest I/O here; those accesses are aligned anyway.

>> I think we agree that it's mostly item 3 why one would use O_DIRECT with
>> qemu. In terms of reliability, it is important that the data really is
>> written to the disk when the guest OS thinks so. But when for example
>> qemu crashes, I don't think it's too important if 40% or 50% of a
>> snapshot have already been written - it's unusable anyway. A sync
>> afterwards could be enough there.
>
> I don't speak about "qemu crashes" but about "host crashes".

Well, I've never seen a host crash where qemu survived. ;-) What I wanted
to say: if the snapshot is incomplete and unusable anyway, why bother
whether a few bytes more or less have been written?

> I'm not in the spirit "my patch is better than yours" (and I don't think
> so); but could you try to test my patch ? Because if I remember
> correctly I think It manages all cases and this can help you to find a
> solution (or perhaps you can add to your patch the part of my patch
> about block-qcow2.c)

I really didn't want to say that your code is bad or that it wouldn't
work. I just tried it, and it works fine as well. But the approaches are
quite different: your patch makes every user of the block driver
functions aware that O_DIRECT might be in effect, whereas my patch tries
to handle things in one common place, possibly at the cost of performance
(though I'm not sure about the bad performance anymore, now that I use
your fcntl method).

So what I like about my patch is that it is one change in one place which
should make everything work. Your patch could still be of use to speed
things up, e.g. by making qcow2 aware that there is something like
O_DIRECT (I'd have to run a real benchmark with both patches) and having
it align its buffers in the first place.
I'll attach the current version of my patch, which emulates AIO through
synchronous requests for unaligned buffers. In comparison, with your
patch the bootup of my test VM was slightly faster, but loading/saving of
snapshots was much faster with mine. Perhaps I'll try to combine them to
get the best of both.

Kevin

--------------080200000808080404050303
Content-Type: text/x-patch;
 name="align-odirect-accesses.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="align-odirect-accesses.patch"

Index: block-raw-posix.c
===================================================================
--- block-raw-posix.c.orig
+++ block-raw-posix.c
@@ -77,6 +77,7 @@ typedef struct BDRVRawState {
     int fd;
     int type;
+    int flags;
     unsigned int lseek_err_cnt;
 #if defined(__linux__)
     /* linux floppy specific */
@@ -111,6 +112,7 @@ static int raw_open(BlockDriverState *bs
         open_flags |= O_DIRECT;
 #endif
 
+    s->flags = open_flags;
     s->type = FTYPE_FILE;
 
     fd = open(filename, open_flags, 0644);
@@ -141,7 +143,14 @@ static int raw_open(BlockDriverState *bs
 #endif
 */
 
-static int raw_pread(BlockDriverState *bs, int64_t offset,
+/*
+ * offset and count are in bytes, but must be multiples of 512 for files
+ * opened with O_DIRECT. buf must be aligned to 512 bytes then.
+ *
+ * This function may be called without alignment if the caller ensures
+ * that O_DIRECT is not in effect.
+ */
+static int raw_pread_aligned(BlockDriverState *bs, int64_t offset,
                      uint8_t *buf, int count)
 {
     BDRVRawState *s = bs->opaque;
@@ -194,7 +203,14 @@ label__raw_read__success:
 
     return ret;
 }
 
-static int raw_pwrite(BlockDriverState *bs, int64_t offset,
+/*
+ * offset and count are in bytes, but must be multiples of 512 for files
+ * opened with O_DIRECT. buf must be aligned to 512 bytes then.
+ *
+ * This function may be called without alignment if the caller ensures
+ * that O_DIRECT is not in effect.
+ */
+static int raw_pwrite_aligned(BlockDriverState *bs, int64_t offset,
                       const uint8_t *buf, int count)
 {
     BDRVRawState *s = bs->opaque;
@@ -230,6 +246,67 @@ label__raw_write__success:
 
     return ret;
 }
 
+
+#ifdef O_DIRECT
+/*
+ * offset and count are in bytes and possibly not aligned. For files opened
+ * with O_DIRECT, necessary alignments are ensured before calling
+ * raw_pread_aligned to do the actual read.
+ */
+static int raw_pread(BlockDriverState *bs, int64_t offset,
+                     uint8_t *buf, int count)
+{
+    BDRVRawState *s = bs->opaque;
+
+    if (unlikely((s->flags & O_DIRECT) &&
+            (offset % 512 || count % 512 || (uintptr_t) buf % 512))) {
+
+        int ret;
+
+        // Temporarily disable O_DIRECT for unaligned access
+        fcntl(s->fd, F_SETFL, s->flags & ~O_DIRECT);
+        ret = raw_pread_aligned(bs, offset, buf, count);
+        fcntl(s->fd, F_SETFL, s->flags);
+
+        return ret;
+
+    } else {
+        return raw_pread_aligned(bs, offset, buf, count);
+    }
+}
+
+/*
+ * offset and count are in bytes and possibly not aligned. For files opened
+ * with O_DIRECT, necessary alignments are ensured before calling
+ * raw_pwrite_aligned to do the actual write.
+ */
+static int raw_pwrite(BlockDriverState *bs, int64_t offset,
+                      const uint8_t *buf, int count)
+{
+    BDRVRawState *s = bs->opaque;
+
+    if (unlikely((s->flags & O_DIRECT) &&
+            (offset % 512 || count % 512 || (uintptr_t) buf % 512))) {
+
+        int ret;
+
+        // Temporarily disable O_DIRECT for unaligned access
+        fcntl(s->fd, F_SETFL, s->flags & ~O_DIRECT);
+        ret = raw_pwrite_aligned(bs, offset, buf, count);
+        fcntl(s->fd, F_SETFL, s->flags);
+
+        return ret;
+    } else {
+        return raw_pwrite_aligned(bs, offset, buf, count);
+    }
+}
+
+#else
+#define raw_pread raw_pread_aligned
+#define raw_pwrite raw_pwrite_aligned
+#endif
+
+
 /***********************************************************/
 /* Unix AIO using POSIX AIO */
 
@@ -402,10 +479,26 @@ static BlockDriverAIOCB *raw_aio_read(Bl
         BlockDriverCompletionFunc *cb, void *opaque)
 {
     RawAIOCB *acb;
+    BDRVRawState *s = bs->opaque;
+
+    /*
+     * If O_DIRECT is used and the buffer is not aligned fall back
+     * to synchronous IO.
+     */
+    if (unlikely((s->flags & O_DIRECT) && ((uintptr_t) buf % 512))) {
+        int ret;
+
+        acb = qemu_aio_get(bs, cb, opaque);
+        ret = raw_pread(bs, 512 * sector_num, buf, 512 * nb_sectors);
+        acb->common.cb(acb->common.opaque, ret);
+        qemu_aio_release(acb);
+        return &acb->common;
+    }
 
     acb = raw_aio_setup(bs, sector_num, buf, nb_sectors, cb, opaque);
     if (!acb)
         return NULL;
+
     if (aio_read(&acb->aiocb) < 0) {
         qemu_aio_release(acb);
         return NULL;
@@ -418,6 +511,21 @@ static BlockDriverAIOCB *raw_aio_write(B
         BlockDriverCompletionFunc *cb, void *opaque)
 {
     RawAIOCB *acb;
+    BDRVRawState *s = bs->opaque;
+
+    /*
+     * If O_DIRECT is used and the buffer is not aligned fall back
+     * to synchronous IO.
+     */
+    if (unlikely((s->flags & O_DIRECT) && ((uintptr_t) buf % 512))) {
+        int ret;
+
+        acb = qemu_aio_get(bs, cb, opaque);
+        ret = raw_pwrite(bs, 512 * sector_num, buf, 512 * nb_sectors);
+        acb->common.cb(acb->common.opaque, ret);
+        qemu_aio_release(acb);
+        return &acb->common;
+    }
 
     acb = raw_aio_setup(bs, sector_num, (uint8_t*)buf, nb_sectors,
                         cb, opaque);
     if (!acb)

--------------080200000808080404050303--