* [RFC][PATCH] direct-io: btrfs: avoid splitting dio requests for non-btrfs filesystems
@ 2010-11-02 12:18 Christian Ehrhardt
2010-11-02 14:57 ` Josef Bacik
0 siblings, 1 reply; 3+ messages in thread
From: Christian Ehrhardt @ 2010-11-02 12:18 UTC (permalink / raw)
To: Jeff Moyer, Josef Bacik, Chris Mason,
linux-kernel@vger.kernel.org, linux-btrf
Hi,
this is about an issue newer kernels show, bysplitting direct I/O reque=
sts
into 4k pieces to directly merge them in the Block Device Layer afterwa=
rds.
If anyone is interested in own tests just use a simple command like
dd if=3D/mnt/test/test-dd1 of=3D/dev/null iflag=3Ddirect bs=3D64k count=
=3D1
in combination with blktrace.
The following patch is more a proposal for discussion than a solution, =
well
thats what RFC's are about right.
I'm unsure about names, but also if the approach in general is the righ=
t way.
It should apply to every 2.6.36 and 2.6.37 kernel.
I put everyone on CC who was involved in the patches leading to the cur=
rent
behavior.
Gr=FCsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance=20
--- cut here ---
Subject: [RFC][PATCH] direct-io: btrfs: avoid splitting dio requests fo=
r non-btrfs filesystems
=46rom: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Commit c2c6ca41 by Josef Bacik <josef@redhat.com> caused all direct I/O=
's to
be split into 4k requests before arriving in the block device layer.
This was later on partially fixed by Jeff Moyer <jmoyer@redhat.com> in
7a801ac6.
Jeffs fix improved the situation a lot, but eventually it still splits =
I/Os
for non-btrfs file systems as well were it wouldn't have to.
Eventually in my example on a ext2 filesystem it splits it every 4Mb wh=
ere
dio->boundary is evaluated to true.
In blktrace this looks like:
dd-910 [002] 38.762523: 94,8 A R 131264 + 8 <-=
(94,9) 131072
dd-910 [002] 38.762531: 94,8 Q R 131264 + 8 [d=
d]
dd-910 [002] 38.762535: 94,8 G R 131264 + 8 [d=
d]
dd-910 [002] 38.762537: 94,8 P N [dd]
dd-910 [002] 38.762539: 94,8 I R 131264 + 8 [d=
d]
dd-910 [002] 38.762544: 94,8 A R 131272 + 8 <-=
(94,9) 131080
dd-910 [002] 38.762544: 94,8 Q R 131272 + 8 [d=
d]
dd-910 [002] 38.762546: 94,8 M R 131272 + 8 [d=
d]
dd-910 [002] 38.762550: 94,8 A R 131280 + 8 <-=
(94,9) 131088
dd-910 [002] 38.762551: 94,8 Q R 131280 + 8 [d=
d]
dd-910 [002] 38.762551: 94,8 M R 131280 + 8 [d=
d]
dd-910 [002] 38.762556: 94,8 A R 131288 + 8 <-=
(94,9) 131096
dd-910 [002] 38.762557: 94,8 Q R 131288 + 8 [d=
d]
dd-910 [002] 38.762557: 94,8 M R 131288 + 8 [d=
d]
dd-910 [002] 38.762562: 94,8 A R 131296 + 8 <-=
(94,9) 131104
dd-910 [002] 38.762563: 94,8 Q R 131296 + 8 [d=
d]
dd-910 [002] 38.762564: 94,8 M R 131296 + 8 [d=
d]
dd-910 [002] 38.762568: 94,8 A R 131304 + 8 <-=
(94,9) 131112
dd-910 [002] 38.762569: 94,8 Q R 131304 + 8 [d=
d]
dd-910 [002] 38.762570: 94,8 M R 131304 + 8 [d=
d]
dd-910 [002] 38.762577: 94,8 A R 131312 + 8 <-=
(94,9) 131120
dd-910 [002] 38.762578: 94,8 Q R 131312 + 8 [d=
d]
dd-910 [002] 38.762579: 94,8 M R 131312 + 8 [d=
d]
dd-910 [002] 38.762584: 94,8 A R 131320 + 8 <-=
(94,9) 131128
dd-910 [002] 38.762584: 94,8 Q R 131320 + 8 [d=
d]
dd-910 [002] 38.762585: 94,8 M R 131320 + 8 [d=
d]
dd-910 [002] 38.762590: 94,8 A R 131328 + 8 <-=
(94,9) 131136
dd-910 [002] 38.762590: 94,8 Q R 131328 + 8 [d=
d]
dd-910 [002] 38.762591: 94,8 M R 131328 + 8 [d=
d]
dd-910 [002] 38.762596: 94,8 A R 131336 + 8 <-=
(94,9) 131144
dd-910 [002] 38.762597: 94,8 Q R 131336 + 8 [d=
d]
dd-910 [002] 38.762598: 94,8 M R 131336 + 8 [d=
d]
dd-910 [002] 38.762605: 94,8 A R 131344 + 16 <=
- (94,9) 131152
dd-910 [002] 38.762607: 94,8 Q R 131344 + 16 [=
dd]
dd-910 [002] 38.762608: 94,8 M R 131344 + 16 [=
dd]
dd-910 [002] 38.762611: 94,8 A R 131368 + 32 <=
- (94,9) 131176
dd-910 [002] 38.762612: 94,8 Q R 131368 + 32 [=
dd]
dd-910 [002] 38.762616: 94,8 G R 131368 + 32 [=
dd]
dd-910 [002] 38.762617: 94,8 I R 131368 + 32 [=
dd]
dd-910 [002] 38.762619: 94,8 U N [dd] 2
dd-910 [002] 38.762621: 94,8 D R 131264 + 96 [=
dd]
dd-910 [002] 38.762625: 94,8 D R 131368 + 32 [=
dd]
<idle>-0 [012] 38.763363: 94,8 C R 131264 + 96 [=
0]=20
<idle>-0 [015] 38.763797: 94,8 C R 131368 + 32 [=
0]
The usual behavior before both commits was:
dd-919 [002] 37.513685: 94,8 A R 7824 + 96 <- =
(94,9) 7632
dd-919 [002] 37.513693: 94,8 Q R 7824 + 96 [dd=
]
dd-919 [002] 37.513697: 94,8 G R 7824 + 96 [dd=
]
dd-919 [002] 37.513700: 94,8 P N [dd]
dd-919 [002] 37.513701: 94,8 I R 7824 + 96 [dd=
]
dd-919 [002] 37.513794: 94,8 A R 7928 + 32 <- =
(94,9) 7736
dd-919 [002] 37.513795: 94,8 Q R 7928 + 32 [dd=
]
dd-919 [002] 37.513800: 94,8 G R 7928 + 32 [dd=
]
dd-919 [002] 37.513802: 94,8 I R 7928 + 32 [dd=
]
dd-919 [002] 37.513804: 94,8 U N [dd] 2
dd-919 [002] 37.513806: 94,8 D R 7824 + 96 [dd=
]
dd-919 [002] 37.513810: 94,8 D R 7928 + 32 [dd=
]
<idle>-0 [011] 37.514362: 94,8 C R 7824 + 96 [0]=
=20
<idle>-0 [014] 37.514728: 94,8 C R 7928 + 32 [0]
That remaining split is cause by the test for:
"dio->final_block_in_bio !=3D dio->cur_page_block".
As this was in the code for a long time I just assume it is right.
So eventually for the 64k request in the example this patch improves th=
e
effective submissions that get to the block device layer from:
10x4k, 1x8k, 1x16k to 1x48k & 1x16k which is much better.
Througput impact is small, but in terms of cpu consumption this is visi=
ble
by a single digit percentage depending on the incoming request size.
The solution looking for comments or alternatives in this RFC patch add=
s a new
kiocb flag that let filesystems specify if they need these workaround t=
o
separate meta data reads - if not, like all pre-btrfs filesystems the d=
io code
doesn't have to cause this extra work.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
[diffstat]
fs/btrfs/inode.c | 4 ++++
fs/direct-io.c | 9 ++++++++-
include/linux/aio.h | 5 +++++
3 files changed, 17 insertions(+), 1 deletion(-)
[diff]
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c038644..1126185 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -37,6 +37,7 @@
#include <linux/posix_acl.h>
#include <linux/falloc.h>
#include <linux/slab.h>
+#include <linux/aio.h>
#include "compat.h"
#include "ctree.h"
#include "disk-io.h"
@@ -5822,6 +5823,9 @@ static ssize_t btrfs_direct_IO(int rw, struct kio=
cb *iocb,
free_extent_state(cached_state);
cached_state =3D NULL;
=20
+ /* btrfs cannot handle logically non-contiguous requests */
+ kiocb_set_separate_meta_reads(iocb);
+
ret =3D __blockdev_direct_IO(rw, iocb, inode,
BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 48d74c7..6d2dcb2 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
+#include <linux/aio.h>
#include <asm/atomic.h>
=20
/*
@@ -79,6 +80,7 @@ struct dio {
sector_t final_block_in_request;/* doesn't change */
unsigned first_block_in_page; /* doesn't change, Used only once */
int boundary; /* prev block is at a boundary */
+ int separate_meta_reads; /* separate in I/O submission */
int reap_counter; /* rate limit reaping */
get_block_t *get_block; /* block mapping function */
dio_iodone_t *end_io; /* IO completion function */
@@ -659,7 +661,7 @@ static int dio_send_cur_page(struct dio *dio)
* Submit now if the underlying fs is about to perform a
* metadata read
*/
- else if (dio->boundary)
+ else if (dio->separate_meta_reads && dio->boundary)
dio_bio_submit(dio);
}
=20
@@ -1245,6 +1247,11 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb,=
struct inode *inode,
dio->is_async =3D !is_sync_kiocb(iocb) && !((rw & WRITE) &&
(end > i_size_read(inode)));
=20
+ /*
+ * some filesystems e.g. btrfs need to separate metadata read
+ */
+ dio->separate_meta_reads =3D kiocb_needs_separate_meta_reads(iocb);
+
retval =3D direct_io_worker(rw, iocb, inode, iov, offset,
nr_segs, blkbits, get_block, end_io,
submit_io, dio);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 7a8db41..9ee9c6e 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -34,6 +34,8 @@ struct kioctx;
/* #define KIF_LOCKED 0 */
#define KIF_KICKED 1
#define KIF_CANCELLED 2
+/* to separate meta reads */
+#define KIF_SEPARATE_META 3
=20
#define kiocbTryLock(iocb) test_and_set_bit(KIF_LOCKED, &(iocb)->ki_fl=
ags)
#define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_fl=
ags)
@@ -50,6 +52,9 @@ struct kioctx;
#define kiocbIsKicked(iocb) test_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbIsCancelled(iocb) test_bit(KIF_CANCELLED, &(iocb)->ki_fla=
gs)
=20
+#define kiocb_set_separate_meta_reads(iocb) set_bit(KIF_SEPARATE_META,=
&(iocb)->ki_flags)
+#define kiocb_needs_separate_meta_reads(iocb) (KIF_SEPARATE_META & (io=
cb)->ki_flags)
+
/* is there a better place to document function pointer methods? */
/**
* ki_retry - iocb forward progress callback
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [RFC][PATCH] direct-io: btrfs: avoid splitting dio requests for non-btrfs filesystems
2010-11-02 12:18 [RFC][PATCH] direct-io: btrfs: avoid splitting dio requests for non-btrfs filesystems Christian Ehrhardt
@ 2010-11-02 14:57 ` Josef Bacik
2010-11-02 18:58 ` Christoph Hellwig
0 siblings, 1 reply; 3+ messages in thread
From: Josef Bacik @ 2010-11-02 14:57 UTC (permalink / raw)
To: Christian Ehrhardt
Cc: Jeff Moyer, Josef Bacik, Chris Mason,
linux-kernel@vger.kernel.org, linux-btrfs,
linux-fsdevel@vger.kernel.org
On Tue, Nov 02, 2010 at 01:18:42PM +0100, Christian Ehrhardt wrote:
> Hi,
> this is about an issue newer kernels show, bysplitting direct I/O req=
uests
> into 4k pieces to directly merge them in the Block Device Layer after=
wards.
>=20
> If anyone is interested in own tests just use a simple command like
> dd if=3D/mnt/test/test-dd1 of=3D/dev/null iflag=3Ddirect bs=3D64k cou=
nt=3D1
> in combination with blktrace.
>=20
> The following patch is more a proposal for discussion than a solution=
, well
> thats what RFC's are about right.
> I'm unsure about names, but also if the approach in general is the ri=
ght way.
> It should apply to every 2.6.36 and 2.6.37 kernel.
>=20
> I put everyone on CC who was involved in the patches leading to the c=
urrent
> behavior.
>=20
> Gr=FCsse / regards,
> Christian Ehrhardt
> IBM Linux Technology Center, System z Linux Performance=20
>=20
> --- cut here ---
>=20
> Subject: [RFC][PATCH] direct-io: btrfs: avoid splitting dio requests =
for non-btrfs filesystems
>=20
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>=20
> Commit c2c6ca41 by Josef Bacik <josef@redhat.com> caused all direct I=
/O's to
> be split into 4k requests before arriving in the block device layer.
> This was later on partially fixed by Jeff Moyer <jmoyer@redhat.com> i=
n
> 7a801ac6.
>=20
> Jeffs fix improved the situation a lot, but eventually it still split=
s I/Os
> for non-btrfs file systems as well were it wouldn't have to.
>=20
> Eventually in my example on a ext2 filesystem it splits it every 4Mb =
where
> dio->boundary is evaluated to true.
>=20
> In blktrace this looks like:
> dd-910 [002] 38.762523: 94,8 A R 131264 + 8 =
<- (94,9) 131072
> dd-910 [002] 38.762531: 94,8 Q R 131264 + 8 =
[dd]
> dd-910 [002] 38.762535: 94,8 G R 131264 + 8 =
[dd]
> dd-910 [002] 38.762537: 94,8 P N [dd]
> dd-910 [002] 38.762539: 94,8 I R 131264 + 8 =
[dd]
> dd-910 [002] 38.762544: 94,8 A R 131272 + 8 =
<- (94,9) 131080
> dd-910 [002] 38.762544: 94,8 Q R 131272 + 8 =
[dd]
> dd-910 [002] 38.762546: 94,8 M R 131272 + 8 =
[dd]
> dd-910 [002] 38.762550: 94,8 A R 131280 + 8 =
<- (94,9) 131088
> dd-910 [002] 38.762551: 94,8 Q R 131280 + 8 =
[dd]
> dd-910 [002] 38.762551: 94,8 M R 131280 + 8 =
[dd]
> dd-910 [002] 38.762556: 94,8 A R 131288 + 8 =
<- (94,9) 131096
> dd-910 [002] 38.762557: 94,8 Q R 131288 + 8 =
[dd]
> dd-910 [002] 38.762557: 94,8 M R 131288 + 8 =
[dd]
> dd-910 [002] 38.762562: 94,8 A R 131296 + 8 =
<- (94,9) 131104
> dd-910 [002] 38.762563: 94,8 Q R 131296 + 8 =
[dd]
> dd-910 [002] 38.762564: 94,8 M R 131296 + 8 =
[dd]
> dd-910 [002] 38.762568: 94,8 A R 131304 + 8 =
<- (94,9) 131112
> dd-910 [002] 38.762569: 94,8 Q R 131304 + 8 =
[dd]
> dd-910 [002] 38.762570: 94,8 M R 131304 + 8 =
[dd]
> dd-910 [002] 38.762577: 94,8 A R 131312 + 8 =
<- (94,9) 131120
> dd-910 [002] 38.762578: 94,8 Q R 131312 + 8 =
[dd]
> dd-910 [002] 38.762579: 94,8 M R 131312 + 8 =
[dd]
> dd-910 [002] 38.762584: 94,8 A R 131320 + 8 =
<- (94,9) 131128
> dd-910 [002] 38.762584: 94,8 Q R 131320 + 8 =
[dd]
> dd-910 [002] 38.762585: 94,8 M R 131320 + 8 =
[dd]
> dd-910 [002] 38.762590: 94,8 A R 131328 + 8 =
<- (94,9) 131136
> dd-910 [002] 38.762590: 94,8 Q R 131328 + 8 =
[dd]
> dd-910 [002] 38.762591: 94,8 M R 131328 + 8 =
[dd]
> dd-910 [002] 38.762596: 94,8 A R 131336 + 8 =
<- (94,9) 131144
> dd-910 [002] 38.762597: 94,8 Q R 131336 + 8 =
[dd]
> dd-910 [002] 38.762598: 94,8 M R 131336 + 8 =
[dd]
> dd-910 [002] 38.762605: 94,8 A R 131344 + 16=
<- (94,9) 131152
> dd-910 [002] 38.762607: 94,8 Q R 131344 + 16=
[dd]
> dd-910 [002] 38.762608: 94,8 M R 131344 + 16=
[dd]
> dd-910 [002] 38.762611: 94,8 A R 131368 + 32=
<- (94,9) 131176
> dd-910 [002] 38.762612: 94,8 Q R 131368 + 32=
[dd]
> dd-910 [002] 38.762616: 94,8 G R 131368 + 32=
[dd]
> dd-910 [002] 38.762617: 94,8 I R 131368 + 32=
[dd]
> dd-910 [002] 38.762619: 94,8 U N [dd] 2
> dd-910 [002] 38.762621: 94,8 D R 131264 + 96=
[dd]
> dd-910 [002] 38.762625: 94,8 D R 131368 + 32=
[dd]
> <idle>-0 [012] 38.763363: 94,8 C R 131264 + 96=
[0]=20
> <idle>-0 [015] 38.763797: 94,8 C R 131368 + 32=
[0]
>=20
> The usual behavior before both commits was:
> dd-919 [002] 37.513685: 94,8 A R 7824 + 96 <=
- (94,9) 7632
> dd-919 [002] 37.513693: 94,8 Q R 7824 + 96 [=
dd]
> dd-919 [002] 37.513697: 94,8 G R 7824 + 96 [=
dd]
> dd-919 [002] 37.513700: 94,8 P N [dd]
> dd-919 [002] 37.513701: 94,8 I R 7824 + 96 [=
dd]
> dd-919 [002] 37.513794: 94,8 A R 7928 + 32 <=
- (94,9) 7736
> dd-919 [002] 37.513795: 94,8 Q R 7928 + 32 [=
dd]
> dd-919 [002] 37.513800: 94,8 G R 7928 + 32 [=
dd]
> dd-919 [002] 37.513802: 94,8 I R 7928 + 32 [=
dd]
> dd-919 [002] 37.513804: 94,8 U N [dd] 2
> dd-919 [002] 37.513806: 94,8 D R 7824 + 96 [=
dd]
> dd-919 [002] 37.513810: 94,8 D R 7928 + 32 [=
dd]
> <idle>-0 [011] 37.514362: 94,8 C R 7824 + 96 [=
0]=20
> <idle>-0 [014] 37.514728: 94,8 C R 7928 + 32 [=
0]
>=20
> That remaining split is cause by the test for:
> "dio->final_block_in_bio !=3D dio->cur_page_block".
> As this was in the code for a long time I just assume it is right.
>=20
> So eventually for the 64k request in the example this patch improves =
the
> effective submissions that get to the block device layer from:
> 10x4k, 1x8k, 1x16k to 1x48k & 1x16k which is much better.
>=20
> Througput impact is small, but in terms of cpu consumption this is vi=
sible
> by a single digit percentage depending on the incoming request size.
>=20
> The solution looking for comments or alternatives in this RFC patch a=
dds a new
> kiocb flag that let filesystems specify if they need these workaround=
to
> separate meta data reads - if not, like all pre-btrfs filesystems the=
dio code
> doesn't have to cause this extra work.
>=20
So this brings up an important question, which is how badly do we want =
to honor
buffer_boundary()? The whole reason I added the logical offset check w=
as because
we ignore buffer_boundary() if there is no bio currently. So for BTRFS=
we would
do this
map extent
set buffer boundary
but if this is the first part of the IO and there isn't a bio already s=
etup,
dio_new_bio clears dio->boundary, so the next extent we would get may n=
ot be
logically contiguous and hilarity would ensue. I chose not to fix the
dio->boundary clearing because I was worried I would affect other fs's =
workloads
(which I did anyway because of my bug). So maybe the right idea is to =
rip out
my logical offset tests altogether and fix dio so we treat buffer_bound=
ary()
like gospel. That way Btrfs can get what it needs without having this =
weird
special code, and then we can look at how other fs's set buffer_boundar=
y (I'm
pretty sure ext2/3 are the only ones) and make sure they are setting it=
when
they really mean to. Thanks,
Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [RFC][PATCH] direct-io: btrfs: avoid splitting dio requests for non-btrfs filesystems
2010-11-02 14:57 ` Josef Bacik
@ 2010-11-02 18:58 ` Christoph Hellwig
0 siblings, 0 replies; 3+ messages in thread
From: Christoph Hellwig @ 2010-11-02 18:58 UTC (permalink / raw)
To: Josef Bacik
Cc: Christian Ehrhardt, Jeff Moyer, Josef Bacik, Chris Mason,
linux-kernel@vger.kernel.org, linux-btrfs,
linux-fsdevel@vger.kernel.org
On Tue, Nov 02, 2010 at 10:57:18AM -0400, Josef Bacik wrote:
> (which I did anyway because of my bug). So maybe the right idea is to rip out
> my logical offset tests altogether and fix dio so we treat buffer_boundary()
> like gospel. That way Btrfs can get what it needs without having this weird
> special code, and then we can look at how other fs's set buffer_boundary (I'm
> pretty sure ext2/3 are the only ones) and make sure they are setting it when
> they really mean to.
That sounds pretty reasonable to me. I really don't like the flag in
the kiocb in this patch, and handling it as part of the get_blocks
callback sounds much better to me. I don't know enough about the
bounary blocks to know if we can reuse them - if we can it's perfect,
if not another buffer flag seems like the way to go.
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2010-11-02 18:58 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-02 12:18 [RFC][PATCH] direct-io: btrfs: avoid splitting dio requests for non-btrfs filesystems Christian Ehrhardt
2010-11-02 14:57 ` Josef Bacik
2010-11-02 18:58 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).