public inbox for linux-kernel@vger.kernel.org
* [rfc patch] optimize o_direct on block device
@ 2006-11-30  3:25 Chen, Kenneth W
From: Chen, Kenneth W @ 2006-11-30  3:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

I've been complaining about O_DIRECT I/O processing being exceedingly
complex and slow since March 2005; see the posting below:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111033309732261&w=2

At that time, a patch was written for the raw device driver to demonstrate
that large performance headroom is achievable (~20% speedup for a micro-
benchmark and ~2% for a db transaction processing benchmark) with a tight
I/O submission processing loop.

Since the raw device driver is slowly being phased out, I've rewritten the
patch for block devices.  O_DIRECT on a block device is much simpler than
O_DIRECT on a file system.  Part of the reason direct_io_worker is so
complex is file system O_DIRECT, which needs to perform block allocation,
hole detection, file extension on write, and handle tons of other corner
cases.  The end result is that it takes tons of CPU time to submit an I/O.

For a block device, block allocation is much simpler, so I can write a
really tight double loop that iterates over each iovec, and over each page
within the iovec, to construct and prepare a bio structure and then submit
it to the block layer.

So here it is, posted for comments.

A few notes on the patch:

(1) I need a vector structure similar to pagevec; however, pagevec doesn't
    have everything I need, e.g., an iterator variable.  So I created a
    new struct pvec.  Maybe something can be worked out with pagevec?

(2) There is an inconsistency for synchronous I/O: the condition for
    updating ppos and the condition for waiting on a sync kiocb are
    incompatible.  The call chain looks like the following:

    do_sync_read
       generic_file_aio_read
         ...
           blkdev_direct_IO

    do_sync_read will wait for I/O completion only if the lower function
    returned -EIOCBQUEUED. Updating ppos is done via generic_file_aio_read,
    but only if the lower function returned a positive value. So I either
    have to construct my own wait_on_sync_kiocb, or hack around the ppos
    update.

(3) The I/O length setup in the kiocb is inconsistent between a normal read
    and a vectored read or aio_read: one passes it in kiocb->ki_left while
    the others pass the total length in kiocb->ki_nbytes.  I've made them
    consistent in the read path (note to self: I need to do the same thing
    in do_sync_write).



Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>


--- ./fs/block_dev.c.orig	2006-11-29 14:52:20.000000000 -0800
+++ ./fs/block_dev.c	2006-11-29 16:45:36.000000000 -0800
@@ -129,43 +129,147 @@ blkdev_get_block(struct inode *inode, se
 	return 0;
 }
 
-static int
-blkdev_get_blocks(struct inode *inode, sector_t iblock,
-		struct buffer_head *bh, int create)
+int blk_end_aio(struct bio *bio, unsigned int bytes_done, int error)
 {
-	sector_t end_block = max_block(I_BDEV(inode));
-	unsigned long max_blocks = bh->b_size >> inode->i_blkbits;
+	struct kiocb* iocb = bio->bi_private;
+	atomic_t* bio_count = (atomic_t*) &iocb->private;
+	long res;
+
+	if ((bio->bi_rw & 1) == READ)
+		bio_check_pages_dirty(bio);
+	else {
+		bio_release_pages(bio);
+		bio_put(bio);
+	}
 
-	if ((iblock + max_blocks) > end_block) {
-		max_blocks = end_block - iblock;
-		if ((long)max_blocks <= 0) {
-			if (create)
-				return -EIO;	/* write fully beyond EOF */
-			/*
-			 * It is a read which is fully beyond EOF.  We return
-			 * a !buffer_mapped buffer
-			 */
-			max_blocks = 0;
-		}
+	if (error)
+		iocb->ki_left = -EIO;
+
+	if (atomic_dec_and_test(bio_count)) {
+		res = (iocb->ki_left < 0) ? iocb->ki_left : iocb->ki_nbytes;
+		aio_complete(iocb, res, 0);
 	}
 
-	bh->b_bdev = I_BDEV(inode);
-	bh->b_blocknr = iblock;
-	bh->b_size = max_blocks << inode->i_blkbits;
-	if (max_blocks)
-		set_buffer_mapped(bh);
 	return 0;
 }
 
+#define VEC_SIZE	16
+struct pvec {
+	unsigned short nr;
+	unsigned short idx;
+	struct page *page[VEC_SIZE];
+};
+
+
+struct page *blk_get_page(unsigned long addr, size_t count, int rw,
+			  struct pvec *pvec)
+{
+	int ret, nr_pages;
+	if (pvec->idx == pvec->nr) {
+		nr_pages = (addr + count + PAGE_SIZE - 1) / PAGE_SIZE -
+			    addr / PAGE_SIZE;
+		nr_pages = min(nr_pages, VEC_SIZE);
+		down_read(&current->mm->mmap_sem);
+		ret = get_user_pages(current, current->mm, addr, nr_pages,
+				     rw==READ, 0, pvec->page, NULL);
+		up_read(&current->mm->mmap_sem);
+		if (ret < 0)
+			return ERR_PTR(ret);
+		pvec->nr = ret;
+		pvec->idx = 0;
+	}
+	return pvec->page[pvec->idx++];
+}
+
 static ssize_t
 blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-			loff_t offset, unsigned long nr_segs)
+		 loff_t pos, unsigned long nr_segs)
 {
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode)));
+	unsigned blocksize_mask = (1<< blkbits) - 1;
+	unsigned long seg, nvec, cur_off, cur_len;
+
+	unsigned long addr;
+	size_t count, nbytes = iocb->ki_nbytes;
+	loff_t size;
+	struct bio * bio;
+	atomic_t *bio_count = (atomic_t *) &iocb->private;
+	struct page *page;
+	struct pvec pvec = {.nr = 0, .idx = 0, };
+
+	size = i_size_read(inode);
+	if (pos + nbytes > size)
+		nbytes = size - pos;
+
+	seg = 0;
+	addr = (unsigned long) iov[0].iov_base;
+	count = iov[0].iov_len;
+	atomic_set(bio_count, 1);
+
+	/* first check the alignment */
+	if (addr & blocksize_mask || count & blocksize_mask ||
+		pos & blocksize_mask)
+		return -EINVAL;
+
+	while (nbytes) {
+		/* roughly estimate number of bio vec needed */
+		nvec = (nbytes + PAGE_SIZE - 1) / PAGE_SIZE;
+		nvec = max(nvec, nr_segs - seg);
+		nvec = min(nvec, (unsigned long) BIO_MAX_PAGES);
+
+		bio = bio_alloc(GFP_KERNEL, nvec);
+		bio->bi_bdev = I_BDEV(inode);
+		bio->bi_end_io = blk_end_aio;
+		bio->bi_private = iocb;
+		bio->bi_sector = pos >> blkbits;
+same_bio:
+		cur_off = addr & ~PAGE_MASK;
+		cur_len = PAGE_SIZE - cur_off;
+		if (count < cur_len)
+			cur_len = count;
+
+		page = blk_get_page(addr, count, rw, &pvec);
+		if (IS_ERR(page)) {
+			bio_release_pages(bio);
+			bio_put(bio);
+			if (atomic_read(bio_count) == 1)
+				return PTR_ERR(page);
+			break;
+		}
+
+		if (bio_add_page(bio, page, cur_len, cur_off)) {
+			pos += cur_len;
+			addr += cur_len;
+			count -= cur_len;
+			nbytes -= cur_len;
+
+			if (count)
+				goto same_bio;
+			if (++seg < nr_segs) {
+				addr = (unsigned long) iov[seg].iov_base;
+				count = iov[seg].iov_len;
+				if (!(addr & blocksize_mask ||
+				      count & blocksize_mask))
+					goto same_bio;
+			}
+		}
+
+		/* bio is ready, submit it */
+		if (rw == READ)
+			bio_set_pages_dirty(bio);
+		atomic_inc(bio_count);
+		submit_bio(rw, bio);
+	}
+
+	nbytes = iocb->ki_nbytes = iocb->ki_nbytes - nbytes;
+	iocb->ki_pos += nbytes;
+
+	blk_run_address_space(inode->i_mapping);
+	if (atomic_dec_and_test(bio_count))
+		aio_complete(iocb, nbytes, 0);
 
-	return blockdev_direct_IO_no_locking(rw, iocb, inode, I_BDEV(inode),
-				iov, offset, nr_segs, blkdev_get_blocks, NULL);
+	return -EIOCBQUEUED;
 }
 
 static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
--- ./fs/read_write.c.orig	2006-11-29 14:46:54.000000000 -0800
+++ ./fs/read_write.c	2006-11-29 14:52:43.000000000 -0800
@@ -235,7 +235,7 @@ ssize_t do_sync_read(struct file *filp, 
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
-	kiocb.ki_left = len;
+	kiocb.ki_nbytes = kiocb.ki_left = len;
 
 	for (;;) {
 		ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
--- ./fs/bio.c.orig	2006-11-29 14:46:54.000000000 -0800
+++ ./fs/bio.c	2006-11-29 14:52:43.000000000 -0800
@@ -931,7 +931,7 @@ void bio_set_pages_dirty(struct bio *bio
 	}
 }
 
-static void bio_release_pages(struct bio *bio)
+void bio_release_pages(struct bio *bio)
 {
 	struct bio_vec *bvec = bio->bi_io_vec;
 	int i;
--- ./include/linux/bio.h.orig	2006-11-29 15:17:27.000000000 -0800
+++ ./include/linux/bio.h	2006-11-29 15:18:48.000000000 -0800
@@ -309,6 +309,7 @@ extern struct bio *bio_map_kern(struct r
 				gfp_t);
 extern void bio_set_pages_dirty(struct bio *bio);
 extern void bio_check_pages_dirty(struct bio *bio);
+extern void bio_release_pages(struct bio *bio);
 extern struct bio *bio_copy_user(struct request_queue *, unsigned long, unsigned int, int);
 extern int bio_uncopy_user(struct bio *);
 void zero_fill_bio(struct bio *bio);


* Re: [rfc patch] optimize o_direct on block device
From: Zach Brown @ 2006-11-30 21:45 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Andrew Morton, linux-kernel, Chris Mason

> At that time, a patch was written for raw device to demonstrate that
> large performance head room is achievable (at ~20% speedup for micro-
> benchmark and ~2% for db transaction processing benchmark) with a
> tight I/O submission processing loop.

Where exactly does the benefit come from?  icache misses?  "atomic"  
ops leading to pipeline flushes?

I ask to find out just what exactly it would take to get fs/direct-io.c
to do the job efficiently on behalf of the block devices so that they
don't have to grow code that duplicates the 48 billion fiddly little
behavioural quirks of O_DIRECT.

I'm cc:ing Chris directly 'cause he's been fighting with fs/direct-io.c
recently, too.

> (1) I need a vector structure similar to pagevec, however, pagevec doesn't
>     have everything that I need, i.e., an iterator variable.  So I create a
>     new struct pvec.  Maybe something can be worked out with pagevec?

I don't think you need to duplicate the pagevec functionality in a new
struct.  If nothing else you'd embed a pagevec in a wrapper struct which
includes the other fields you're using.

> (2) there are some inconsistency for synchronous I/O: condition to update
>     ppos and condition to wait on sync_kiocb is incompatible.  Call chain
>     looks like the following:
>
>     do_sync_read
>        generic_file_aio_read
>          ...
>            blkdev_direct_IO
>
>     do_sync_read will wait for I/O completion only if lower function
>     returned -EIOCBQUEUED. Updating ppos is done via generic_file_aio_read,
>     but only if the lower function returned positive value. So I either
>     have to construct my own wait_on_sync_kiocb, or hack around the ppos
>     update.

Yeah, this stuff is a mess.

If memory serves, this works out because nothing currently uses the
AIO completion paths (-EIOCBQUEUED, aio_complete()) for sync iocbs.
That is, they test is_sync_kiocb(), block internally, and return the
number of bytes transferred.

Changing this implies moving some work that's done in the generic
return-code processing path over into the aio completion path, like
extending i_size.

> +	atomic_t* bio_count = (atomic_t*) &iocb->private;

Is atomic_t defined to fit in a long?  Right now it seems to (all
arches use 'int' or 's32').  That'll be fun to debug if some joker arch
decides to put some debugging help in the atomic_t struct, etc.

BUILD_BUG_ON() if the sizes don't fit, I guess, if we have to do it  
this way.  Maybe something somewhere does that already.

> +		nr_pages = (addr + count + PAGE_SIZE - 1) / PAGE_SIZE -
> +			    addr / PAGE_SIZE;

Ugh, we have *got* to use some macros for this stuff.  Every single  
time we add code like this a handful of brains strain to make sure  
corner cases (wrapping, off-by-one, type conversion, order of  
operations, page straddling) are covered.

> +	size = i_size_read(inode);
> +	if (pos + nbytes > size)
> +		nbytes = size - pos;

Has something else made sure that pos won't be > size?  I don't know  
the block device world all that well.

> +		bio = bio_alloc(GFP_KERNEL, nvec);
> +		bio->bi_bdev = I_BDEV(inode);

Not checking for bio allocation failure?

> +		page = blk_get_page(addr, count, rw, &pvec);

It looks like count is the iov_len and isn't clamped to the block  
device size like nbytes.  So we could still have 'count' bytes mapped  
after nbytes hits the end of the device and stops building IO.  I  
don't see the function cleaning up these pages that were pinned in  
the user region that extended beyond the block device.  I guess this  
wants to be 'nbytes'.

Indeed, what happens if nbytes runs out when a bio is partially  
full?  Is it not submitted?  I feel like I'm missing something.

> +		if (IS_ERR(page)) {
> +			bio_release_pages(bio);
> +			bio_put(bio);
> +			if (atomic_read(bio_count) == 1)
> +				return PTR_ERR(page);

Does this mean that a fault part-way through an IO that exits from
here won't update ki_left, and so the eventual aio_complete() will be
for the full requested IO rather than the partial IOs that completed?

> +			if (++seg < nr_segs) {
> +				addr = (unsigned long) iov[seg].iov_base;
> +				count = iov[seg].iov_len;

Am I missing where the alignment of the iovec elements beyond the  
first are being checked?

> +	}
> +
> +	nbytes = iocb->ki_nbytes = iocb->ki_nbytes - nbytes;
> +	iocb->ki_pos += nbytes;
> +
> +	blk_run_address_space(inode->i_mapping);
> +	if (atomic_dec_and_test(bio_count))
> +		aio_complete(iocb, nbytes, 0);

If blk_get_page() returned an error, but didn't return from the  
function 'cause it didn't drop the last ref, then that error isn't  
reflected here.  But the pending IO could have completed by the time  
we got here.  I think that makes the third different return code to  
expect from an IO that faults part-way through.

Backing up a bit, I'm not sure it's worth giving ourselves the  
opportunity to make all these mistakes by copying this code over.   
2%, on an extremely aggressive IO subsystem?

- z


* RE: [rfc patch] optimize o_direct on block device
From: Chen, Kenneth W @ 2006-12-01  6:16 UTC (permalink / raw)
  To: 'Zach Brown'; +Cc: Andrew Morton, linux-kernel, Chris Mason

Zach Brown wrote on Thursday, November 30, 2006 1:45 PM
> > At that time, a patch was written for raw device to demonstrate that
> > large performance head room is achievable (at ~20% speedup for micro-
> > benchmark and ~2% for db transaction processing benchmark) with a  
> > tight I/O submission processing loop.
> 
> Where exactly does the benefit come from?  icache misses?  "atomic"  
> ops leading to pipeline flushes?

It benefits from a shorter path length: it takes much less time to process
one I/O request, in both the submit and completion paths.  I always think
in terms of how many instructions, or clock ticks, it takes to convert a
user request into a bio, submit it, and, on the return path, process the
bio callback function and do the appropriate I/O completion (sync or
async).  The stock 2.6.19 kernel takes about 5.17 micro-seconds to process
one 4K aligned DIO (just the submit and completion path, excluding disk
I/O latency).  With the patch, the time is reduced to 4.26 us.

This time may look small compared to disk latency.  But with a
multi-threaded app using the native AIO interface, the metric is important
because we are CPU bound instead of traditionally I/O bound.  So software
efficiency is critical.


> I ask to find out just what exactly it would take to get fs/direct-io.c
> do the job efficiently on behalf of the block devices so that it
> doesn't have to grow code that duplicates the 48 billion fiddly
> little behavioural quirks of O_DIRECT.

I attempted that first, but fell flat with all the 48 billion corner
cases that need to be handled in the innermost loop.


> > (2) there are some inconsistency for synchronous I/O: condition to update
> >     ppos and condition to wait on sync_kiocb is incompatible.  Call chain
> >     looks like the following:
> >
> >     do_sync_read
> >        generic_file_aio_read
> >          ...
> >            blkdev_direct_IO
> >
> >     do_sync_read will wait for I/O completion only if lower function
> >     returned -EIOCBQUEUED. Updating ppos is done via
> >     generic_file_aio_read, but only if the lower function returned
> >     positive value. So I either have to construct my own
> >     wait_on_sync_kiocb, or hack around the ppos update.
> 
> Yeah, this stuff is a mess.
> 
> If memory serves this works out because nothing currently uses the
> AIO completion paths (-EIOCBQUEUED, aio_complete()) for sync iocbs.
> That is, they test is_sync_kiocb() and block internally and return the
> number of bytes transferred.
> 
> Changing this implies moving some work that's done in the generic  
> return code processing path over into the aio completion path.  Like  
> extending i_size.

Right. A block device never changes i_size on write, so I'm utilizing the
blocking call in the generic path that is currently never exercised.


> > +	atomic_t* bio_count = (atomic_t*) &iocb->private;
> 
> Is atomic_t defined to fit in a long?  Right now it seems to.. (all  
> use 'int' or 's32').  That'll be fun to debug if some joker arch  
> decides to put some debugging help in the atomic_t struct, etc.
> 
> BUILD_BUG_ON() if the sizes don't fit, I guess, if we have to do it  
> this way.  Maybe something somewhere does that already.

Good suggestion, I will add that.


> > +	size = i_size_read(inode);
> > +	if (pos + nbytes > size)
> > +		nbytes = size - pos;
> 
> Has something else made sure that pos won't be > size?  I don't know  
> the block device world all that well.

generic_file_aio_read checks pos < size before calling down into the
direct_IO method, so the check is already there.  But you are right that
the equivalent check is missing in the write path.  I need to chew on this
a bit more.


> > +		bio = bio_alloc(GFP_KERNEL, nvec);
> > +		bio->bi_bdev = I_BDEV(inode);
> 
> Not checking for bio allocation failure?

I should've added a big fat comment that bio_alloc never fails with the
GFP_KERNEL flag.  It allocates from a mempool that guarantees availability.


> > +		page = blk_get_page(addr, count, rw, &pvec);
> 
> It looks like count is the iov_len and isn't clamped to the block  
> device size like nbytes.  So we could still have 'count' bytes mapped  
> after nbytes hits the end of the device and stops building IO.  I  
> don't see the function cleaning up these pages that were pinned in  
> the user region that extended beyond the block device.  I guess this  
> wants to be 'nbytes'.

I think I adjusted nbytes so it never goes beyond the end of the disk, so
the above scenario won't happen.  I will double check.  Using nbytes
wouldn't be correct, though, because the vectors aren't contiguous in
virtual address space; I really have to make a separate get_user_pages
call for each, further down in blk_get_page().


> Indeed, what happens if nbytes runs out when a bio is partially  
> full?  Is it not submitted?  I feel like I'm missing something.

No, it will get submitted.


> > +		if (IS_ERR(page)) {
> > +			bio_release_pages(bio);
> > +			bio_put(bio);
> > +			if (atomic_read(bio_count) == 1)
> > +				return PTR_ERR(page);
> 
> Does this mean that a fault part-way through an IO that exits from  
> here won't update ki_bytes and so the eventual aio_complete() will be  
> for the full requested IO rather than the partial IOs that completed?

If nothing is submitted, then it's an easy case.  But if some bio has
already been submitted, yes, the error path lacks backing out the bytes
accumulated for the current vector, because it won't get submitted.


> > +			if (++seg < nr_segs) {
> > +				addr = (unsigned long) iov[seg].iov_base;
> > +				count = iov[seg].iov_len;
> 
> Am I missing where the alignment of the iovec elements beyond the  
> first are being checked?

It is checked immediately below the addr and count assignment:

+				if (!(addr & blocksize_mask ||
+				      count & blocksize_mask))
+					goto same_bio;

Looking closely, though, these checks are not complete ...


> Backing up a bit, I'm not sure it's worth giving ourselves the  
> opportunity to make all these mistakes by copying this code over.   
> 2%, on an extremely aggressive IO subsystem?

IMO, it's worth the effort, because 2% just from DIO is significant.

- Ken

 


* Re: [rfc patch] optimize o_direct on block device
From: Chris Mason @ 2006-12-01 15:36 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Zach Brown', Andrew Morton, linux-kernel

On Thu, Nov 30, 2006 at 10:16:53PM -0800, Chen, Kenneth W wrote:
> Zach Brown wrote on Thursday, November 30, 2006 1:45 PM
> > > At that time, a patch was written for raw device to demonstrate that
> > > large performance head room is achievable (at ~20% speedup for micro-
> > > benchmark and ~2% for db transaction processing benchmark) with a  
> > > tight I/O submission processing loop.
> > 
> > Where exactly does the benefit come from?  icache misses?  "atomic"  
> > ops leading to pipeline flushes?
> 
> It benefit from shorter path length. It takes much shorter time to process
> one I/O request, both in the submit and completion path.  I always think in
> terms of how many instructions, or clock ticks does it take to convert user
> request into bio, submit it and in the return path, to process the bio call
> back function and do the appropriate io completion (sync or async).  The
> stock 2.6.19 kernel takes about 5.17 micro-seconds to process one 4K aligned
> DIO (just the submit and completion path, less disk I/O latency).  With the
> patch, the time is reduced to 4.26 us.

I'm not completely against a minimal DIO implementation for the block
device, but right now we get block device QA for free when we test the
rest of the DIO code.  Splitting the code base makes DIO (already a
special case) that much harder to test.

It's obvious there's a lot less code in your patch than fs/direct-io.c,
but I'm still interested in which part of the fs/direct-io.c path is
taking the most time.  I would guess it is allocating the dio?

I don't think we should cut out fs/direct-io.c until we understand
exactly where the hit is coming from.  I know you've done lots of
instrumentation already, but can you share some percentages on the hot
paths?

-chris


* Re: [rfc patch] optimize o_direct on block device
From: Zach Brown @ 2006-12-01 19:21 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Andrew Morton, linux-kernel, Chris Mason


On Nov 30, 2006, at 10:16 PM, Chen, Kenneth W wrote:

> Zach Brown wrote on Thursday, November 30, 2006 1:45 PM
>>> At that time, a patch was written for raw device to demonstrate that
>>> large performance head room is achievable (at ~20% speedup for
>>> micro-benchmark and ~2% for db transaction processing benchmark)
>>> with a tight I/O submission processing loop.
>>
>> Where exactly does the benefit come from?  icache misses?  "atomic"
>> ops leading to pipeline flushes?
>
> It benefit from shorter path length. It takes much shorter time to
> process one I/O request, both in the submit and completion path.  I
> always think in terms of how many instructions, or clock ticks does it
> take to convert user request into bio, submit it and in the return
> path, to process the bio call back function and do the appropriate io
> completion (sync or async).

Sure.

What I'm hoping for is an understanding of what exactly the path is  
doing with those cycles.  Do we have any more detailed measurements  
than, say, get_cycles() before and after the call?

Maybe it's time for me to have a good sit down with systemtap :)

- z


* RE: [rfc patch] optimize o_direct on block device
From: Chen, Kenneth W @ 2006-12-01 20:03 UTC (permalink / raw)
  To: 'Chris Mason'; +Cc: 'Zach Brown', Andrew Morton, linux-kernel

Chris Mason wrote on Friday, December 01, 2006 7:37 AM
> > It benefit from shorter path length. It takes much shorter time to process
> > one I/O request, both in the submit and completion path.  I always think in
> > terms of how many instructions, or clock ticks does it take to convert user
> > request into bio, submit it and in the return path, to process the bio call
> > back function and do the appropriate io completion (sync or async).  The
> > stock 2.6.19 kernel takes about 5.17 micro-seconds to process one 4K aligned
> > DIO (just the submit and completion path, less disk I/O latency).  With the
> > patch, the time is reduced to 4.26 us.
> 
> I'm not completely against a minimal DIO implementation for the block
> device, but right now we get block device QA for free when we test the
> rest of the DIO code.  Splitting the code base makes DIO (already a
> special case) that much harder to test.
> 
> It's obvious there's a lot less code in your patch than fs/direct-io.c,
> but I'm still interested in which part of the fs/direct-io.c path is
> taking the most time.  I would guess it is allocating the dio?
> 
> I don't think we should cut out fs/direct-io.c until we understand
> exactly where the hit is coming from.  I know you've done lots of
> instrumentation already, but can you share some percentages on the hot
> paths?

It's everywhere in the DIO submit and completion path; here is a profile
taken from a recent measurement:

                        Rank   %
__blockdev_direct_IO    9      3.09%
dio_bio_end_io          12     2.13%
dio_bio_complete        19     0.95%
finished_one_bio        34     0.55%
dio_get_page            76     0.22%
dio_bio_submit          96     0.19%
dio_bio_add_page       101     0.17%
dio_complete           115     0.16%
dio_new_bio            125     0.14%
dio_send_cur_page      150     0.10%
dio_bio_end_aio        201     0.06%

The compiler inlines direct_io_worker into __blockdev_direct_IO, so that
function shows up at the top of this list.  The "rank" field indicates the
hotness of a function (rank 1 is the hottest); the "%" field is the
percentage of clock ticks spent in the respective function.

Looking at this profile, I see that the submit path is clearly
heavyweight.  dio_bio_end_io may be a bit skewed because of the wake-up
function called inside spin_lock_irqsave(): the profiler is not capable
of measuring accurate clock ticks in code running with irqs off, so
everything gets accumulated at the point where irqs are re-enabled.

- Ken

