* [RFC] [PATCH] Btrfs: improve fsync/osync write performance
@ 2009-03-31 5:18 Hisashi Hifumi
2009-03-31 11:27 ` Chris Mason
2009-04-01 15:17 ` Chris Mason
0 siblings, 2 replies; 7+ messages in thread
From: Hisashi Hifumi @ 2009-03-31 5:18 UTC (permalink / raw)
To: chris.mason; +Cc: linux-btrfs
Hi Chris.
I noticed that the performance of fsync() and write() with the O_SYNC flag on
Btrfs is very slow compared to ext3/4. I used blktrace to investigate the
cause. One cause is that the unplug is done by kblockd even when the I/O is
issued through fsync() or write() with O_SYNC. kblockd's unplug timeout is
3 msec, so unplugging via kblockd can hurt I/O response time. To improve
fsync/O_SYNC write performance, the unplug should happen sooner.
Btrfs's write I/O is issued from a kernel thread, not from the user
application context that calls fsync(). While waiting for page writeback,
wait_on_page_writeback() sometimes cannot unplug the I/O on Btrfs, because
submit_bio is not called from the user application context; when submit_bio
is called from a kernel thread, wait_on_page_writeback() simply sleeps in
io_schedule().
The following patch introduces btrfs_wait_on_page_writeback(), a replacement
for wait_on_page_writeback() on Btrfs. It unplugs the queue once per tick
while waiting for page writeback.
I did a performance test using sysbench:
# sysbench --num-threads=4 --max-requests=10000 --test=fileio --file-num=1
--file-block-size=4K --file-total-size=128M --file-test-mode=rndwr
--file-fsync-freq=5 run
The result was:
-2.6.29
Test execution summary:
total time: 628.1047s
total number of events: 10000
total time taken by event execution: 413.0834
per-request statistics:
min: 0.0000s
avg: 0.0413s
max: 1.9075s
approx. 95 percentile: 0.3712s
Threads fairness:
events (avg/stddev): 2500.0000/29.21
execution time (avg/stddev): 103.2708/4.04
-2.6.29-patched
Test execution summary:
total time: 579.8049s
total number of events: 10004
total time taken by event execution: 355.3098
per-request statistics:
min: 0.0000s
avg: 0.0355s
max: 1.7670s
approx. 95 percentile: 0.3154s
Threads fairness:
events (avg/stddev): 2501.0000/8.03
execution time (avg/stddev): 88.8274/1.94
This patch yields some performance improvement.
I think there are other reasons, which should also be fixed, why fsync() and
write() with O_SYNC are slow on Btrfs.
Thanks.
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
diff -Nrup linux-2.6.29.org/fs/btrfs/ctree.h linux-2.6.29.btrfs/fs/btrfs/ctree.h
--- linux-2.6.29.org/fs/btrfs/ctree.h 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/ctree.h 2009-03-24 16:48:36.000000000 +0900
@@ -1703,6 +1703,14 @@ static inline struct dentry *fdentry(str
return file->f_path.dentry;
}
+extern void btrfs_wait_on_page_bit(struct page *page);
+
+static inline void btrfs_wait_on_page_writeback(struct page *page)
+{
+ if (PageWriteback(page))
+ btrfs_wait_on_page_bit(page);
+}
+
/* extent-tree.c */
int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len);
int btrfs_lookup_extent_ref(struct btrfs_trans_handle *trans,
diff -Nrup linux-2.6.29.org/fs/btrfs/extent-tree.c linux-2.6.29.btrfs/fs/btrfs/extent-tree.c
--- linux-2.6.29.org/fs/btrfs/extent-tree.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/extent-tree.c 2009-03-24 15:34:12.000000000 +0900
@@ -4529,7 +4529,7 @@ again:
goto out_unlock;
}
}
- wait_on_page_writeback(page);
+ btrfs_wait_on_page_writeback(page);
page_start = (u64)page->index << PAGE_CACHE_SHIFT;
page_end = page_start + PAGE_CACHE_SIZE - 1;
diff -Nrup linux-2.6.29.org/fs/btrfs/extent_io.c linux-2.6.29.btrfs/fs/btrfs/extent_io.c
--- linux-2.6.29.org/fs/btrfs/extent_io.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/extent_io.c 2009-03-24 15:34:30.000000000 +0900
@@ -2423,7 +2423,7 @@ retry:
if (wbc->sync_mode != WB_SYNC_NONE) {
if (PageWriteback(page))
flush_fn(data);
- wait_on_page_writeback(page);
+ btrfs_wait_on_page_writeback(page);
}
if (PageWriteback(page) ||
diff -Nrup linux-2.6.29.org/fs/btrfs/file.c linux-2.6.29.btrfs/fs/btrfs/file.c
--- linux-2.6.29.org/fs/btrfs/file.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/file.c 2009-03-24 15:34:49.000000000 +0900
@@ -967,7 +967,7 @@ again:
err = -ENOMEM;
BUG_ON(1);
}
- wait_on_page_writeback(pages[i]);
+ btrfs_wait_on_page_writeback(pages[i]);
}
if (start_pos < inode->i_size) {
struct btrfs_ordered_extent *ordered;
diff -Nrup linux-2.6.29.org/fs/btrfs/inode.c linux-2.6.29.btrfs/fs/btrfs/inode.c
--- linux-2.6.29.org/fs/btrfs/inode.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/inode.c 2009-03-24 15:35:23.000000000 +0900
@@ -2733,7 +2733,7 @@ again:
goto out_unlock;
}
}
- wait_on_page_writeback(page);
+ btrfs_wait_on_page_writeback(page);
lock_extent(io_tree, page_start, page_end, GFP_NOFS);
set_page_extent_mapped(page);
@@ -4240,7 +4240,7 @@ static void btrfs_invalidatepage(struct
u64 page_start = page_offset(page);
u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
- wait_on_page_writeback(page);
+ btrfs_wait_on_page_writeback(page);
tree = &BTRFS_I(page->mapping->host)->io_tree;
if (offset) {
btrfs_releasepage(page, GFP_NOFS);
@@ -4322,7 +4322,7 @@ again:
/* page got truncated out from underneath us */
goto out_unlock;
}
- wait_on_page_writeback(page);
+ btrfs_wait_on_page_writeback(page);
lock_extent(io_tree, page_start, page_end, GFP_NOFS);
set_page_extent_mapped(page);
diff -Nrup linux-2.6.29.org/fs/btrfs/ioctl.c linux-2.6.29.btrfs/fs/btrfs/ioctl.c
--- linux-2.6.29.org/fs/btrfs/ioctl.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/ioctl.c 2009-03-24 15:35:46.000000000 +0900
@@ -400,7 +400,7 @@ again:
}
}
- wait_on_page_writeback(page);
+ btrfs_wait_on_page_writeback(page);
page_start = (u64)page->index << PAGE_CACHE_SHIFT;
page_end = page_start + PAGE_CACHE_SIZE - 1;
diff -Nrup linux-2.6.29.org/fs/btrfs/ordered-data.c linux-2.6.29.btrfs/fs/btrfs/ordered-data.c
--- linux-2.6.29.org/fs/btrfs/ordered-data.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/ordered-data.c 2009-03-25 11:04:32.000000000 +0900
@@ -21,6 +21,7 @@
#include <linux/blkdev.h>
#include <linux/writeback.h>
#include <linux/pagevec.h>
+#include <linux/hash.h>
#include "ctree.h"
#include "transaction.h"
#include "btrfs_inode.h"
@@ -673,6 +674,46 @@ int btrfs_fdatawrite_range(struct addres
return btrfs_writepages(mapping, &wbc);
}
+static void process_timeout(unsigned long __data)
+{
+ wake_up_process((struct task_struct *)__data);
+}
+
+static int btrfs_sync_page(void *word)
+{
+ struct address_space *mapping;
+ struct page *page;
+ struct timer_list timer;
+
+ page = container_of((unsigned long *)word, struct page, flags);
+
+ smp_mb();
+ mapping = page->mapping;
+ if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
+ mapping->a_ops->sync_page(page);
+ setup_timer(&timer, process_timeout, (unsigned long)current);
+ __mod_timer(&timer, jiffies + 1);
+ io_schedule();
+ del_timer_sync(&timer);
+ return 0;
+}
+
+static wait_queue_head_t *page_waitqueue(struct page *page)
+{
+ const struct zone *zone = page_zone(page);
+
+ return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
+}
+
+void btrfs_wait_on_page_bit(struct page *page)
+{
+ DEFINE_WAIT_BIT(wait, &page->flags, PG_writeback);
+
+ if (test_bit(PG_writeback, &page->flags))
+ __wait_on_bit(page_waitqueue(page), &wait, btrfs_sync_page,
+ TASK_UNINTERRUPTIBLE);
+}
+
/**
* taken from mm/filemap.c because it isn't exported
*
@@ -710,7 +751,7 @@ int btrfs_wait_on_page_writeback_range(s
if (page->index > end)
continue;
- wait_on_page_writeback(page);
+ btrfs_wait_on_page_writeback(page);
if (PageError(page))
ret = -EIO;
}
diff -Nrup linux-2.6.29.org/fs/btrfs/transaction.c linux-2.6.29.btrfs/fs/btrfs/transaction.c
--- linux-2.6.29.org/fs/btrfs/transaction.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/transaction.c 2009-03-24 15:37:19.000000000 +0900
@@ -352,7 +352,7 @@ int btrfs_write_and_wait_marked_extents(
if (PageWriteback(page)) {
if (PageDirty(page))
- wait_on_page_writeback(page);
+ btrfs_wait_on_page_writeback(page);
else {
unlock_page(page);
page_cache_release(page);
@@ -380,12 +380,12 @@ int btrfs_write_and_wait_marked_extents(
continue;
if (PageDirty(page)) {
btree_lock_page_hook(page);
- wait_on_page_writeback(page);
+ btrfs_wait_on_page_writeback(page);
err = write_one_page(page, 0);
if (err)
werr = err;
}
- wait_on_page_writeback(page);
+ btrfs_wait_on_page_writeback(page);
page_cache_release(page);
cond_resched();
}
^ permalink raw reply [flat|nested] 7+ messages in thread

* Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
2009-03-31 5:18 [RFC] [PATCH] Btrfs: improve fsync/osync write performance Hisashi Hifumi
@ 2009-03-31 11:27 ` Chris Mason
2009-04-02 2:02 ` Hisashi Hifumi
2009-04-01 15:17 ` Chris Mason
1 sibling, 1 reply; 7+ messages in thread
From: Chris Mason @ 2009-03-31 11:27 UTC (permalink / raw)
To: Hisashi Hifumi; +Cc: linux-btrfs
On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
> Hi Chris.
>
> I noticed that the performance of fsync() and write() with the O_SYNC flag
> on Btrfs is very slow compared to ext3/4. I used blktrace to investigate
> the cause. One cause is that the unplug is done by kblockd even when the
> I/O is issued through fsync() or write() with O_SYNC. kblockd's unplug
> timeout is 3 msec, so unplugging via kblockd can hurt I/O response time.
> To improve fsync/O_SYNC write performance, the unplug should happen sooner.
>
> Btrfs's write I/O is issued from a kernel thread, not from the user
> application context that calls fsync(). While waiting for page writeback,
> wait_on_page_writeback() sometimes cannot unplug the I/O on Btrfs, because
> submit_bio is not called from the user application context; when submit_bio
> is called from a kernel thread, wait_on_page_writeback() simply sleeps in
> io_schedule().
>
This is exactly right, and one of the uglier side effects of the async
helper kernel threads. I've been thinking for a while about a clean way
to fix it.
> The following patch introduces btrfs_wait_on_page_writeback(), a
> replacement for wait_on_page_writeback() on Btrfs. It unplugs the queue
> once per tick while waiting for page writeback.
>
> I did a performance test using sysbench:
>
> # sysbench --num-threads=4 --max-requests=10000 --test=fileio --file-num=1
> --file-block-size=4K --file-total-size=128M --file-test-mode=rndwr
> --file-fsync-freq=5 run
>
> The result was:
> -2.6.29
>
> Test execution summary:
> total time: 628.1047s
> total number of events: 10000
> total time taken by event execution: 413.0834
> per-request statistics:
> min: 0.0000s
> avg: 0.0413s
> max: 1.9075s
> approx. 95 percentile: 0.3712s
>
> Threads fairness:
> events (avg/stddev): 2500.0000/29.21
> execution time (avg/stddev): 103.2708/4.04
>
>
> -2.6.29-patched
>
> Test execution summary:
> total time: 579.8049s
> total number of events: 10004
> total time taken by event execution: 355.3098
> per-request statistics:
> min: 0.0000s
> avg: 0.0355s
> max: 1.7670s
> approx. 95 percentile: 0.3154s
>
> Threads fairness:
> events (avg/stddev): 2501.0000/8.03
> execution time (avg/stddev): 88.8274/1.94
>
>
> This patch yields some performance improvement.
>
> I think there are other reasons, which should also be fixed, why fsync()
> and write() with O_SYNC are slow on Btrfs.
>
Very nice. Could I trouble you to try one more experiment? The other way
to fix this is to use WRITE_SYNC instead of WRITE. Could you please
hardcode WRITE_SYNC in the btrfs submit_bio paths and benchmark that?
It doesn't cover as many cases as your patch, but it might have a lower
overall impact.
-chris
* Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
2009-03-31 11:27 ` Chris Mason
@ 2009-04-02 2:02 ` Hisashi Hifumi
0 siblings, 0 replies; 7+ messages in thread
From: Hisashi Hifumi @ 2009-04-02 2:02 UTC (permalink / raw)
To: Chris Mason; +Cc: linux-btrfs
At 20:27 09/03/31, Chris Mason wrote:
>On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
>> Hi Chris.
>>
>> I noticed that the performance of fsync() and write() with the O_SYNC flag
>> on Btrfs is very slow compared to ext3/4. I used blktrace to investigate
>> the cause. One cause is that the unplug is done by kblockd even when the
>> I/O is issued through fsync() or write() with O_SYNC. kblockd's unplug
>> timeout is 3 msec, so unplugging via kblockd can hurt I/O response time.
>> To improve fsync/O_SYNC write performance, the unplug should happen sooner.
>>
>
>> Btrfs's write I/O is issued from a kernel thread, not from the user
>> application context that calls fsync(). While waiting for page writeback,
>> wait_on_page_writeback() sometimes cannot unplug the I/O on Btrfs, because
>> submit_bio is not called from the user application context; when submit_bio
>> is called from a kernel thread, wait_on_page_writeback() simply sleeps in
>> io_schedule().
>>
>
>This is exactly right, and one of the uglier side effects of the async
>helper kernel threads. I've been thinking for a while about a clean way
>to fix it.
>
>> The following patch introduces btrfs_wait_on_page_writeback(), a
>> replacement for wait_on_page_writeback() on Btrfs. It unplugs the queue
>> once per tick while waiting for page writeback.
>>
>> I did a performance test using the sysbench.
>>
>> # sysbench --num-threads=4 --max-requests=10000 --test=fileio --file-num=1
>> --file-block-size=4K --file-total-size=128M --file-test-mode=rndwr
>> --file-fsync-freq=5 run
>>
>> The result was:
>> -2.6.29
>>
>> Test execution summary:
>> total time: 628.1047s
>> total number of events: 10000
>> total time taken by event execution: 413.0834
>> per-request statistics:
>> min: 0.0000s
>> avg: 0.0413s
>> max: 1.9075s
>> approx. 95 percentile: 0.3712s
>>
>> Threads fairness:
>> events (avg/stddev): 2500.0000/29.21
>> execution time (avg/stddev): 103.2708/4.04
>>
>>
>> -2.6.29-patched
>>
>> Test execution summary:
>> total time: 579.8049s
>> total number of events: 10004
>> total time taken by event execution: 355.3098
>> per-request statistics:
>> min: 0.0000s
>> avg: 0.0355s
>> max: 1.7670s
>> approx. 95 percentile: 0.3154s
>>
>> Threads fairness:
>> events (avg/stddev): 2501.0000/8.03
>> execution time (avg/stddev): 88.8274/1.94
>>
>>
>> This patch has some effect for performance improvement.
>>
>> I think there are other reasons that should be fixed why fsync() or
>> write() with O_SYNC flag is slow on Btrfs.
>>
>
>Very nice. Could I trouble you to try one more experiment? The other
>way to fix this is to use WRITE_SYNC instead of WRITE. Could you
>please hardcode WRITE_SYNC in the btrfs submit_bio paths and benchmark
>that?
>
>It doesn't cover as many cases as your patch, but it might have a lower
>overall impact.
Hi.
I wrote a patch that hardcodes WRITE_SYNC in the btrfs submit_bio paths, as
shown below, and ran the same sysbench test.
Later, I will try your unplug patch.
diff -Nrup linux-2.6.29.org/fs/btrfs/disk-io.c linux-2.6.29.btrfs_sync/fs/btrfs/disk-io.c
--- linux-2.6.29.org/fs/btrfs/disk-io.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/disk-io.c 2009-04-01 16:26:56.000000000 +0900
@@ -2068,7 +2068,7 @@ static int write_dev_supers(struct btrfs
}
if (i == last_barrier && do_barriers && device->barriers) {
- ret = submit_bh(WRITE_BARRIER, bh);
+ ret = submit_bh(WRITE_BARRIER|WRITE_SYNC, bh);
if (ret == -EOPNOTSUPP) {
printk("btrfs: disabling barriers on dev %s\n",
device->name);
@@ -2076,10 +2076,10 @@ static int write_dev_supers(struct btrfs
device->barriers = 0;
get_bh(bh);
lock_buffer(bh);
- ret = submit_bh(WRITE, bh);
+ ret = submit_bh(WRITE_SYNC, bh);
}
} else {
- ret = submit_bh(WRITE, bh);
+ ret = submit_bh(WRITE_SYNC, bh);
}
if (!ret && wait) {
diff -Nrup linux-2.6.29.org/fs/btrfs/extent_io.c linux-2.6.29.btrfs_sync/fs/btrfs/extent_io.c
--- linux-2.6.29.org/fs/btrfs/extent_io.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/extent_io.c 2009-04-01 14:48:08.000000000 +0900
@@ -1851,8 +1851,11 @@ static int submit_one_bio(int rw, struct
if (tree->ops && tree->ops->submit_bio_hook)
tree->ops->submit_bio_hook(page->mapping->host, rw, bio,
mirror_num, bio_flags);
- else
+ else {
+ if (rw & WRITE)
+ rw = WRITE_SYNC;
submit_bio(rw, bio);
+ }
if (bio_flagged(bio, BIO_EOPNOTSUPP))
ret = -EOPNOTSUPP;
bio_put(bio);
diff -Nrup linux-2.6.29.org/fs/btrfs/volumes.c linux-2.6.29.btrfs_sync/fs/btrfs/volumes.c
--- linux-2.6.29.org/fs/btrfs/volumes.c 2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/volumes.c 2009-04-01 16:25:51.000000000 +0900
@@ -195,6 +195,8 @@ loop_lock:
BUG_ON(atomic_read(&cur->bi_cnt) == 0);
bio_get(cur);
+ if (cur->bi_rw & WRITE)
+ cur->bi_rw = WRITE_SYNC;
submit_bio(cur->bi_rw, cur);
bio_put(cur);
num_run++;
@@ -2815,8 +2817,11 @@ int btrfs_map_bio(struct btrfs_root *roo
bio->bi_bdev = dev->bdev;
if (async_submit)
schedule_bio(root, dev, rw, bio);
- else
+ else {
+ if (rw & WRITE)
+ rw = WRITE_SYNC;
submit_bio(rw, bio);
+ }
} else {
bio->bi_bdev = root->fs_info->fs_devices->latest_bdev;
bio->bi_sector = logical >> 9;
# sysbench --num-threads=4 --max-requests=10000 --test=fileio --file-num=1
--file-block-size=4K --file-total-size=128M --file-test-mode=rndwr
--file-fsync-freq=5 run
The result was:
-2.6.29
Test execution summary:
total time: 619.6822s
total number of events: 10003
total time taken by event execution: 403.1020
per-request statistics:
min: 0.0000s
avg: 0.0403s
max: 1.4584s
approx. 95 percentile: 0.3761s
Threads fairness:
events (avg/stddev): 2500.7500/48.48
execution time (avg/stddev): 100.7755/7.92
-2.6.29-WRITE_SYNC-patched
Test execution summary:
total time: 596.8114s
total number of events: 10004
total time taken by event execution: 396.2378
per-request statistics:
min: 0.0000s
avg: 0.0396s
max: 1.6926s
approx. 95 percentile: 0.3434s
Threads fairness:
events (avg/stddev): 2501.0000/58.28
execution time (avg/stddev): 99.0595/2.84
* Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
2009-03-31 5:18 [RFC] [PATCH] Btrfs: improve fsync/osync write performance Hisashi Hifumi
2009-03-31 11:27 ` Chris Mason
@ 2009-04-01 15:17 ` Chris Mason
2009-04-01 17:01 ` Jens Axboe
2009-04-02 6:25 ` Hisashi Hifumi
1 sibling, 2 replies; 7+ messages in thread
From: Chris Mason @ 2009-04-01 15:17 UTC (permalink / raw)
To: Hisashi Hifumi; +Cc: linux-btrfs
On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
> Hi Chris.
>
> I noticed performance of fsync() and write() with O_SYNC flag on Btrfs is
> very slow as compared to ext3/4. I used blktrace to try to investigate the
> cause of this. One of cause is that unplug is done by kblockd even if the I/O is
> issued through fsync() or write() with O_SYNC flag. kblockd's unplug timeout
> is 3msec, so unplug via blockd can decrease I/O response. To increase
> fsync/osync write performance, speeding up unplug should be done here.
>
I realized today that all of the async thread handling btrfs does for
writes gives us plenty of time to queue up IO for the block device. If
that's true, we can just unplug the block device in the async helper thread
and get pretty good coverage for the problem you're describing.
Could you please try the patch below and see if it performs well? I did
some O_DIRECT testing on a 5 drive array, and throughput jumped from
386MB/s to 450MB/s for large writes.
Thanks again for digging through this problem.
-chris
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd06e18..bf377ab 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -146,7 +146,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
unsigned long num_run = 0;
unsigned long limit;
- bdi = device->bdev->bd_inode->i_mapping->backing_dev_info;
+ bdi = blk_get_backing_dev_info(device->bdev);
fs_info = device->dev_root->fs_info;
limit = btrfs_async_submit_limit(fs_info);
limit = limit * 2 / 3;
@@ -231,6 +231,19 @@ loop_lock:
if (device->pending_bios)
goto loop_lock;
spin_unlock(&device->io_lock);
+
+ /*
+ * IO has already been through a long path to get here. Checksumming,
+ * async helper threads, perhaps compression. We've done a pretty
+ * good job of collecting a batch of IO and should just unplug
+ * the device right away.
+ *
+ * This will help anyone who is waiting on the IO, they might have
+ * already unplugged, but managed to do so before the bio they
+ * cared about found its way down here.
+ */
+ if (bdi->unplug_io_fn)
+ bdi->unplug_io_fn(bdi, NULL);
done:
return 0;
}
* Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
2009-04-01 15:17 ` Chris Mason
@ 2009-04-01 17:01 ` Jens Axboe
2009-04-02 6:25 ` Hisashi Hifumi
1 sibling, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2009-04-01 17:01 UTC (permalink / raw)
To: Chris Mason; +Cc: Hisashi Hifumi, linux-btrfs
On Wed, Apr 01 2009, Chris Mason wrote:
> On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
> > Hi Chris.
> >
> > I noticed performance of fsync() and write() with O_SYNC flag on Btrfs is
> > very slow as compared to ext3/4. I used blktrace to try to investigate the
> > cause of this. One of cause is that unplug is done by kblockd even if the I/O is
> > issued through fsync() or write() with O_SYNC flag. kblockd's unplug timeout
> > is 3msec, so unplug via blockd can decrease I/O response. To increase
> > fsync/osync write performance, speeding up unplug should be done here.
> >
>
> I realized today that all of the async thread handling btrfs does for
> writes gives us plenty of time to queue up IO for the block device. If
> that's true, we can just unplug the block device in the async helper thread
> and get pretty good coverage for the problem you're describing.
>
> Could you please try the patch below and see if it performs well? I did
> some O_DIRECT testing on a 5 drive array, and throughput jumped from
> 386MB/s to 450MB/s for large writes.
>
> Thanks again for digging through this problem.
>
> -chris
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dd06e18..bf377ab 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -146,7 +146,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
> unsigned long num_run = 0;
> unsigned long limit;
>
> - bdi = device->bdev->bd_inode->i_mapping->backing_dev_info;
> + bdi = blk_get_backing_dev_info(device->bdev);
> fs_info = device->dev_root->fs_info;
> limit = btrfs_async_submit_limit(fs_info);
> limit = limit * 2 / 3;
> @@ -231,6 +231,19 @@ loop_lock:
> if (device->pending_bios)
> goto loop_lock;
> spin_unlock(&device->io_lock);
> +
> + /*
> + * IO has already been through a long path to get here. Checksumming,
> + * async helper threads, perhaps compression. We've done a pretty
> + * good job of collecting a batch of IO and should just unplug
> + * the device right away.
> + *
> + * This will help anyone who is waiting on the IO, they might have
> + * already unplugged, but managed to do so before the bio they
> + * cared about found its way down here.
> + */
> + if (bdi->unplug_io_fn)
> + bdi->unplug_io_fn(bdi, NULL);
blk_run_backing_dev(bdi, NULL);
:-)
--
Jens Axboe
* Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
2009-04-01 15:17 ` Chris Mason
2009-04-01 17:01 ` Jens Axboe
@ 2009-04-02 6:25 ` Hisashi Hifumi
2009-04-02 11:25 ` Chris Mason
1 sibling, 1 reply; 7+ messages in thread
From: Hisashi Hifumi @ 2009-04-02 6:25 UTC (permalink / raw)
To: Chris Mason; +Cc: linux-btrfs
At 00:17 09/04/02, Chris Mason wrote:
>On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
>> Hi Chris.
>>
>> I noticed that the performance of fsync() and write() with the O_SYNC flag
>> on Btrfs is very slow compared to ext3/4. I used blktrace to investigate
>> the cause. One cause is that the unplug is done by kblockd even when the
>> I/O is issued through fsync() or write() with O_SYNC. kblockd's unplug
>> timeout is 3 msec, so unplugging via kblockd can hurt I/O response time.
>> To improve fsync/O_SYNC write performance, the unplug should happen sooner.
>>
>
>I realized today that all of the async thread handling btrfs does for
>writes gives us plenty of time to queue up IO for the block device. If
>that's true, we can just unplug the block device in the async helper thread
>and get pretty good coverage for the problem you're describing.
>
>Could you please try the patch below and see if it performs well? I did
>some O_DIRECT testing on a 5 drive array, and throughput jumped from 386MB/s
>to 450MB/s for large writes.
>
>Thanks again for digging through this problem.
>
>-chris
>
>diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>index dd06e18..bf377ab 100644
>--- a/fs/btrfs/volumes.c
>+++ b/fs/btrfs/volumes.c
>@@ -146,7 +146,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
> unsigned long num_run = 0;
> unsigned long limit;
>
>- bdi = device->bdev->bd_inode->i_mapping->backing_dev_info;
>+ bdi = blk_get_backing_dev_info(device->bdev);
> fs_info = device->dev_root->fs_info;
> limit = btrfs_async_submit_limit(fs_info);
> limit = limit * 2 / 3;
>@@ -231,6 +231,19 @@ loop_lock:
> if (device->pending_bios)
> goto loop_lock;
> spin_unlock(&device->io_lock);
>+
>+ /*
>+ * IO has already been through a long path to get here. Checksumming,
>+ * async helper threads, perhaps compression. We've done a pretty
>+ * good job of collecting a batch of IO and should just unplug
>+ * the device right away.
>+ *
>+ * This will help anyone who is waiting on the IO, they might have
>+ * already unplugged, but managed to do so before the bio they
>+ * cared about found its way down here.
>+ */
>+ if (bdi->unplug_io_fn)
>+ bdi->unplug_io_fn(bdi, NULL);
> done:
> return 0;
> }
I tested your unplug patch.
# sysbench --num-threads=4 --max-requests=10000 --test=fileio --file-num=1
--file-block-size=4K --file-total-size=128M --file-test-mode=rndwr
--file-fsync-freq=5 run
-2.6.29
Test execution summary:
total time: 626.9416s
total number of events: 10004
total time taken by event execution: 442.5869
per-request statistics:
min: 0.0000s
avg: 0.0442s
max: 1.4229s
approx. 95 percentile: 0.3959s
Threads fairness:
events (avg/stddev): 2501.0000/73.43
execution time (avg/stddev): 110.6467/7.15
-2.6.29-patched
Operations performed: 0 Read, 10003 Write, 1996 Other = 11999 Total
Read 0b Written 39.074Mb Total transferred 39.074Mb (68.269Kb/sec)
17.07 Requests/sec executed
Test execution summary:
total time: 586.0944s
total number of events: 10003
total time taken by event execution: 347.5348
per-request statistics:
min: 0.0000s
avg: 0.0347s
max: 2.2546s
approx. 95 percentile: 0.3090s
Threads fairness:
events (avg/stddev): 2500.7500/54.98
execution time (avg/stddev): 86.8837/3.06
We can get some performance improvement from this patch.
What about the case of write() without O_SYNC?
I am concerned about losing block-layer optimizations (merging, sorting)
when the I/O is not an fsync or O_SYNC write.
* Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
2009-04-02 6:25 ` Hisashi Hifumi
@ 2009-04-02 11:25 ` Chris Mason
0 siblings, 0 replies; 7+ messages in thread
From: Chris Mason @ 2009-04-02 11:25 UTC (permalink / raw)
To: Hisashi Hifumi; +Cc: linux-btrfs
On Thu, 2009-04-02 at 15:25 +0900, Hisashi Hifumi wrote:
> I tested your unplug patch.
>
> # sysbench --num-threads=4 --max-requests=10000 --test=fileio --file-num=1
> --file-block-size=4K --file-total-size=128M --file-test-mode=rndwr
> --file-fsync-freq=5 run
>
> -2.6.29
> Test execution summary:
> total time: 626.9416s
> total number of events: 10004
> total time taken by event execution: 442.5869
> per-request statistics:
> min: 0.0000s
> avg: 0.0442s
> max: 1.4229s
> approx. 95 percentile: 0.3959s
>
> Threads fairness:
> events (avg/stddev): 2501.0000/73.43
> execution time (avg/stddev): 110.6467/7.15
> -2.6.29-patched
> Operations performed: 0 Read, 10003 Write, 1996 Other = 11999 Total
> Read 0b Written 39.074Mb Total transferred 39.074Mb (68.269Kb/sec)
> 17.07 Requests/sec executed
>
> Test execution summary:
> total time: 586.0944s
> total number of events: 10003
> total time taken by event execution: 347.5348
> per-request statistics:
> min: 0.0000s
> avg: 0.0347s
> max: 2.2546s
> approx. 95 percentile: 0.3090s
>
> Threads fairness:
> events (avg/stddev): 2500.7500/54.98
> execution time (avg/stddev): 86.8837/3.06
>
Very nice.
>
> We can get some performance improvement from this patch.
> What about the case of write() without O_SYNC?
> I am concerned about losing block-layer optimizations (merging, sorting)
> when the I/O is not an fsync or O_SYNC write.
The performance should still be good for normal workloads, mostly because
the async threads already try to batch the IO. Basically what happens is
that a bio is first sent to the checksumming threads, which do a bunch of
checksums but still queue things in such a way that the IO is sent down in
order. This takes some time.
Then the bios are put into the submit-bio thread pool, which wakes up a
different process to send them down. There might be slightly less merging
than before, but it should still give the elevator enough bios to work
with.
-chris