* [PATCH] Improve buffered streaming write ordering
@ 2008-10-01 18:40 Chris Mason
2008-10-02 4:52 ` Andrew Morton
2008-10-02 18:08 ` Aneesh Kumar K.V
0 siblings, 2 replies; 24+ messages in thread
From: Chris Mason @ 2008-10-01 18:40 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Hello everyone,
write_cache_pages can use the address_space writeback_index field to
try to pick up where it left off between calls. pdflush and
balance_dirty_pages both enable this mode in hopes of having writeback
walk evenly down the file instead of just servicing pages at the
start of the address space.
But there is no locking around this field, and concurrent callers of
write_cache_pages on the same inode can get some very strange results.
pdflush uses the writeback_acquire function to make sure that only one
pdflush process is servicing a given backing device, but
balance_dirty_pages does not.
When there are a small number of dirty inodes in the system,
balance_dirty_pages is likely to run in parallel with pdflush on one or
two of them, leading to somewhat random updates of the writeback_index
field in struct address_space.
The end result is very seeky writeback during streaming IO. A 4 drive
hardware raid0 array here can do 317MB/s streaming O_DIRECT writes on
ext4. This is creating a new file, so O_DIRECT is really just a way to
bypass write_cache_pages.
If I do buffered writes instead, XFS does 205MB/s, and ext4 clocks in at
81.7MB/s. Looking at the buffered IO traces for each one, we can see a
lot of seeks.
http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-nopatch.png
http://oss.oracle.com/~mason/bugs/writeback_ordering/xfs-nopatch.png
The patch below changes write_cache_pages to only use writeback_index
when current_is_pdflush(). The basic idea is that pdflush is the only
one who has concurrency control against the bdi, so it is the only one
who can safely use and update writeback_index.
The performance changes quite a bit:
            patched     unpatched
XFS         247MB/s     205MB/s
Ext4        246MB/s     81.7MB/s
The graphs after the patch:
http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-patched.png
http://oss.oracle.com/~mason/bugs/writeback_ordering/xfs-patched.png
The ext4 graph really does look strange. What's happening there is the
lazy inode table init has dirtied a whole bunch of pages on the block
device inode. I don't have much of an answer for why my patch makes all
of this writeback happen up front, other than that writeback_index is no
longer bouncing all over the address space.
It is also worth noting that before the patch, filefrag shows ext4 using
about 4000 extents on the file. After the patch it is around 400. XFS
uses 2 extents both patched and unpatched.
This is just one benchmark, and I'm not convinced this patch is right.
The ordering of pdflush vs balance_dirty_pages is very tricky, so I
definitely think we need more thought on this one.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 24de8b6..d799f03 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -884,7 +884,11 @@ int write_cache_pages(struct address_space *mapping,
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
+		/* start from previous offset done by pdflush */
+		if (current_is_pdflush())
+			index = mapping->writeback_index;
+		else
+			index = 0;
 		end = -1;
 	} else {
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
@@ -958,7 +962,8 @@ retry:
 		index = 0;
 		goto retry;
 	}
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
+	if (current_is_pdflush() &&
+	    (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)))
 		mapping->writeback_index = index;
 	if (wbc->range_cont)
^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] Improve buffered streaming write ordering
  2008-10-01 18:40 [PATCH] Improve buffered streaming write ordering Chris Mason
@ 2008-10-02  4:52 ` Andrew Morton
  2008-10-02 12:20   ` Chris Mason
  2008-10-02 18:08 ` Aneesh Kumar K.V
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2008-10-02 4:52 UTC (permalink / raw)
To: Chris Mason; +Cc: linux-kernel, linux-fsdevel

On Wed, 01 Oct 2008 14:40:51 -0400 Chris Mason <chris.mason@oracle.com> wrote:

> The patch below changes write_cache_pages to only use writeback_index
> when current_is_pdflush().  The basic idea is that pdflush is the only
> one who has concurrency control against the bdi, so it is the only one
> who can safely use and update writeback_index.

Another approach would be to only update mapping->writeback_index if
nobody else altered it meanwhile.

That being said, I don't really see why we get lots of seekiness when
two threads start writing the file from the same offset.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering
  2008-10-02  4:52 ` Andrew Morton
@ 2008-10-02 12:20   ` Chris Mason
  2008-10-02 16:12     ` Chris Mason
  2008-10-02 18:18     ` Aneesh Kumar K.V
  0 siblings, 2 replies; 24+ messages in thread
From: Chris Mason @ 2008-10-02 12:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> On Wed, 01 Oct 2008 14:40:51 -0400 Chris Mason <chris.mason@oracle.com> wrote:
>
> > The patch below changes write_cache_pages to only use writeback_index
> > when current_is_pdflush().  The basic idea is that pdflush is the only
> > one who has concurrency control against the bdi, so it is the only one
> > who can safely use and update writeback_index.
>
> Another approach would be to only update mapping->writeback_index if
> nobody else altered it meanwhile.

Ok, I can give that a shot.

> That being said, I don't really see why we get lots of seekiness when
> two threads start writing the file from the same offset.

For metadata, it makes sense.  Pages get dirtied in strange order, and
if writeback_index is jumping around, we'll get the seeky metadata
writeback.

Data makes less sense, especially the very high extent count from ext4.
An extra printk shows that ext4 is calling redirty_page_for_writepage
quite a bit in ext4_da_writepage.  This should be enough to make us jump
around in the file.

For a 4.5GB streaming buffered write, this printk inside
ext4_da_writepage shows up 372,429 times in /var/log/messages.

	if (page_has_buffers(page)) {
		page_bufs = page_buffers(page);
		if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
				      ext4_bh_unmapped_or_delay)) {
			/*
			 * We don't want to do block allocation
			 * So redirty the page and return
			 * We may reach here when we do a journal commit
			 * via journal_submit_inode_data_buffers.
			 * If we don't have mapping block we just ignore
			 * them. We can also reach here via shrink_page_list
			 */
			redirty_page_for_writepage(wbc, page);

			printk("redirty page %Lu\n", page_offset(page));

			unlock_page(page);
			return 0;
		}
	} else {

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering
  2008-10-02 12:20   ` Chris Mason
@ 2008-10-02 16:12     ` Chris Mason
  2008-10-02 18:18     ` Aneesh Kumar K.V
  1 sibling, 0 replies; 24+ messages in thread
From: Chris Mason @ 2008-10-02 16:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

On Thu, 2008-10-02 at 08:20 -0400, Chris Mason wrote:
> On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > On Wed, 01 Oct 2008 14:40:51 -0400 Chris Mason <chris.mason@oracle.com> wrote:
> >
> > > The patch below changes write_cache_pages to only use writeback_index
> > > when current_is_pdflush().  The basic idea is that pdflush is the only
> > > one who has concurrency control against the bdi, so it is the only one
> > > who can safely use and update writeback_index.
> >
> > Another approach would be to only update mapping->writeback_index if
> > nobody else altered it meanwhile.
>
> Ok, I can give that a shot.

I tried a few variations on letting anyone update writeback_index if it
hadn't changed, including always letting pdflush update it, and only
letting non-pdflush update it when walking forward in the file.

They all performed badly for both xfs and ext4, making me think the real
benefit from my patch comes from making non-pdflush writers start at 0.

So I'm a bit conflicted on this one.  The filesystems could be doing
better, but the current logic in write_cache_pages isn't very
predictable.

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering
  2008-10-02 12:20   ` Chris Mason
  2008-10-02 16:12     ` Chris Mason
@ 2008-10-02 18:18     ` Aneesh Kumar K.V
  2008-10-02 19:44       ` Andrew Morton
  ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread
From: Aneesh Kumar K.V @ 2008-10-02 18:18 UTC (permalink / raw)
To: Chris Mason; +Cc: Andrew Morton, linux-kernel, linux-fsdevel, ext4

On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote:
> On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > On Wed, 01 Oct 2008 14:40:51 -0400 Chris Mason <chris.mason@oracle.com> wrote:
> >
> > > The patch below changes write_cache_pages to only use writeback_index
> > > when current_is_pdflush().  The basic idea is that pdflush is the only
> > > one who has concurrency control against the bdi, so it is the only one
> > > who can safely use and update writeback_index.
> >
> > Another approach would be to only update mapping->writeback_index if
> > nobody else altered it meanwhile.
>
> Ok, I can give that a shot.
>
> > That being said, I don't really see why we get lots of seekiness when
> > two threads start writing the file from the same offset.
>
> For metadata, it makes sense.  Pages get dirtied in strange order, and
> if writeback_index is jumping around, we'll get the seeky metadata
> writeback.
>
> Data makes less sense, especially the very high extent count from ext4.
> An extra printk shows that ext4 is calling redirty_page_for_writepage
> quite a bit in ext4_da_writepage.  This should be enough to make us jump
> around in the file.

We need to start the journal before locking the page with jbd2.  That
prevents us from doing any block allocation in the writepage() callback.
So with ext4/jbd2 we do block allocation only in the writepages()
callback, where we start the journal with the credits needed to write a
single extent.  Then we look for contiguous unallocated logical blocks
and request 'x' blocks from the block allocator.  If we get fewer than
that, the rest of the pages we iterated over in writepages are redirtied
so that we try to allocate them again.  We loop inside ext4_da_writepages
itself, looking at wbc->pages_skipped:

2481         if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) {
2482                 /* We skipped pages in this loop */
2483                 wbc->range_start = range_start;
2484                 wbc->nr_to_write = to_write +

> For a 4.5GB streaming buffered write, this printk inside
> ext4_da_writepage shows up 372,429 times in /var/log/messages.

Part of that can happen due to shrink_page_list -> pageout -> writepage
callback with lots of unallocated buffer_heads (blocks).

Also, a journal commit with jbd2 looks at the inode and all the dirty
pages, rather than the buffer_heads (journal_submit_data_buffers).  We
don't force commit pages that don't have blocks allocated with ext4.
The consistency is only with i_size and data.

> 	if (page_has_buffers(page)) {
> 		page_bufs = page_buffers(page);
> 		if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
> 				      ext4_bh_unmapped_or_delay)) {
> 			/*
> 			 * We don't want to do block allocation
> 			 * So redirty the page and return
> 			 * We may reach here when we do a journal commit
> 			 * via journal_submit_inode_data_buffers.
> 			 * If we don't have mapping block we just ignore
> 			 * them. We can also reach here via shrink_page_list
> 			 */
> 			redirty_page_for_writepage(wbc, page);
>
> 			printk("redirty page %Lu\n", page_offset(page));
>
> 			unlock_page(page);
> 			return 0;
> 		}
> 	} else {
>
> -chris

-aneesh

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering
  2008-10-02 18:18     ` Aneesh Kumar K.V
@ 2008-10-02 19:44       ` Andrew Morton
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2008-10-02 19:44 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: Chris Mason, linux-kernel, linux-fsdevel, ext4

On Thu, 2 Oct 2008 23:48:56 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> > For a 4.5GB streaming buffered write, this printk inside
> > ext4_da_writepage shows up 372,429 times in /var/log/messages.
>
> Part of that can happen due to shrink_page_list -> pageout -> writepage
> callback with lots of unallocated buffer_heads (blocks).

That workload shouldn't be using that code path much at all.  It's
supposed to be the case that pdflush and balance_dirty_pages() do most
of the writeback work.

And that _used_ to be the case, but we broke it.  It happened several
years ago and I wasn't able to provoke anyone into finding out why.
iirc the XFS guys noticed it because their throughput was fairly badly
affected.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering
  2008-10-02 18:18     ` Aneesh Kumar K.V
  2008-10-02 19:44       ` Andrew Morton
@ 2008-10-02 23:43       ` Dave Chinner
  2008-10-03 19:45         ` Chris Mason
  2008-10-09 15:11         ` Chris Mason
  2008-10-03  1:11       ` Chris Mason
  2 siblings, 2 replies; 24+ messages in thread
From: Dave Chinner @ 2008-10-02 23:43 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Chris Mason, Andrew Morton, linux-kernel, linux-fsdevel, ext4

On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote:
> On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote:
> > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > For a 4.5GB streaming buffered write, this printk inside
> > ext4_da_writepage shows up 372,429 times in /var/log/messages.
>
> Part of that can happen due to shrink_page_list -> pageout -> writepage
> callback with lots of unallocated buffer_heads (blocks).

Quite frankly, a simple streaming buffered write should *never*
trigger writeback from the LRU in memory reclaim.  That indicates
that some feedback loop has broken down and we are not cleaning
pages fast enough or perhaps in the correct order.  Page reclaim in
this case should be reclaiming clean pages (those that have already
been written back), not writing back random dirty pages.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering
  2008-10-02 23:43       ` Dave Chinner
@ 2008-10-03 19:45         ` Chris Mason
  2008-10-06 10:16           ` Aneesh Kumar K.V
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Mason @ 2008-10-03 19:45 UTC (permalink / raw)
To: Dave Chinner
Cc: Aneesh Kumar K.V, Andrew Morton, linux-kernel, linux-fsdevel, ext4

On Fri, 2008-10-03 at 09:43 +1000, Dave Chinner wrote:
> On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote:
> > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote:
> > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > > For a 4.5GB streaming buffered write, this printk inside
> > > ext4_da_writepage shows up 372,429 times in /var/log/messages.
> >
> > Part of that can happen due to shrink_page_list -> pageout -> writepage
> > callback with lots of unallocated buffer_heads (blocks).
>
> Quite frankly, a simple streaming buffered write should *never*
> trigger writeback from the LRU in memory reclaim.

The blktrace runs on ext4 didn't show kswapd doing any IO.  It isn't
clear if this is because ext4 did the redirty trick or if kswapd didn't
call writepage.

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering
  2008-10-03 19:45         ` Chris Mason
@ 2008-10-06 10:16           ` Aneesh Kumar K.V
  2008-10-06 14:21             ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Aneesh Kumar K.V @ 2008-10-06 10:16 UTC (permalink / raw)
To: Chris Mason
Cc: Dave Chinner, Andrew Morton, linux-kernel, linux-fsdevel, ext4

On Fri, Oct 03, 2008 at 03:45:55PM -0400, Chris Mason wrote:
> On Fri, 2008-10-03 at 09:43 +1000, Dave Chinner wrote:
> > On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote:
> > > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote:
> > > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > > > For a 4.5GB streaming buffered write, this printk inside
> > > > ext4_da_writepage shows up 372,429 times in /var/log/messages.
> > >
> > > Part of that can happen due to shrink_page_list -> pageout -> writepage
> > > callback with lots of unallocated buffer_heads (blocks).
> >
> > Quite frankly, a simple streaming buffered write should *never*
> > trigger writeback from the LRU in memory reclaim.
>
> The blktrace runs on ext4 didn't show kswapd doing any IO.  It isn't
> clear if this is because ext4 did the redirty trick or if kswapd didn't
> call writepage.
>
> -chris

This patch actually reduced the number of extents for the below test
from 564 to 171.

$ dd if=/dev/zero of=test bs=1M count=1024

The ext4 mballoc block allocator can still be improved to make sure we
get fewer extents.  I am looking into this.

What the below change basically does is to make sure we advance
writeback_index only after looking at the pages skipped.  With delayed
allocation, when we request 100 blocks we just add each of those blocks
to an in-memory extent.  Once we get the contiguous chunk of 100
requested blocks, the index has been updated to 100, and we ask the
block allocator for 100 blocks.  But the allocator may give us back
only 50 blocks.  So with the current code writeback_index points to
100, while with the changes below it points to 50.

We also force a loop inside ext4_da_writepages with the below
conditions:

a) If nr_to_write is zero, break the loop.

b) If we were able to allocate blocks but still have pages_skipped,
   move writeback_index/range_start back to the initial value and try
   with a new nr_to_write value.  This makes sure we allocate blocks
   for all requested dirty pages.  The reason is __fsync_super.  It
   does:

	sync_inodes_sb(sb, 0);
	sync_blockdev(sb->s_bdev);
	sync_inodes_sb(sb, 1);

   I guess the first call is supposed to have done all the metadata
   allocation?  So I was forcing the block allocation without even
   looking at WB_SYNC_ALL.

c) If we don't have anything to write, break the loop.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 21f1d3a..58d010d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2386,14 +2386,16 @@ static int ext4_da_writepages(struct address_space *mapping,
 		wbc->nr_to_write = sbi->s_mb_stream_request;
 	}

-	if (!wbc->range_cyclic)
+	if (wbc->range_cyclic) {
+		range_start = mapping->writeback_index;
+	} else {
 		/*
 		 * If range_cyclic is not set force range_cont
-		 * and save the old writeback_index
+		 * and save the old range_start;
 		 */
 		wbc->range_cont = 1;
-
-	range_start = wbc->range_start;
+		range_start = wbc->range_start;
+	}
 	pages_skipped = wbc->pages_skipped;

 	mpd.wbc = wbc;
@@ -2440,6 +2442,19 @@ static int ext4_da_writepages(struct address_space *mapping,
 			 */
 			to_write += wbc->nr_to_write;
 			ret = 0;
+			if (pages_skipped != wbc->pages_skipped) {
+				/* writepages skipped some pages */
+				if (wbc->range_cont) {
+					wbc->range_start = range_start;
+				} else {
+					/* range_cyclic */
+					mapping->writeback_index = range_start;
+				}
+				wbc->nr_to_write = to_write +
+					(wbc->pages_skipped - pages_skipped);
+				wbc->pages_skipped = pages_skipped;
+			} else
+				wbc->nr_to_write = to_write;
 		} else if (wbc->nr_to_write) {
 			/*
 			 * There is no more writeout needed
@@ -2449,18 +2464,27 @@ static int ext4_da_writepages(struct address_space *mapping,
 			to_write += wbc->nr_to_write;
 			break;
 		}
-		wbc->nr_to_write = to_write;
 	}

 	if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) {
 		/* We skipped pages in this loop */
 		wbc->range_start = range_start;
 		wbc->nr_to_write = to_write +
-				wbc->pages_skipped - pages_skipped;
+				(wbc->pages_skipped - pages_skipped);
 		wbc->pages_skipped = pages_skipped;
 		goto restart_loop;
 	}

+	if (wbc->range_cyclic && (pages_skipped != wbc->pages_skipped)) {
+		/*
+		 * we need to make sure we don't move the
+		 * writeback_index further without looking
+		 * at the pages skipped.
+		 */
+		mapping->writeback_index = mapping->writeback_index -
+				(wbc->pages_skipped - pages_skipped);
+	}
+
 out_writepages:
 	wbc->nr_to_write = to_write - nr_to_writebump;
 	wbc->range_start = range_start;

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering
  2008-10-06 10:16           ` Aneesh Kumar K.V
@ 2008-10-06 14:21             ` Chris Mason
  2008-10-07  8:45               ` Aneesh Kumar K.V
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2008-10-06 14:21 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Dave Chinner, Andrew Morton, linux-kernel, linux-fsdevel, ext4

On Mon, 2008-10-06 at 15:46 +0530, Aneesh Kumar K.V wrote:
> On Fri, Oct 03, 2008 at 03:45:55PM -0400, Chris Mason wrote:
> > On Fri, 2008-10-03 at 09:43 +1000, Dave Chinner wrote:
> > > On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote:
> > > > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote:
> > > > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > > > > For a 4.5GB streaming buffered write, this printk inside
> > > > > ext4_da_writepage shows up 372,429 times in /var/log/messages.
> > > >
> > > > Part of that can happen due to shrink_page_list -> pageout -> writepage
> > > > callback with lots of unallocated buffer_heads (blocks).
> > >
> > > Quite frankly, a simple streaming buffered write should *never*
> > > trigger writeback from the LRU in memory reclaim.
> >
> > The blktrace runs on ext4 didn't show kswapd doing any IO.  It isn't
> > clear if this is because ext4 did the redirty trick or if kswapd didn't
> > call writepage.
> >
> > -chris
>
> This patch actually reduced the number of extents for the below test
> from 564 to 171.

For my array, this patch brings the number of ext4 extents down from
over 4000 to 27.  The throughput reported by dd goes up from ~80MB/s to
330MB/s, which means buffered IO is going as fast as O_DIRECT.

Here's the graph:

http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-aneesh.png

The strange metadata writeback for the uninit block groups is gone.

Looking at the patch, I think the ext4_writepages code should just make
its own write_cache_pages.  It's pretty hard to follow the code that is
there for ext4 vs the code that is there to make write_cache_pages do
what ext4 expects it to.

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering
  2008-10-06 14:21             ` Chris Mason
@ 2008-10-07  8:45               ` Aneesh Kumar K.V
  2008-10-07  9:05                 ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Aneesh Kumar K.V @ 2008-10-07 8:45 UTC (permalink / raw)
To: Chris Mason
Cc: Dave Chinner, Andrew Morton, linux-kernel, linux-fsdevel, ext4

On Mon, Oct 06, 2008 at 10:21:43AM -0400, Chris Mason wrote:
> On Mon, 2008-10-06 at 15:46 +0530, Aneesh Kumar K.V wrote:
> > This patch actually reduced the number of extents for the below test
> > from 564 to 171.
>
> For my array, this patch brings the number of ext4 extents down from
> over 4000 to 27.  The throughput reported by dd goes up from ~80MB/s to
> 330MB/s, which means buffered IO is going as fast as O_DIRECT.
>
> Here's the graph:
>
> http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-aneesh.png
>
> The strange metadata writeback for the uninit block groups is gone.
>
> Looking at the patch, I think the ext4_writepages code should just make
> its own write_cache_pages.  It's pretty hard to follow the code that is
> there for ext4 vs the code that is there to make write_cache_pages do
> what ext4 expects it to.
>

How about the below ?  The patch is on top of the ext4 patchqueue.

commit 9b839065f973eb68e8e28be18e1f31d8fb6854ee
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Tue Oct 7 12:27:52 2008 +0530

    ext4: Fix file fragmentation during large file write.

    The range_cyclic writeback mode uses the address_space
    writeback_index as the start index for writeback.  With delayed
    allocation we were updating writeback_index wrongly, resulting in a
    highly fragmented file.  The number of extents reduced from 4000 to
    27 for a 3GB file with the below patch.

    The patch also cleans up ext4 delayed allocation writepages by
    implementing write_cache_pages locally with the needed changes, and
    drops the range_cont writeback mode added for ext4 delayed
    allocation writeback.

    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 21f1d3a..b6b0985 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1648,6 +1648,7 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd)
 	int ret = 0, err, nr_pages, i;
 	unsigned long index, end;
 	struct pagevec pvec;
+	long pages_skipped;

 	BUG_ON(mpd->next_page <= mpd->first_page);
 	pagevec_init(&pvec, 0);
@@ -1655,20 +1656,30 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd)
 	end = mpd->next_page - 1;

 	while (index <= end) {
-		/* XXX: optimize tail */
-		nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
+		/*
+		 * We can use PAGECACHE_TAG_DIRTY lookup here because
+		 * even though we have cleared the dirty flag on the page
+		 * We still keep the page in the radix tree with tag
+		 * PAGECACHE_TAG_DIRTY. See clear_page_dirty_for_io.
+		 * The PAGECACHE_TAG_DIRTY is cleared in set_page_writeback
+		 * which is called via the below writepage callback.
+		 */
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+					PAGECACHE_TAG_DIRTY,
+					min(end - index,
+					(pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			break;
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];

-			index = page->index;
-			if (index > end)
-				break;
-			index++;
-
+			pages_skipped = mpd->wbc->pages_skipped;
 			err = mapping->a_ops->writepage(page, mpd->wbc);
-			if (!err)
+			if (!err && (pages_skipped == mpd->wbc->pages_skipped))
+				/*
+				 * have successfully written the page
+				 * without skipping the same
+				 */
 				mpd->pages_written++;
 			/*
 			 * In error case, we have to continue because
@@ -2088,6 +2099,100 @@ static int __mpage_da_writepage(struct page *page,
 	return 0;
 }

+static int ext4_write_cache_pages(struct address_space *mapping,
+			struct writeback_control *wbc, writepage_t writepage,
+			void *data)
+{
+	struct pagevec pvec;
+	pgoff_t index, end;
+	long to_write = wbc->nr_to_write;
+	int ret = 0, done = 0, scanned = 0, nr_pages;
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+
+	if (wbc->nonblocking && bdi_write_congested(bdi)) {
+		wbc->encountered_congestion = 1;
+		return 0;
+	}
+
+	pagevec_init(&pvec, 0);
+	if (wbc->range_cyclic) {
+		index = mapping->writeback_index; /* Start from prev offset */
+		end = -1;
+	} else {
+		index = wbc->range_start >> PAGE_CACHE_SHIFT;
+		end = wbc->range_end >> PAGE_CACHE_SHIFT;
+		scanned = 1;
+	}
+retry:
+	while (!done && (index <= end) &&
+		(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+			PAGECACHE_TAG_DIRTY,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+		unsigned i;
+
+		scanned = 1;
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			/*
+			 * At this point we hold neither mapping->tree_lock nor
+			 * lock on the page itself: the page may be truncated or
+			 * invalidated (changing page->mapping to NULL), or even
+			 * swizzled back from swapper_space to tmpfs file
+			 * mapping
+			 */
+			lock_page(page);
+
+			if (unlikely(page->mapping != mapping)) {
+				unlock_page(page);
+				continue;
+			}
+			if (!wbc->range_cyclic && page->index > end) {
+				done = 1;
+				unlock_page(page);
+				continue;
+			}
+			if (wbc->sync_mode != WB_SYNC_NONE)
+				wait_on_page_writeback(page);
+
+			if (PageWriteback(page) ||
+				!clear_page_dirty_for_io(page)) {
+				unlock_page(page);
+				continue;
+			}
+			ret = (*writepage)(page, wbc, data);
+			if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
+				unlock_page(page);
+				ret = 0;
+			}
+			if (ret || (--(to_write) <= 0))
+				/*
+				 * writepage either failed.
+				 * or did an extent write.
+				 * We wrote what we are asked to
+				 * write
+				 */
+				done = 1;
+			if (wbc->nonblocking && bdi_write_congested(bdi)) {
+				wbc->encountered_congestion = 1;
+				done = 1;
+			}
+		}
+		pagevec_release(&pvec);
+		cond_resched();
+	}
+	if (!scanned && !done) {
+		/*
+		 * We hit the last page and there is more work to be done: wrap
+		 * back to the start of the file
+		 */
+		scanned = 1;
+		index = 0;
+		goto retry;
+	}
+	return ret;
+}
+
 /*
  * mpage_da_writepages - walk the list of dirty pages of the given
  * address space, allocates non-allocated blocks, maps newly-allocated
@@ -2104,7 +2209,6 @@ static int mpage_da_writepages(struct address_space *mapping,
 			struct writeback_control *wbc,
 			struct mpage_da_data *mpd)
 {
-	long to_write;
 	int ret;

 	if (!mpd->get_block)
@@ -2119,10 +2223,7 @@ static int mpage_da_writepages(struct address_space *mapping,
 	mpd->pages_written = 0;
 	mpd->retval = 0;

-	to_write = wbc->nr_to_write;
-
-	ret = write_cache_pages(mapping, wbc, __mpage_da_writepage, mpd);
-
+	ret = ext4_write_cache_pages(mapping, wbc, __mpage_da_writepage, mpd);
 	/*
 	 * Handle last extent of pages
 	 */
@@ -2131,7 +2232,7 @@ static int mpage_da_writepages(struct address_space *mapping,
 		mpage_da_submit_io(mpd);
 	}

-	wbc->nr_to_write = to_write - mpd->pages_written;
+	wbc->nr_to_write -= mpd->pages_written;
 	return ret;
 }

@@ -2360,12 +2461,13 @@ static int ext4_da_writepages_trans_blocks(struct inode *inode)
 static int ext4_da_writepages(struct address_space *mapping,
 			      struct writeback_control *wbc)
 {
+	pgoff_t index;
+	int range_whole = 0;
 	handle_t *handle = NULL;
-	loff_t range_start = 0;
+	long pages_written = 0;
 	struct mpage_da_data mpd;
 	struct inode *inode = mapping->host;
 	int needed_blocks, ret = 0, nr_to_writebump = 0;
-	long to_write, pages_skipped = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);

 	/*
@@ -2385,23 +2487,18 @@ static int ext4_da_writepages(struct address_space *mapping,
 		nr_to_writebump = sbi->s_mb_stream_request - wbc->nr_to_write;
 		wbc->nr_to_write = sbi->s_mb_stream_request;
 	}
+	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
+		range_whole = 1;

-	if (!wbc->range_cyclic)
-		/*
-		 * If range_cyclic is not set force range_cont
-		 * and save the old writeback_index
-		 */
-		wbc->range_cont = 1;
-
-	range_start = wbc->range_start;
-	pages_skipped = wbc->pages_skipped;
+	if (wbc->range_cyclic)
+		index = mapping->writeback_index;
+	else
+		index = wbc->range_start >> PAGE_CACHE_SHIFT;

 	mpd.wbc = wbc;
 	mpd.inode = mapping->host;

-restart_loop:
-	to_write = wbc->nr_to_write;
-	while (!ret && to_write > 0) {
+	while (!ret && wbc->nr_to_write > 0) {

 		/*
 		 * we insert one extent at a time. So we need
@@ -2422,48 +2519,45 @@ static int ext4_da_writepages(struct address_space *mapping,
 			dump_stack();
 			goto out_writepages;
 		}
-		to_write -= wbc->nr_to_write;
-
 		mpd.get_block = ext4_da_get_block_write;
 		ret = mpage_da_writepages(mapping, wbc, &mpd);

 		ext4_journal_stop(handle);

-		if (mpd.retval == -ENOSPC)
+		if (mpd.retval == -ENOSPC) {
+			/* commit the transaction which would
+			 * free blocks released in the transaction
+			 * and try again
+			 */
 			jbd2_journal_force_commit_nested(sbi->s_journal);
-
-		/* reset the retry count */
-		if (ret == MPAGE_DA_EXTENT_TAIL) {
+			ret = 0;
+		} else if (ret == MPAGE_DA_EXTENT_TAIL) {
 			/*
 			 * got one extent now try with
 			 * rest of the pages
 			 */
-			to_write += wbc->nr_to_write;
+			pages_written += mpd.pages_written;
 			ret = 0;
-		} else if (wbc->nr_to_write) {
+		} else if (wbc->nr_to_write)
 			/*
 			 * There is no more writeout needed
 			 * or we requested for a noblocking writeout
 			 * and we found the device congested
 			 */
-			to_write += wbc->nr_to_write;
 			break;
-		}
-		wbc->nr_to_write = to_write;
 	}

-	if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) {
-		/* We skipped pages in this loop */
-		wbc->range_start = range_start;
-		wbc->nr_to_write = to_write +
-				wbc->pages_skipped - pages_skipped;
-		wbc->pages_skipped = pages_skipped;
-		goto restart_loop;
-	}
+	/* Update index */
+	index += pages_written;
+	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
+		/*
+		 * set the writeback_index so that range_cyclic
+		 * mode will write it back later
+		 */
+		mapping->writeback_index = index;

 out_writepages:
-	wbc->nr_to_write = to_write - nr_to_writebump;
-	wbc->range_start = range_start;
+	wbc->nr_to_write -= nr_to_writebump;
 	return ret;
 }

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 12b15c5..bd91987 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -63,7 +63,6 @@ struct writeback_control {
 	unsigned for_writepages:1;	/* This is a writepages() call */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned more_io:1;		/* more io to be dispatched */
-	unsigned range_cont:1;
 };

 /*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 24de8b6..718efa6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -961,8 +961,6 @@ int write_cache_pages(struct address_space *mapping,
 	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
 		mapping->writeback_index = index;

-	if (wbc->range_cont)
-		wbc->range_start = index << PAGE_CACHE_SHIFT;
 	return ret;
 }
 EXPORT_SYMBOL(write_cache_pages);

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-07 8:45 ` Aneesh Kumar K.V @ 2008-10-07 9:05 ` Christoph Hellwig 2008-10-07 10:02 ` Aneesh Kumar K.V 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2008-10-07 9:05 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Chris Mason, Dave Chinner, Andrew Morton, linux-kernel, linux-fsdevel, ext4 On Tue, Oct 07, 2008 at 02:15:31PM +0530, Aneesh Kumar K.V wrote: > +static int ext4_write_cache_pages(struct address_space *mapping, > + struct writeback_control *wbc, writepage_t writepage, > + void *data) > +{ Looking at this function, the only difference is killing the writeback_index and range_start updates. If they are bad, why would we only remove them from ext4? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-07 9:05 ` Christoph Hellwig @ 2008-10-07 10:02 ` Aneesh Kumar K.V 2008-10-07 13:29 ` Theodore Tso 2008-10-07 13:55 ` Peter Staubach 0 siblings, 2 replies; 24+ messages in thread From: Aneesh Kumar K.V @ 2008-10-07 10:02 UTC (permalink / raw) To: Christoph Hellwig Cc: Chris Mason, Dave Chinner, Andrew Morton, linux-kernel, linux-fsdevel, ext4 On Tue, Oct 07, 2008 at 05:05:54AM -0400, Christoph Hellwig wrote: > On Tue, Oct 07, 2008 at 02:15:31PM +0530, Aneesh Kumar K.V wrote: > > +static int ext4_write_cache_pages(struct address_space *mapping, > > + struct writeback_control *wbc, writepage_t writepage, > > + void *data) > > +{ > > Looking at this function, the only difference is killing the > writeback_index and range_start updates. If they are bad, why would we > only remove them from ext4? I am also not updating wbc->nr_to_write. ext4 delayed allocation writeback is a bit tricky. It does: a) Look at the dirty pages and build an in-memory extent of contiguous logical file blocks. If we use write_cache_pages to do that, it will update nr_to_write, writeback_index etc. during this stage. b) Request the block allocator for 'x' blocks. We get the value x from step a. c) The block allocator may return less than 'x' contiguous blocks. That would mean the variables updated by write_cache_pages need to be corrected. The old code was doing that. Chris Mason suggested it would make it easy to use a write_cache_pages which doesn't update the variables for ext4. I don't think other filesystems have this requirement. -aneesh ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-07 10:02 ` Aneesh Kumar K.V @ 2008-10-07 13:29 ` Theodore Tso 2008-10-07 13:36 ` Christoph Hellwig 2008-10-07 13:55 ` Peter Staubach 1 sibling, 1 reply; 24+ messages in thread From: Theodore Tso @ 2008-10-07 13:29 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Christoph Hellwig, Chris Mason, Dave Chinner, Andrew Morton, linux-kernel, linux-fsdevel, ext4 On Tue, Oct 07, 2008 at 03:32:57PM +0530, Aneesh Kumar K.V wrote: > On Tue, Oct 07, 2008 at 05:05:54AM -0400, Christoph Hellwig wrote: > > On Tue, Oct 07, 2008 at 02:15:31PM +0530, Aneesh Kumar K.V wrote: > > > +static int ext4_write_cache_pages(struct address_space *mapping, > > > + struct writeback_control *wbc, writepage_t writepage, > > > + void *data) > > > +{ > > > > Looking at this functions the only difference is killing the > > writeback_index and range_start updates. If they are bad why would we > > only remove them from ext4? > > I am also not updating wbc->nr_to_write. ... > I don't think other filesystem have this requirement. That's true, but there is a lot of code duplication, which means that bugs or changes in write_cache_pages() would need to be fixed in ext4_write_cache_pages(). So another approach that might be better from a long-term code maintenance point of view is to add a flag in struct writeback_control that tells write_cache_pages() not to update those fields, and avoid duplicating approximately 95 lines of code. It means a change in a core mm function, though, so if folks think it's too ugly, we can make our own copy in fs/ext4. Opinions? Andrew, as someone who often weighs in on fs and mm issues, what do you think? My preference would be to make the change to mm/page-writeback.c, controlled by a flag which would be set by fs/ext4 before it calls write_cache_pages(). - Ted ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-07 13:29 ` Theodore Tso @ 2008-10-07 13:36 ` Christoph Hellwig 2008-10-07 14:46 ` Nick Piggin 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2008-10-07 13:36 UTC (permalink / raw) To: Theodore Tso, Aneesh Kumar K.V, Christoph Hellwig, Chris Mason, Dave Chinner On Tue, Oct 07, 2008 at 09:29:11AM -0400, Theodore Tso wrote: > That's true, but there is a lot of code duplication, which means that > bugs or changes in write_cache_pages() would need to be fixed in > ext4_write_cache_pages(). So another approach that might be better > from a long-term code maintenance point of view is to add a flag in > struct writeback_control that tells write_cache_pages() not to update > those fields, and avoid duplicating approximately 95 lines of code. > It means a change in a core mm function, though, so if folks thinks > its too ugly, we can make our own copy in fs/ext4. > > Opinions? Andrew, as someone who often weighs in on fs and mm issues, > what do you think? My preference would be to make the change to > mm/page-writeback.c, controlled by a flag which ext4 would set be set > by fs/ext4 before it calls write_cache_pages(). I agree. But I'm still not quite sure if that requirement is unique to ext4 anyway. Give me some time to dive into the writeback code again, haven't been there for quite a while. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-07 13:36 ` Christoph Hellwig @ 2008-10-07 14:46 ` Nick Piggin 0 siblings, 0 replies; 24+ messages in thread From: Nick Piggin @ 2008-10-07 14:46 UTC (permalink / raw) To: Christoph Hellwig Cc: Theodore Tso, Aneesh Kumar K.V, Chris Mason, Dave Chinner, Andrew Morton, linux-kernel, linux-fsdevel, ext4 On Wednesday 08 October 2008 00:36, Christoph Hellwig wrote: > On Tue, Oct 07, 2008 at 09:29:11AM -0400, Theodore Tso wrote: > > That's true, but there is a lot of code duplication, which means that > > bugs or changes in write_cache_pages() would need to be fixed in > > ext4_write_cache_pages(). So another approach that might be better > > from a long-term code maintenance point of view is to add a flag in > > struct writeback_control that tells write_cache_pages() not to update > > those fields, and avoid duplicating approximately 95 lines of code. > > It means a change in a core mm function, though, so if folks thinks > > its too ugly, we can make our own copy in fs/ext4. Heh, funny you should mention that. I was looking at it just the other day. It's riddled with bugs (some of which supposedly are by-design circumventing of data integrity). http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-10/msg00917.html I'm also looking (in the same thread) at doing a patch to improve fsync performance and ensure it doesn't get stuck behind concurrent dirtiers by adding another tag to the radix-tree. Mikulas is also looking at improving that problem with another method. It would be nice if fsdevel gurus would participate (I'll send out my patchset when I get it working, and recap the situation and competing ideas). ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-07 10:02 ` Aneesh Kumar K.V 2008-10-07 13:29 ` Theodore Tso @ 2008-10-07 13:55 ` Peter Staubach 2008-10-07 14:38 ` Chuck Lever 1 sibling, 1 reply; 24+ messages in thread From: Peter Staubach @ 2008-10-07 13:55 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Christoph Hellwig, Chris Mason, Dave Chinner, Andrew Morton, linux-kernel, linux-fsdevel, ext4 Aneesh Kumar K.V wrote: > On Tue, Oct 07, 2008 at 05:05:54AM -0400, Christoph Hellwig wrote: > >> On Tue, Oct 07, 2008 at 02:15:31PM +0530, Aneesh Kumar K.V wrote: >> >>> +static int ext4_write_cache_pages(struct address_space *mapping, >>> + struct writeback_control *wbc, writepage_t writepage, >>> + void *data) >>> +{ >>> >> Looking at this functions the only difference is killing the >> writeback_index and range_start updates. If they are bad why would we >> only remove them from ext4? >> > > I am also not updating wbc->nr_to_write. > > ext4 delayed allocation writeback is bit tricky. It does > > a) Look at the dirty pages and build an in memory extent of contiguous > logical file blocks. If we use writecache_pages to do that it will > update nr_to_write, writeback_index etc during this stage. > > b) Request the block allocator for 'x' blocks. We get the value x from > step a. > > c) block allocator may return less than 'x' contiguous block. That would > mean the variables updated by write_cache_pages need to corrected. The > old code was doing that. Chris Mason suggested it would make it easy > to use a write_cache_pages which doesn't update the variable for ext4. > > I don't think other filesystem have this requirement. The NFS client can benefit from only writing pages in strictly ascending offset order. The benefit comes from helping the server to do better allocations by not sending file data to the server in random order. There is also an NFS server in the market which requires data to be sent in strict ascending offset order. 
This sort of support would make interoperating with that server much easier. Thanx... ps ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-07 13:55 ` Peter Staubach @ 2008-10-07 14:38 ` Chuck Lever 0 siblings, 0 replies; 24+ messages in thread From: Chuck Lever @ 2008-10-07 14:38 UTC (permalink / raw) To: Peter Staubach Cc: Aneesh Kumar K.V, Christoph Hellwig, Chris Mason, Dave Chinner, Andrew Morton, linux-kernel, linux-fsdevel, ext4 On Oct 7, 2008, at Oct 7, 2008, 9:55 AM, Peter Staubach wrote: > Aneesh Kumar K.V wrote: >> On Tue, Oct 07, 2008 at 05:05:54AM -0400, Christoph Hellwig wrote: >> >>> On Tue, Oct 07, 2008 at 02:15:31PM +0530, Aneesh Kumar K.V wrote: >>> >>>> +static int ext4_write_cache_pages(struct address_space *mapping, >>>> + struct writeback_control *wbc, writepage_t writepage, >>>> + void *data) >>>> +{ >>>> >>> Looking at this functions the only difference is killing the >>> writeback_index and range_start updates. If they are bad why >>> would we >>> only remove them from ext4? >>> >> >> I am also not updating wbc->nr_to_write. >> >> ext4 delayed allocation writeback is bit tricky. It does >> >> a) Look at the dirty pages and build an in memory extent of >> contiguous >> logical file blocks. If we use writecache_pages to do that it will >> update nr_to_write, writeback_index etc during this stage. >> >> b) Request the block allocator for 'x' blocks. We get the value x >> from >> step a. >> >> c) block allocator may return less than 'x' contiguous block. That >> would >> mean the variables updated by write_cache_pages need to corrected. >> The >> old code was doing that. Chris Mason suggested it would make it easy >> to use a write_cache_pages which doesn't update the variable for >> ext4. >> >> I don't think other filesystem have this requirement. > > The NFS client can benefit from only writing pages in strictly > ascending offset order. The benefit comes from helping the > server to do better allocations by not sending file data to the > server in random order. 
For the record, it would also help prevent the creation of temporary holes in O_APPEND files. If an NFS client writes the front and back ends of a request before it writes the middle, other clients will see a temporary hole in that file. Applications (especially simple ones like "tail") are often not prepared for the appearance of such holes. Over a client crash, data integrity would improve if the client was less likely to create temporary holes in files. > There is also an NFS server in the market which requires data > to be sent in strict ascending offset order. This sort of > support would make interoperating with that server much easier. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-02 23:43 ` Dave Chinner 2008-10-03 19:45 ` Chris Mason @ 2008-10-09 15:11 ` Chris Mason 2008-10-10 5:13 ` Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Chris Mason @ 2008-10-09 15:11 UTC (permalink / raw) To: Dave Chinner Cc: Aneesh Kumar K.V, Andrew Morton, linux-kernel, linux-fsdevel, ext4, Christoph Hellwig On Fri, 2008-10-03 at 09:43 +1000, Dave Chinner wrote: > On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote: > > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote: > > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote: > > > For a 4.5GB streaming buffered write, this printk inside > > > ext4_da_writepage shows up 37,2429 times in /var/log/messages. > > > > > > > Part of that can happen due to shrink_page_list -> pageout -> writepagee > > call back with lots of unallocated buffer_heads(blocks). > > Quite frankly, a simple streaming buffered write should *never* > trigger writeback from the LRU in memory reclaim. That indicates > that some feedback loop has broken down and we are not cleaning > pages fast enough or perhaps in the correct order. Page reclaim in > this case should be reclaiming clean pages (those that have already > been written back), not writing back random dirty pages. Here are some go faster stripes for the XFS buffered writeback. This patch has a lot of debatable features to it, but the idea is to show which knobs are slowing us down today. The first change is to avoid calling balance_dirty_pages_ratelimited on every page. When we know we're doing a largeish write it makes more sense to balance things less often. This might just mean our ratelimit_pages magic value is too small. The second change makes xfs bump wbc->nr_to_write (suggested by Christoph), which probably makes delalloc go in bigger chunks. On unpatched kernels, XFS does streaming writes to my 4 drive array at around 205MB/s. With the patch below, I come in at 326MB/s. 
O_DIRECT runs at 330MB/s, so that's pretty good. With just the nr_to_write change, I get around 315MB/s. With just the balance_dirty_pages_nr change, I get around 240MB/s. -chris diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index a44d68e..c72bd54 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -944,6 +944,9 @@ xfs_page_state_convert( int trylock = 0; int all_bh = unmapped; + + wbc->nr_to_write *= 4; + if (startio) { if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking) trylock |= BMAPI_TRYLOCK; diff --git a/mm/filemap.c b/mm/filemap.c index 876bc59..b6c26e3 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2389,6 +2389,7 @@ static ssize_t generic_perform_write(struct file *file, long status = 0; ssize_t written = 0; unsigned int flags = 0; + unsigned long nr = 0; /* * Copies from kernel address space cannot fail (NFSD is a big user). @@ -2460,11 +2461,17 @@ again: } pos += copied; written += copied; - - balance_dirty_pages_ratelimited(mapping); + nr++; + if (nr > 256) { + balance_dirty_pages_ratelimited_nr(mapping, nr); + nr = 0; + } } while (iov_iter_count(i)); + if (nr) + balance_dirty_pages_ratelimited_nr(mapping, nr); + return written ? written : status; } ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-09 15:11 ` Chris Mason @ 2008-10-10 5:13 ` Dave Chinner 0 siblings, 0 replies; 24+ messages in thread From: Dave Chinner @ 2008-10-10 5:13 UTC (permalink / raw) To: Chris Mason Cc: Aneesh Kumar K.V, Andrew Morton, linux-kernel, linux-fsdevel, ext4, Christoph Hellwig On Thu, Oct 09, 2008 at 11:11:20AM -0400, Chris Mason wrote: > On Fri, 2008-10-03 at 09:43 +1000, Dave Chinner wrote: > > On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote: > > > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote: > > > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote: > > > > For a 4.5GB streaming buffered write, this printk inside > > > > ext4_da_writepage shows up 37,2429 times in /var/log/messages. > > > > > > > > > > Part of that can happen due to shrink_page_list -> pageout -> writepagee > > > call back with lots of unallocated buffer_heads(blocks). > > > > Quite frankly, a simple streaming buffered write should *never* > > trigger writeback from the LRU in memory reclaim. That indicates > > that some feedback loop has broken down and we are not cleaning > > pages fast enough or perhaps in the correct order. Page reclaim in > > this case should be reclaiming clean pages (those that have already > > been written back), not writing back random dirty pages. > > Here are some go faster stripes for the XFS buffered writeback. This > patch has a lot of debatable features to it, but the idea is to show > which knobs are slowing us down today. > > The first change is to avoid calling balance_dirty_pages_ratelimited on > every page. When we know we're doing a largeish write it makes more > sense to balance things less often. This might just mean our > ratelimit_pages magic value is too small. Ok, so how about doing something like this to reduce the number of balances on large writes, but causing at least one balance call for every write that occurs: int nr = 0; ..... while() { .... 
if (!(nr % 256)) { /* do balance */ } nr++; .... } That way you get a balance on the first page on every write, but then hold off balancing on that write again for some number of pages. > The second change makes xfs bump wbc->nr_to_write (suggested by > Christoph), which probably makes delalloc go in bigger chunks. Hmmmm. Reasonable theory. We used to do gigantic delalloc extents - we paid no attention to congestion and could allocate and write several GB at a time. Latency was an issue, though, so it got changed to be bound by nr_to_write. I guess we need to be issuing larger allocations. Can you remove your patches and see what effect using the allocsize mount option has on throughput? This changes the default delalloc EOF preallocation size, which means more or less allocations. The default is 64k and it can go as high as 1GB, IIRC. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-02 18:18 ` Aneesh Kumar K.V 2008-10-02 19:44 ` Andrew Morton 2008-10-02 23:43 ` Dave Chinner @ 2008-10-03 1:11 ` Chris Mason 2008-10-03 2:43 ` Nick Piggin 2 siblings, 1 reply; 24+ messages in thread From: Chris Mason @ 2008-10-03 1:11 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Andrew Morton, linux-kernel, linux-fsdevel, ext4 On Thu, 2008-10-02 at 23:48 +0530, Aneesh Kumar K.V wrote: > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote: > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote: > > > On Wed, 01 Oct 2008 14:40:51 -0400 Chris Mason <chris.mason@oracle.com> wrote: > > > > > > > The patch below changes write_cache_pages to only use writeback_index > > > > when current_is_pdflush(). The basic idea is that pdflush is the only > > > > one who has concurrency control against the bdi, so it is the only one > > > > who can safely use and update writeback_index. > > > > > > Another approach would be to only update mapping->writeback_index if > > > nobody else altered it meanwhile. > > > > > > > Ok, I can give that a short. > > > > > That being said, I don't really see why we get lots of seekiness when > > > two threads start their writing the file from the same offset. > > > > For metadata, it makes sense. Pages get dirtied in strange order, and > > if writeback_index is jumping around, we'll get the seeky metadata > > writeback. > > > > Data makes less sense, especially the very high extent count from ext4. > > An extra printk shows that ext4 is calling redirty_page_for_writepage > > quite a bit in ext4_da_writepage. This should be enough to make us jump > > around in the file. > > > We need to do start the journal before locking the page with jbd2. > That prevent us from doing any block allocation in writepage() call > back. So with ext4/jbd2 we do block allocation only in writepages() > call back where we start the journal with credit needed to write > a single extent. 
Then we look for contiguous unallocated logical > block and request the block allocator for 'x' blocks. If we get > less than that. The rest of the pages which we iterated in > writepages are redirtied so that we try to allocate them again. > We loop inside ext4_da_writepages itself looking at wbc->pages_skipped > > 2481 if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) { > 2482 /* We skipped pages in this loop */ > 2483 wbc->range_start = range_start; > 2484 wbc->nr_to_write = to_write + > > > > > > For a 4.5GB streaming buffered write, this printk inside > > ext4_da_writepage shows up 37,2429 times in /var/log/messages. > > > > Part of that can happen due to shrink_page_list -> pageout -> writepagee > call back with lots of unallocated buffer_heads(blocks). Also a journal > commit with jbd2 looks at the inode and all the dirty pages, rather than > the buffer_heads (journal_submit_data_buffers). We don't force commit > pages that doesn't have blocks allocated with the ext4. The consistency > is only with i_size and data. In general, I don't think pdflush or the VM expect redirty_page_for_writepage to be used this aggressively. At this point I think we're best off if one of the ext4 developers is able to reproduce and explain things in better detail than my hand waving. My patch is pretty lame, but it isn't a horrible bandage until we can rethink the pdflush<->balance_dirty_pages<->kupdate interactions in detail. Two other data points, ext3 runs at 200MB/s with and without the patch. Btrfs runs at 320MB/s with and without the patch, but only when I turn checksums off. The IO isn't quite as sequential with checksumming on because the helper threads submit things slightly out of order (220MB/s with checksums on). Btrfs does use redirty_page_for_writepage if (current->flags & PF_MEMALLOC) in my writepage call, but doesn't call it from the writepages path. -chris ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-03 1:11 ` Chris Mason @ 2008-10-03 2:43 ` Nick Piggin 2008-10-03 12:07 ` Chris Mason 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2008-10-03 2:43 UTC (permalink / raw) To: Chris Mason Cc: Aneesh Kumar K.V, Andrew Morton, linux-kernel, linux-fsdevel, ext4 On Friday 03 October 2008 11:11, Chris Mason wrote: > On Thu, 2008-10-02 at 23:48 +0530, Aneesh Kumar K.V wrote: > > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote: > > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote: > > > > On Wed, 01 Oct 2008 14:40:51 -0400 Chris Mason <chris.mason@oracle.com> wrote: > > > > > The patch below changes write_cache_pages to only use > > > > > writeback_index when current_is_pdflush(). The basic idea is that > > > > > pdflush is the only one who has concurrency control against the > > > > > bdi, so it is the only one who can safely use and update > > > > > writeback_index. > > > > > > > > Another approach would be to only update mapping->writeback_index if > > > > nobody else altered it meanwhile. > > > > > > Ok, I can give that a short. > > > > > > > That being said, I don't really see why we get lots of seekiness when > > > > two threads start their writing the file from the same offset. > > > > > > For metadata, it makes sense. Pages get dirtied in strange order, and > > > if writeback_index is jumping around, we'll get the seeky metadata > > > writeback. > > > > > > Data makes less sense, especially the very high extent count from ext4. > > > An extra printk shows that ext4 is calling redirty_page_for_writepage > > > quite a bit in ext4_da_writepage. This should be enough to make us > > > jump around in the file. > > > > We need to do start the journal before locking the page with jbd2. > > That prevent us from doing any block allocation in writepage() call > > back. 
So with ext4/jbd2 we do block allocation only in writepages() > > call back where we start the journal with credit needed to write > > a single extent. Then we look for contiguous unallocated logical > > block and request the block allocator for 'x' blocks. If we get > > less than that. The rest of the pages which we iterated in > > writepages are redirtied so that we try to allocate them again. > > We loop inside ext4_da_writepages itself looking at wbc->pages_skipped > > > > 2481 if (wbc->range_cont && (pages_skipped != > > wbc->pages_skipped)) { 2482 /* We skipped pages in this > > loop */ > > 2483 wbc->range_start = range_start; > > 2484 wbc->nr_to_write = to_write + > > > > > For a 4.5GB streaming buffered write, this printk inside > > > ext4_da_writepage shows up 37,2429 times in /var/log/messages. > > > > Part of that can happen due to shrink_page_list -> pageout -> writepagee > > call back with lots of unallocated buffer_heads(blocks). Also a journal > > commit with jbd2 looks at the inode and all the dirty pages, rather than > > the buffer_heads (journal_submit_data_buffers). We don't force commit > > pages that doesn't have blocks allocated with the ext4. The consistency > > is only with i_size and data. > > In general, I don't think pdflush or the VM expect > redirty_pages_for_writepage to be used this aggressively. BTW. redirty_page_for_writepage and the whole model of cleaning the page's dirty bit *before* calling into the filesystem is really nasty IMO. For one thing it opens races that mean a filesystem can't keep metadata about the pagecache properly in synch with the page's dirty bit. I have a patch in my fsblock series that fixes this and has the writepage() function itself clear the page's dirty bit. This basically makes redirty_page_for_writepages go away completely (at least the uses I looked at, I didn't look at ext4 though). Shall I break it out and submit it? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-03 2:43 ` Nick Piggin @ 2008-10-03 12:07 ` Chris Mason 0 siblings, 0 replies; 24+ messages in thread From: Chris Mason @ 2008-10-03 12:07 UTC (permalink / raw) To: Nick Piggin Cc: Aneesh Kumar K.V, Andrew Morton, linux-kernel, linux-fsdevel, ext4 On Fri, 2008-10-03 at 12:43 +1000, Nick Piggin wrote: > On Friday 03 October 2008 11:11, Chris Mason wrote: > > > Part of that can happen due to shrink_page_list -> pageout -> writepagee > > > call back with lots of unallocated buffer_heads(blocks). Also a journal > > > commit with jbd2 looks at the inode and all the dirty pages, rather than > > > the buffer_heads (journal_submit_data_buffers). We don't force commit > > > pages that doesn't have blocks allocated with the ext4. The consistency > > > is only with i_size and data. > > > > In general, I don't think pdflush or the VM expect > > redirty_pages_for_writepage to be used this aggressively. > > BTW. redirty_page_for_writepage and the whole model of cleaning the page's > dirty bit *before* calling into the filesystem is really nasty IMO. For > one thing it opens races that mean a filesystem can't keep metadata about > the pagecache properly in synch with the page's dirty bit. > > I have a patch in my fsblock series that fixes this and has the writepage() > function itself clear the page's dirty bit. This basically makes > redirty_page_for_writepages go away completely (at least the uses I looked > at, I didn't look at ext4 though). > > Shall I break it out and submit it? It's a fair amount of churn in the FS code, and the part I'm not sure of is if the bigger problem is lock ordering around the page lock and the FS locks or just the dirty bit. Personally I'd rather see writepages used everywhere, giving the FS the chance to do more efficient IO. -chris ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Improve buffered streaming write ordering 2008-10-01 18:40 [PATCH] Improve buffered streaming write ordering Chris Mason 2008-10-02 4:52 ` Andrew Morton @ 2008-10-02 18:08 ` Aneesh Kumar K.V 1 sibling, 0 replies; 24+ messages in thread From: Aneesh Kumar K.V @ 2008-10-02 18:08 UTC (permalink / raw) To: Chris Mason; +Cc: linux-kernel, linux-fsdevel, ext4 On Wed, Oct 01, 2008 at 02:40:51PM -0400, Chris Mason wrote: > Hello everyone, > > write_cache_pages can use the address space writeback_index field to > try and pick up where it left off between calls. pdflush and > balance_dirty_pages both enable this mode in hopes of having writeback > evenly walk down the file instead of just servicing pages at the > start of the address space. > > But, there is no locking around this field, and concurrent callers of > write_cache_pages on the same inode can get some very strange results. > pdflush uses writeback_acquire function to make sure that only one > pdflush process is servicing a given backing device, but > balance_dirty_pages does not. > > When there are a small number of dirty inodes in the system, > balance_dirty_pages is likely to run in parallel with pdflush on one or > two of them, leading to somewhat random updates of the writeback_index > field in struct address space. > > The end result is very seeky writeback during streaming IO. A 4 drive > hardware raid0 array here can do 317MB/s streaming O_DIRECT writes on > ext4. This is creating a new file, so O_DIRECT is really just a way to > bypass write_cache_pages. > > If I do buffered writes instead, XFS does 205MB/s, and ext4 clocks in at > 81.7MB/s. Looking at the buffered IO traces for each one, we can see a > lot of seeks. > > http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-nopatch.png > > http://oss.oracle.com/~mason/bugs/writeback_ordering/xfs-nopatch.png > > The patch below changes write_cache_pages to only use writeback_index > when current_is_pdflush(). 
The basic idea is that pdflush is the only > one who has concurrency control against the bdi, so it is the only one > who can safely use and update writeback_index. > > The performance changes quite a bit: > > patched unpatched > XFS 247MB/s 205MB/s > Ext4 246MB/s 81.7MB/s That is nice. > > The graphs after the patch: > > http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-patched.png > > http://oss.oracle.com/~mason/bugs/writeback_ordering/xfs-patched.png > > The ext4 graph really does look strange. What's happening there is the > lazy inode table init has dirtied a whole bunch of pages on the block > device inode. I don't have much of an answer for why my patch makes all > of this writeback happen up front, other then writeback_index is no > longer bouncing all over the address space. > > It is also worth noting that before the patch, filefrag shows ext4 using > about 4000 extents on the file. After the patch it is around 400. XFS > uses 2 extents both patched and unpatched. > Ext4 do block allocation in ext4_da_writepages. So if we are feeding the block allocation with different(highly bouncing) index values we may end up with larger number of extents. Although the new mballoc block allocator should perform better because it reserve space based on logical block number in the file. -aneesh ^ permalink raw reply [flat|nested] 24+ messages in thread
Thread overview: 24+ messages -- 2008-10-01 18:40 [PATCH] Improve buffered streaming write ordering Chris Mason 2008-10-02 4:52 ` Andrew Morton 2008-10-02 12:20 ` Chris Mason 2008-10-02 16:12 ` Chris Mason 2008-10-02 18:18 ` Aneesh Kumar K.V 2008-10-02 19:44 ` Andrew Morton 2008-10-02 23:43 ` Dave Chinner 2008-10-03 19:45 ` Chris Mason 2008-10-06 10:16 ` Aneesh Kumar K.V 2008-10-06 14:21 ` Chris Mason 2008-10-07 8:45 ` Aneesh Kumar K.V 2008-10-07 9:05 ` Christoph Hellwig 2008-10-07 10:02 ` Aneesh Kumar K.V 2008-10-07 13:29 ` Theodore Tso 2008-10-07 13:36 ` Christoph Hellwig 2008-10-07 14:46 ` Nick Piggin 2008-10-07 13:55 ` Peter Staubach 2008-10-07 14:38 ` Chuck Lever 2008-10-09 15:11 ` Chris Mason 2008-10-10 5:13 ` Dave Chinner 2008-10-03 1:11 ` Chris Mason 2008-10-03 2:43 ` Nick Piggin 2008-10-03 12:07 ` Chris Mason 2008-10-02 18:08 ` Aneesh Kumar K.V