* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
From: Wu Fengguang @ 2010-12-05 16:14 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton
  Cc: Theodore Ts'o, Chris Mason, Dave Chinner, Jan Kara,
	Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel@vger.kernel.org, LKML,
	Tang, Feng, linux-ext4


On Wed, Dec 01, 2010 at 09:38:18PM +0800, Wu Fengguang wrote:
> [restore CC list for new findings]
> 
> On Wed, Dec 01, 2010 at 06:39:25AM +0800, Peter Zijlstra wrote:
> > On Tue, 2010-11-30 at 23:35 +0100, Peter Zijlstra wrote:
> > > On Tue, 2010-11-30 at 12:37 +0800, Wu Fengguang wrote:
> > > > On Tue, Nov 30, 2010 at 04:53:33AM +0800, Peter Zijlstra wrote:
> > > > > On Mon, 2010-11-29 at 23:17 +0800, Wu Fengguang wrote:
> > > > > > Hi Peter,
> > > > > >
> > > > > > I'm drawing funny graphs to track the writeback dynamics :)
> > > > > >
> > > > > > In the attached graphs, I found anomalies in dirty-pages-3000.png and
> > > > > > dirty-pages-200.png.  The task limit is what's returned by
> > > > > > task_dirty_limit(), which should be very stable. However, from the
> > > > > > graphs it seems the task weight (numerator/denominator) suddenly
> > > > > > drops to near 0 every 9-10 seconds.  Do you have any immediate insight
> > > > > > into what's going on? If not, I'm going to do some tracing to track down
> > > > > > how the numbers change over time.
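
For reference, the "task limit" curve in those graphs depends on this weight
roughly as in the sketch below. This is only my reading of task_dirty_limit()
in mm/page-writeback.c, written from memory; the 1/8 scale factor and the
function name are illustrative and not taken from this thread.

/*
 * Sketch: a task's share of recent dirtying (num/den) lowers the limit
 * it may dirty up to, so when num collapses to ~0 the task limit jumps
 * back up to the full value -- the spikes visible in dirty-pages-*.png.
 */
static unsigned long task_limit_sketch(unsigned long dirty_limit,
				       unsigned long num, unsigned long den)
{
	unsigned long scaled = dirty_limit >> 3;	/* assumed 1/8 scaling */

	if (!den)
		return dirty_limit;

	return dirty_limit - scaled * num / den;	/* heavier dirtier => lower limit */
}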
> > > > >
> > > > > No immediate thoughts there.. I need to look through the math again, but
> > > > > I'm kinda swamped atm. (and my primary dev machine had its disk die this
> > > > > morning). I'll try and get around to it soon..
> > > >
> > > > Peter, I did a simple debug patch (attached) and collected these
> > > > numbers. I noticed that at the "task_weight=27%" and "task_weight=14%"
> > > > lines, "period" increases, and "num" decreases while "den" is still
> > > > increasing.
> > > >
> > > > num=db2e den=e8c0 period=3f8000 shift=10
> > > > num=e04c den=ede0 period=3f8000 shift=10
> > > > num=e56a den=f300 period=3f8000 shift=10
> > >
> > > > num=3e78 den=e400 period=408000 shift=10
> > >
> > > > num=1341 den=8900 period=418000 shift=10
> > > > num=185f den=8e20 period=418000 shift=10
> > > > num=1d7d den=9340 period=418000 shift=10
> > > > num=229b den=9860 period=418000 shift=10
> > > > num=27b9 den=9da0 period=418000 shift=10
> > > > num=2cd7 den=a2c0 period=418000 shift=10
> > >
> > >
> > > This looks sane.. the period indicates someone else was dirtying lots of
> > > pages. Every time the period increases (it's shifted right by shift) we
> > > divide the events (num) by 2.
> >
> > It's actually shifted left by shift-1.. see prop_norm_single(), which
> > would make the below:
> >
> > > So the increment from 3f8000 to 408000 is 4064 to 4128, or 64; that
> > > should reset events to 0, so seeing that it didn't means it got
> > > incremented as well.
> > >
> > > Funny enough, the second jump is again exactly 64..
> > >
> > > Anyway, as you can see, den increases as long as the period stays
> > > constant, and takes a dip when the period increments.
> >
> > two steps of 128, which is terribly large.
> >
> > then again, a period of 512 pages is very very small.
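
To make the aging above concrete, here is a simplified, standalone sketch of
what I understand prop_norm_single() to be doing -- not the actual
lib/proportions.c code; the struct and function names are made up. "events"
corresponds to num in the debug output above, "period"/"shift" to the same
fields there.

struct prop_local_sketch {
	unsigned long events;	/* this task's recent dirty events ("num") */
	unsigned long period;	/* global period we were last normalized to */
};

/*
 * The global counter advances one period every 2^(shift-1) dirtied pages;
 * each period a task misses halves its local event count, so a burst of
 * dirtying by someone else rapidly ages everyone else's weight towards 0.
 */
static void prop_norm_sketch(struct prop_local_sketch *pl,
			     unsigned long global_period, int shift)
{
	unsigned long missed = (global_period - pl->period) >> (shift - 1);

	if (missed < 8 * sizeof(pl->events))
		pl->events >>= missed;		/* halve once per missed period */
	else
		pl->events = 0;			/* too stale, forget it all */

	pl->period = global_period;
}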
> 
> Peter, I also collected prop_norm_single() traces, hope it helps.
> 
> Again, you can find the time points where the task limit suddenly jumps
> high in the "dirty-pages*.png" graphs, and then find the corresponding
> data points in the "trace" file. Sorry, I computed something wrong: the
> "ratio" field in the trace data is always 0, please just ignore it.
> 
> I noticed that jbd2/sda8-8-2811 dirtied lots of pages, perhaps by
> ext4_bio_write_page(). This should happen only on -ENOMEM.  I also

Ah, I seem to have found the root cause. See the attached graphs. Ext4
appears to be calling redirty_page_for_writepage() to redirty ~300MB of
pages roughly every 10s. The redirties happen in big bursts, so it is not
surprising that the dd task's dirty weight suddenly drops to 0.

It should be the same ext4 issue discussed here:

        http://www.spinics.net/lists/linux-fsdevel/msg39555.html
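
For anyone unfamiliar with the mechanism: redirty_page_for_writepage() is
what a ->writepage implementation calls when it decides not to write a page
after all, putting it back on the dirty lists. Below is a generic sketch of
that bail-out pattern, not the actual ext4 code path; can_start_io() and
do_real_writeout() are made-up placeholders for the filesystem's own logic.
Because the redirtying is accounted to the task doing it (the flusher/jbd2
thread here), a ~300MB burst of redirties is enough to age the dd task's
weight to ~0, which matches the graphs.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static bool can_start_io(struct page *page, struct writeback_control *wbc);
static int do_real_writeout(struct page *page, struct writeback_control *wbc);

static int sketch_writepage(struct page *page, struct writeback_control *wbc)
{
	if (!can_start_io(page, wbc)) {
		/* Put the page back on the dirty lists instead of writing it. */
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;	/* not an error; the page will be retried later */
	}
	return do_real_writeout(page, wbc);
}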

Thanks,
Fengguang

[-- Attachment #2: vmstat-written-300.png --]
[-- Type: image/png, Size: 44152 bytes --]

[-- Attachment #3: vmstat-written.png --]
[-- Type: image/png, Size: 40715 bytes --]


* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
From: Ted Ts'o @ 2010-12-06  2:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, Andrew Morton, Chris Mason, Dave Chinner,
	Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel@vger.kernel.org, LKML,
	Tang, Feng, linux-ext4

On Mon, Dec 06, 2010 at 12:14:35AM +0800, Wu Fengguang wrote:
> 
> Ah, I seem to have found the root cause. See the attached graphs. Ext4
> appears to be calling redirty_page_for_writepage() to redirty ~300MB of
> pages roughly every 10s. The redirties happen in big bursts, so it is not
> surprising that the dd task's dirty weight suddenly drops to 0.
> 
> It should be the same ext4 issue discussed here:
> 
>         http://www.spinics.net/lists/linux-fsdevel/msg39555.html

Yeah, unfortunately the fix suggested isn't the right one.

The right fix is going to involve making much more radical changes to
the ext4 write submission path, which is on my todo queue.  For now,
if people don't like these nasty writeback dynamics, my suggestion is
to mount the filesystem with data=writeback.

This is basically the clean equivalent of the patch suggested by Feng
Tang in his e-mail referenced above.  Given that ext4 uses delayed
allocation, most of the time unwritten blocks are not allocated, and
so stale data isn't exposed.

The case you're seeing here is where the jbd2 data=ordered forced
writeback collides with the writeback thread, and unfortunately, the
forced writeback in the jbd2 layer is done in an extremely inefficient
manner.  So data=writeback is the workaround,
and unlike ext3, it's not a serious security leak.  It is possible for
some stale data to get exposed if you get unlucky when you crash,
though, so there is a potential for some security exposure.

The long-term solution to this problem is to rework the ext4 writeback
path so that we write the data blocks when they are newly allocated,
and then only update fs metadata once they are written.  As I said,
it's on my queue.  Until then, the only suggestion I can give folks is
data=writeback.

						- Ted



* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
From: Dmitry @ 2010-12-06  9:52 UTC (permalink / raw)
  To: Ted Ts'o, Wu Fengguang
  Cc: Peter Zijlstra, Andrew Morton, Chris Mason, Dave Chinner,
	Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel@vger.kernel.org, LKML,
	Tang, Feng, linux-ext4

On Sun, 5 Dec 2010 21:42:31 -0500, Ted Ts'o <tytso@mit.edu> wrote:
> On Mon, Dec 06, 2010 at 12:14:35AM +0800, Wu Fengguang wrote:
> > 
> > Ah, I seem to have found the root cause. See the attached graphs. Ext4
> > appears to be calling redirty_page_for_writepage() to redirty ~300MB of
> > pages roughly every 10s. The redirties happen in big bursts, so it is not
> > surprising that the dd task's dirty weight suddenly drops to 0.
> > 
> > It should be the same ext4 issue discussed here:
> > 
> >         http://www.spinics.net/lists/linux-fsdevel/msg39555.html
> 
> Yeah, unfortunately the fix suggested isn't the right one.
> 
> The right fix is going to involve making much more radical changes to
> the ext4 write submission path, which is on my todo queue.  For now,
> if people don't like these nasty writeback dynamics, my suggestion is
> to mount the filesystem with data=writeback.
> 
> This is basically the clean equivalent of the patch suggested by Feng
> Tang in his e-mail referenced above.  Given that ext4 uses delayed
> allocation, most of the time unwritten blocks are not allocated, and
> so stale data isn't exposed.
Maybe it is reasonable to introduce a new mount option which controls
the dynamic delalloc on/off behavior, for example like this:
0) -odelalloc=off    : analog of nodelalloc
1) -odelalloc=normal : default mode (disable delalloc when close to a full fs)
2) -odelalloc=force  : delalloc always enabled, so we have to do
                       writeback more aggressively in case of ENOSPC.

So one can force delalloc and safely use this writeback mode in a
multi-user environment. OpenVZ already has this. I'll prepare a patch
if you are interested in that feature.
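
Purely as an illustration of the proposed interface, not an implementation:
a sketch of how such a three-valued option could be parsed with the kernel's
match_token() helper from <linux/parser.h>. The Opt_* and DELALLOC_* names
and the function name are hypothetical, not existing ext4 code.

#include <linux/errno.h>
#include <linux/parser.h>

enum { DELALLOC_OFF, DELALLOC_NORMAL, DELALLOC_FORCE };
enum { Opt_delalloc_off, Opt_delalloc_normal, Opt_delalloc_force, Opt_err };

static const match_table_t delalloc_tokens = {
	{ Opt_delalloc_off,	"delalloc=off" },
	{ Opt_delalloc_normal,	"delalloc=normal" },
	{ Opt_delalloc_force,	"delalloc=force" },
	{ Opt_err,		NULL },
};

/* Called from the fs's mount-option parser for each option string. */
static int parse_delalloc_opt(char *opt, int *mode)
{
	substring_t args[MAX_OPT_ARGS];

	switch (match_token(opt, delalloc_tokens, args)) {
	case Opt_delalloc_off:		*mode = DELALLOC_OFF;		return 0;
	case Opt_delalloc_normal:	*mode = DELALLOC_NORMAL;	return 0;
	case Opt_delalloc_force:	*mode = DELALLOC_FORCE;		return 0;
	default:			return -EINVAL;
	}
}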
> 
> The case you're seeing here is where the jbd2 data=ordered forced
> writeback collides with the writeback thread, and unfortunately, the
> forced writeback in the jbd2 layer is done in an extremely inefficient
> manner.  So data=writeback is the workaround,
> and unlike ext3, it's not a serious security leak.  It is possible for
> some stale data to get exposed if you get unlucky when you crash,
> though, so there is a potential for some security exposure.
> 
> The long-term solution to this problem is to rework the ext4 writeback
> path so that we write the data blocks when they are newly allocated,
> and then only update fs metadata once they are written.  As I said,
> it's on my queue.  Until then, the only suggestion I can give folks is
> data=writeback.
> 
> 						- Ted
> 



* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
From: Ted Ts'o @ 2010-12-06 12:34 UTC (permalink / raw)
  To: Dmitry
  Cc: Wu Fengguang, Peter Zijlstra, Andrew Morton, Chris Mason,
	Dave Chinner, Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Christoph Hellwig, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML, Tang, Feng, linux-ext4

On Mon, Dec 06, 2010 at 12:52:21PM +0300, Dmitry wrote:
> Maybe it is reasonable to introduce a new mount option which controls
> the dynamic delalloc on/off behavior, for example like this:
> 0) -odelalloc=off    : analog of nodelalloc
> 1) -odelalloc=normal : default mode (disable delalloc when close to a full fs)
> 2) -odelalloc=force  : delalloc always enabled, so we have to do
>                        writeback more aggressively in case of ENOSPC.
> 
> So one can force delalloc and safely use this writeback mode in a
> multi-user environment. OpenVZ already has this. I'll prepare a patch
> if you are interested in that feature.

Yeah, I'd really rather not do that.  There are significant downsides
to your proposed odelalloc=force mode.  One of them is that we could
run out of space and not notice.  If the application doesn't call
fsync() and check the return value, and simply close()'s the file and
then exits, then when the writeback threads do get around to writing
the file, the block allocation could fail, and oops, data gets lost.
There's a _reason_ why we disable delalloc when we're close to a full
fs.  The only alternative is to be super conservative when doing your
block reservation calculations, and in that case you end up returning
ENOSPC far too soon.
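
To spell out the failure mode for application writers: with delayed
allocation, write() can succeed long before any blocks are allocated, so a
late ENOSPC can only be caught at fsync()/close() time. A minimal userspace
illustration follows; nothing here is ext4-specific and the helper name is
made up.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 0 only once the data has actually been accepted by the fs. */
static int write_and_sync(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len) {	/* may still be delayed-allocated */
		close(fd);
		return -1;
	}
	if (fsync(fd) < 0) {		/* a late ENOSPC shows up here ... */
		perror("fsync");
		close(fd);
		return -1;
	}
	return close(fd);		/* ... or, in the worst case, here */
}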

						- Ted


