* [RFC][PATCH] try not to let dirty inodes fester
From: Dave Hansen @ 2010-10-01 19:14 UTC
To: linux-kernel; +Cc: hch, lnxninja, axboe, pbadari, Dave Hansen
I've got a bug that I've been investigating. The inode cache for a
certain fs grows and grows, despite running
echo 2 > /proc/sys/vm/drop_caches
all the time. Not that running drop_caches is a good idea, but it
_should_ force things to stay under control. That is, unless the
inodes are dirty.
I think I'm seeing a case where the inode's dentry goes away and it
hits iput_final(). It is dirty, so it stays off the inode_unused
list, waiting around for writeback.
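For reference, the check I'm talking about looks roughly like this
(a simplified sketch of the generic_forget_inode()/iput_final() path
in fs/inode.c of this era; exact structure varies by version):

	/*
	 * Only clean inodes get parked on inode_unused, and that is
	 * the only list prune_icache() -- and thus drop_caches --
	 * ever scans for inodes to free.
	 */
	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
		list_move(&inode->i_list, &inode_unused);

So a dirty inode is invisible to the inode shrinker until writeback
gets around to cleaning it.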
Then, the periodic writeback happens, and we end up in
wb_writeback(). One of the first things we do in the loop (before
writing out inodes) is this:
if (work->for_background && !over_bground_thresh())
break;
over_bground_thresh() doesn't take dirty inodes into account. So
if we are in a situation where there are no dirty pages, we will
trip this, and break. If the system continues to dirty inodes
without dirtying any pages along the way, I don't think we will
ever do periodic writeback of the dirty inodes.
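For reference, over_bground_thresh() looks roughly like this (from
fs/fs-writeback.c of this era); note that it consults page counts
only, never dirty inodes:

	static inline bool over_bground_thresh(void)
	{
		unsigned long background_thresh, dirty_thresh;

		global_dirty_limits(&background_thresh, &dirty_thresh);

		return (global_page_state(NR_FILE_DIRTY) +
			global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
	}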
The attached patch moves the check down below some of the inode
writeback. It seems to do some good, but I'm worried that it
will cause additional I/O when we are below the writeback
thresholds.
---
linux-2.6.git-dave/fs/fs-writeback.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff -puN fs/fs-writeback.c~wb.diff fs/fs-writeback.c
--- linux-2.6.git/fs/fs-writeback.c~wb.diff 2010-10-01 12:12:11.000000000 -0700
+++ linux-2.6.git-dave/fs/fs-writeback.c 2010-10-01 12:12:11.000000000 -0700
@@ -625,12 +625,10 @@ static long wb_writeback(struct bdi_writ
break;
/*
- * For background writeout, stop when we are below the
- * background dirty threshold
+ * inodes are not accounted for in the background thresholds
+ * so we might leave too many of them dirty unless we do
+ * _some_ writeout without concern for over_bground_thresh()
*/
- if (work->for_background && !over_bground_thresh())
- break;
-
wbc.more_io = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
@@ -646,6 +644,13 @@ static long wb_writeback(struct bdi_writ
wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
/*
+ * For background writeout, stop when we are below the
+ * background dirty threshold
+ */
+ if (work->for_background && !over_bground_thresh())
+ break;
+
+ /*
* If we consumed everything, see if we have more
*/
if (wbc.nr_to_write <= 0)
* Re: [RFC][PATCH] try not to let dirty inodes fester
From: Dave Chinner @ 2010-10-02 11:32 UTC
To: Dave Hansen; +Cc: linux-kernel, hch, lnxninja, axboe, pbadari
On Fri, Oct 01, 2010 at 12:14:49PM -0700, Dave Hansen wrote:
>
> I've got a bug that I've been investigating. The inode cache for a
> certain fs grows and grows, despite running
>
> echo 2 > /proc/sys/vm/drop_caches
>
> all the time. Not that running drop_caches is a good idea, but it
> _should_ force things to stay under control. That is, unless the
> inodes are dirty.
What's the filesystem, and what's the test case?
> I think I'm seeing a case where the inode's dentry goes away and it
> hits iput_final(). It is dirty, so it stays off the inode_unused
> list, waiting around for writeback.
Right - it should be on the bdi->wb->b_dirty list waiting to be
expired and written back, or already on the expired writeback queues
and waiting to be written again.
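The queues I mean are the per-bdi writeback lists, roughly this
(from include/linux/backing-dev.h of this era):

	struct bdi_writeback {
		...
		struct list_head b_dirty;	/* dirty inodes, not yet expired */
		struct list_head b_io;		/* expired, parked for writeback */
		struct list_head b_more_io;	/* parked for more writeback */
	};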
> Then, the periodic writeback happens, and we end up in
> wb_writeback(). One of the first things we do in the loop (before
> writing out inodes) is this:
>
> if (work->for_background && !over_bground_thresh())
> break;
Sure, but the periodic ->for_kupdate flushing should be writing
any inode older than 30s and should be running every 5s. Hence the
background writeback aborting should not be affecting the cleaning
of dirty inodes, and I don't think this is the problem you are
looking for.
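The expiry I'm referring to is roughly this, from wb_writeback():

	if (wbc.for_kupdate) {
		wbc.older_than_this = &oldest_jif;
		oldest_jif = jiffies -
				msecs_to_jiffies(dirty_expire_interval * 10);
	}

with dirty_expire_interval defaulting to 30s and the flusher thread
waking up every dirty_writeback_interval (5s by default). There is
no for_background bail-out on that path.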
Without knowing what filesystem or what you are doing to grow the
inode cache, it's pretty hard to say much more than this....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC][PATCH] try not to let dirty inodes fester
From: Dave Hansen @ 2010-10-05 15:25 UTC
To: Dave Chinner; +Cc: linux-kernel, hch, lnxninja, axboe, pbadari, Yuri L Volobuev
On Sat, 2010-10-02 at 21:32 +1000, Dave Chinner wrote:
> On Fri, Oct 01, 2010 at 12:14:49PM -0700, Dave Hansen wrote:
> >
> > I've got a bug that I've been investigating. The inode cache for a
> > certain fs grows and grows, despite running
> >
> > echo 2 > /proc/sys/vm/drop_caches
> >
> > all the time. Not that running drop_caches is a good idea, but it
> > _should_ force things to stay under control. That is, unless the
> > inodes are dirty.
>
> What's the filesystem, and what's the test case?
It's GPFS, which is a binary blob to me, unfortunately. I've seen some
of the same behavior with ext3, but only after changing some of the
dirty writeout tunables to absurd values. I think the complication with
GPFS in particular is that it doesn't use Linux's buffer cache. We
don't trigger any of the page-based dirty watermarks since no _pages_
are being dirtied.
I've seen it happen when creating or touching large numbers of empty
files. Yuri (cc'd) has seen it happen when mmap()'ing files but not
modifying them, since noatime is not set.
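A rough sketch of the kind of load that shows it (the names and count
here are made up, just to illustrate; every create dirties an inode
without dirtying a single page):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char name[64];
		int i, fd;

		for (i = 0; i < 1000000; i++) {
			snprintf(name, sizeof(name), "f-%d", i);
			/* creating an empty file dirties its inode only */
			fd = open(name, O_CREAT | O_WRONLY, 0644);
			if (fd >= 0)
				close(fd);
		}
		return 0;
	}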
The original case that we were seeing was an NFS server serving up a
GPFS filesystem.
> > I think I'm seeing a case where the inode's dentry goes away and it
> > hits iput_final(). It is dirty, so it stays off the inode_unused
> > list, waiting around for writeback.
>
> Right - it should be on the bdi->wb->b_dirty list waiting to be
> expired and written back, or already on the expired writeback queues
> and waiting to be written again.
>
> > Then, the periodic writeback happens, and we end up in
> > wb_writeback(). One of the first things we do in the loop (before
> > writing out inodes) is this:
> >
> > if (work->for_background && !over_bground_thresh())
> > break;
>
> Sure, but the periodic ->for_kupdate flushing should be writing
> any inode older than 30s and should be running every 5s. Hence the
> background writeback aborting should not be affecting the cleaning
> of dirty inodes, and I don't think this is the problem you are
> looking for.
Yeah, I think you're right. I missed that call site when I was going
through it.
> Without knowing what filesystem or what you are doing to grow the
> inode cache, it's pretty hard to say much more than this....
Thanks for looking at it. I'm trying to see if I can reproduce any of
this with any of the in-tree fs's.
-- Dave
* Re: [RFC][PATCH] try not to let dirty inodes fester
From: Christoph Hellwig @ 2010-10-05 15:36 UTC
To: Dave Hansen
Cc: Dave Chinner, linux-kernel, hch, lnxninja, axboe, pbadari,
Yuri L Volobuev
On Tue, Oct 05, 2010 at 08:25:45AM -0700, Dave Hansen wrote:
> It's GPFS, which is a binary blob to me, unfortunately.
Let them fix their junk themselves, then. I'm rather annoyed that you
are actually brave enough to bother us with dealing with this.
Seriously, if just about every IBMer is now a GPFS henchman in
disguise, I might as well stop helping IBM at all. And in FS land it
really looks like that recently. What about contributing to the
in-kernel cluster filesystems instead of making our lives a pain for
your personal gain?
* Re: [RFC][PATCH] try not to let dirty inodes fester
From: Dave Hansen @ 2010-10-11 22:01 UTC
To: Christoph Hellwig
Cc: Dave Chinner, linux-kernel, hch, lnxninja, axboe, pbadari,
Yuri L Volobuev
Dave, Christoph,
Thanks again for looking at this. It turned out that dirty data wasn't
causing this at all. It was dispose_list() taking extraordinarily long
to complete; there were cases where it took 2 or 3 minutes per batch
of ~75 inodes. That was all due to the underlying filesystem (GPFS)
taking a couple of seconds for each clear_inode(), which kept the
kernel from doing any effective slab reclaim.
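For the curious, the relevant loop is roughly this (simplified from
dispose_list() in fs/inode.c of this era):

	while (!list_empty(head)) {
		struct inode *inode;

		inode = list_first_entry(head, struct inode, i_list);
		list_del(&inode->i_list);

		if (inode->i_data.nrpages)
			truncate_inode_pages(&inode->i_data, 0);
		clear_inode(inode);	/* calls the fs's ->clear_inode() */
		...
	}

At a couple of seconds per clear_inode(), a batch of ~75 inodes easily
takes minutes, and reclaim is blocked the whole time.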
Sorry for the noise.
-- Dave