From: Dave Chinner <david@fromorbit.com>
To: xfs@oss.sgi.com
Cc: chris.mason@oracle.com
Subject: Re: [PATCH 5/5] xfs: kick inode writeback when low on memory
Date: Wed, 2 Mar 2011 14:06:02 +1100
Message-ID: <20110302030602.GD4905@dastard>
In-Reply-To: <1298412969-14389-6-git-send-email-david@fromorbit.com>

On Wed, Feb 23, 2011 at 09:16:09AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When the inode cache shrinker runs, we may have lots of dirty inodes queued up
> in the VFS dirty queues that have not been expired. The typical case for this
> with XFS is atime updates. The result is that a highly concurrent workload that
> copies files and then later reads them (say to verify checksums) dirties all
> the inodes again, even when relatime is used.
>
> In a constrained memory environment, this results in a large number of dirty
> inodes using all of available memory and memory reclaim being unable to free
> them as dirty inodes are considered active. This problem was uncovered by Chris
> Mason during recent low memory stress testing.
>
> The fix is to trigger VFS level writeback from the XFS inode cache shrinker if
> there isn't already writeback in progress. This ensures that when we enter a
> low memory situation we start cleaning inodes (via the flusher thread) on the
> filesystem immediately, thereby making it more likely that we will be able to
> evict those dirty inodes from the VFS in the near future.
>
> The mechanism is not perfect - it only acts on the current filesystem, so if
> all the dirty inodes are on a different filesystem it won't help. However, it
> seems to be a valid assumption that the filesystem with lots of dirty inodes
> is going to have the shrinker called very soon after the memory shortage
> begins, so this shouldn't be an issue.
>
> The other flaw is that there is no guarantee that the flusher thread will make
> progress fast enough to clean the dirty inodes so they can be reclaimed in the
> near future. However, this mechanism does improve the resilience of the
> filesystem under the test conditions - instead of reliably triggering the OOM
> killer 20 minutes into the stress test, it took more than 6 hours before it
> happened.
>
> This small addition definitely improves the low memory resilience of XFS on
> this type of workload, and best of all it has no impact on performance when
> memory is not constrained.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/linux-2.6/xfs_sync.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index 35138dc..3abde91 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.c
> +++ b/fs/xfs/linux-2.6/xfs_sync.c
> @@ -1044,6 +1044,17 @@ xfs_reclaim_inode_shrink(
>  	if (!(gfp_mask & __GFP_FS))
>  		return -1;
>
> +	/*
> +	 * make sure VFS is cleaning inodes so they can be pruned
> +	 * and marked for reclaim in the XFS inode cache. If we don't
> +	 * do this the VFS can accumulate dirty inodes and we can OOM
> +	 * before they are cleaned by the periodic VFS writeback.
> +	 *
> +	 * This takes VFS level locks, so we can only do this after
> +	 * the __GFP_FS checks otherwise lockdep gets really unhappy.
> +	 */
> +	writeback_inodes_sb_nr_if_idle(mp->m_super, nr_to_scan);
> +

Well, this generates a deadlock if we get a low memory situation
before the bdi flusher thread for the underlying device has been
created. That is, we get low memory, kick
writeback_inodes_sb_nr_if_idle(), and end up with the bdi-default
thread trying to create the flush-x:y thread, which gets stuck
waiting for kthread_create() to complete.

kthread_create() never completes because the do_fork() call in
kthreadd fails memory allocation and again calls (via the shrinker)
writeback_inodes_sb_nr_if_idle(), which thinks that
writeback_in_progress(bdi) is false, so tries to start
writeback again....
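
To make the cycle concrete, here is a rough userspace model of it. This
is only a sketch: every name in it (struct model, model_alloc(),
model_kick_writeback(), model_run()) is made up for illustration and is
not a kernel symbol, and the max_depth cutoff exists purely so the model
terminates where the real system would simply hang. It also assumes,
optimistically, that a successful kick frees memory straight away.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel symbols. */
struct model {
	bool flusher_exists;	/* has the flush-x:y thread been created? */
	bool memory_low;	/* do allocations currently need reclaim? */
	int max_depth;		/* cutoff so the model terminates */
	int kicks;		/* number of writeback kick attempts */
};

static bool model_alloc(struct model *m, int depth);

/* Stands in for the shrinker calling writeback_inodes_sb_nr_if_idle(). */
static bool model_kick_writeback(struct model *m, int depth)
{
	m->kicks++;
	if (m->flusher_exists)
		return true;		/* flusher can start cleaning inodes */
	/* No flusher yet: creating it needs memory (the do_fork() step). */
	if (!model_alloc(m, depth + 1))
		return false;
	m->flusher_exists = true;
	return true;
}

/* Stands in for an allocation that enters reclaim when memory is low. */
static bool model_alloc(struct model *m, int depth)
{
	if (!m->memory_low)
		return true;
	if (depth >= m->max_depth)
		return false;		/* model cutoff; the real system hangs */
	if (!model_kick_writeback(m, depth))
		return false;
	m->memory_low = false;		/* optimistic: writeback frees memory */
	return true;
}

/* One allocation attempt under memory pressure; returns the kick count. */
static int model_run(bool flusher_exists, int max_depth)
{
	struct model m = { flusher_exists, true, max_depth, 0 };

	assert(max_depth > 0);
	(void)model_alloc(&m, 0);
	return m.kicks;
}
```

With the flusher already present, model_run() needs a single kick. With
no flusher, every kick recurses into another allocation, so the kick
count is bounded only by the artificial cutoff.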

So, writeback_inodes_sb_nr_if_idle() is busted w.r.t. only queuing a
single writeback instance, as writeback is only marked as in progress
once the queued callback is running. Perhaps writeback_in_progress()
should return true if the BDI_Pending bit is set, indicating the
flusher thread is being created right now, but I'm not sure that is
sufficient to avoid all the potential races here.
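
As a sketch of what that check would change, the single-threaded model
below counts how many writeback instances get queued by repeated
shrinker calls while the flusher thread is still being created. The
names (struct model_bdi, MODEL_RUNNING, MODEL_PENDING,
model_in_progress(), model_kick_if_idle(), model_count_queued()) are
hypothetical stand-ins, not the kernel's definitions, and because the
model has no concurrency it says nothing about the remaining races.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state bits; stand-ins for "callback running" and
 * "flusher thread being created", not the kernel's definitions. */
enum {
	MODEL_RUNNING = 1u << 0,
	MODEL_PENDING = 1u << 1,
};

struct model_bdi {
	unsigned int state;
	int queued;		/* writeback instances queued so far */
};

/* Models the in-progress test, optionally honouring the pending bit. */
static bool model_in_progress(const struct model_bdi *b, bool honour_pending)
{
	if (b->state & MODEL_RUNNING)
		return true;
	return honour_pending && (b->state & MODEL_PENDING);
}

/* Models an "if idle" kick: queue writeback unless it looks busy. */
static void model_kick_if_idle(struct model_bdi *b, bool honour_pending)
{
	if (model_in_progress(b, honour_pending))
		return;
	b->queued++;
	/* Thread creation has started; RUNNING is not set until the
	 * queued callback actually runs, which never happens here. */
	b->state |= MODEL_PENDING;
}

/* Issue 'calls' kicks before the callback ever gets to run. */
static int model_count_queued(int calls, bool honour_pending)
{
	struct model_bdi b = { 0, 0 };
	int i;

	assert(calls >= 0);
	for (i = 0; i < calls; i++)
		model_kick_if_idle(&b, honour_pending);
	return b.queued;
}
```

Ignoring the pending bit, every call queues another instance; honouring
it, only the first call does.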

I'm open to ideas here - I could convert the bdi flusher
infrastructure to cmwqs rather than using worker threads, or move
all dirty inode tracking and writeback into XFS, or ???

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs