From: Jan Kara <jack@suse.cz>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>, Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
Johannes Weiner <hannes@cmpxchg.org>,
Christoph Hellwig <hch@lst.de>,
Jan Engelhardt <jengelh@medozas.de>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 4/5] writeback: avoid livelocking WB_SYNC_ALL writeback
Date: Wed, 10 Nov 2010 00:18:40 +0100 [thread overview]
Message-ID: <20101109231840.GC11214@quack.suse.cz> (raw)
In-Reply-To: <20101109144346.21d6a5e4.akpm@linux-foundation.org>
On Tue 09-11-10 14:43:46, Andrew Morton wrote:
> On Tue, 09 Nov 2010 07:09:20 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > From: Jan Kara <jack@suse.cz>
> >
> > When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
> > usually set to LONG_MAX. The logic in wb_writeback() then calls
> > __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and thus
> > we easily end up with negative nr_to_write after the function returns.
>
> No, nr_to_write can only be negative if the filesystem wrote back more
> pages than requested.
Since some time, write_cache_pages() does not stop when nr_to_write
<= 0 in WB_SYNC_ALL mode as that is a possible data-integrity issue (we
could have written newly created pages but not the ones written before
sync was called). So nr_to_write gets negative rather easily in
WB_SYNC_ALL mode.
> > wb_writeback() then decides we need another round of writeback but this
> > is wrong in some cases! For example when a single large file is
> > continuously dirtied, we would never finish syncing it because each pass
> > would be able to write MAX_WRITEBACK_PAGES and inode dirty timestamp
> > never gets updated (as inode is never completely clean).
>
> Well we shouldn't have asked the function to write LONG_MAX pages then!
>
> The way this used to work was to try to write back N=(total dirty pages
> + total unstable pages + various fudge factors) to each superblock. So
> each superblock will get fully written back unless someone is madly
> writing to it. If that _is_ happening then we'll write a large amount
> of data to it and will then give up and move onto the next superblock.
>
> But the "large amount of data" is constrained to a sane upper limit:
> total amount of dirty memory plus fudge factors. Increasing that sane
> upper limit to an insane 2^63-1 pages will *of course* cause sync() to
> livelock.
>
> Why was that sane->insane change made?
Note that we are speaking about WB_SYNC_ALL mode and for above mentioned
data integrity reason any finite nr_to_write is just wrong... That's why we
do all that complex page tagging livelock avoidance thing in
write_cache_pages().
> > Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We
> > do not need nr_to_write in WB_SYNC_ALL mode anyway since livelock
> > avoidance is done differently for it.
>
> Here the changelog should spell out what "done differently" means.
> Because I really am unsure what is begin referred to.
>
> I don't really see how this patch changes anything. For WB_SYNC_ALL
> requests the code will still try to write out 2^63 pages, only it does
> it all in a single writeback_inodes_wb() call. What prevents that call
> itself from getting livelocked?
I'm referring to the livelock avoidance using page tagging. Fengguang
actually added a note about this into a comment in the code but it's not
in the changelog. And you're right it should be here.
> Perhaps the unmentioned problem here is that each call to
> writeback_inodes_wb(MAX_WRITEBACK_PAGES) will restart its walk across
> the inode lists. So instead of giving up on a being-written-to-file,
> we continuously revisit it again and again and again.
>
> Correct? If so, please add the description. If incorrect, please add
> the description as well ;)
Yes, that's the problem.
> Root cause time: it's those damn per-sb inode lists *again*. They're
> just awful. We need some data structure there which is more amenable
> to being iterated over. Something against which we can store cursors,
> for a start.
This would be definitely nice. But in this particular case, since we have
that page tagging livelock avoidance, we can just do all we need in a one
big sweep so we are OK.
Suggestion for the new changelog:
When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX. The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and
we easily end up with negative nr_to_write after the function returns.
This is because write_cache_pages() does not stop writing when
nr_to_write drops to zero in WB_SYNC_ALL mode.
When nr_to_write is <= 0 wb_writeback() decides we need another round of
writeback but this is wrong in some cases! For example when a single large
file is continuously dirtied, we would never finish syncing it because each
pass would be able to write MAX_WRITEBACK_PAGES and inode dirty timestamp
never gets updated (as inode is never completely clean). Thus
__writeback_inodes_sb() would write the redirtied inode again and again.
Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We
do not need nr_to_write in WB_SYNC_ALL mode anyway since
write_cache_pages() does livelock avoidance using page tagging in
WB_SYNC_ALL mode.
After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
longer able to stall sync forever.
-
Is this better?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>, Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
Johannes Weiner <hannes@cmpxchg.org>,
Christoph Hellwig <hch@lst.de>,
Jan Engelhardt <jengelh@medozas.de>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 4/5] writeback: avoid livelocking WB_SYNC_ALL writeback
Date: Wed, 10 Nov 2010 00:18:40 +0100 [thread overview]
Message-ID: <20101109231840.GC11214@quack.suse.cz> (raw)
In-Reply-To: <20101109144346.21d6a5e4.akpm@linux-foundation.org>
On Tue 09-11-10 14:43:46, Andrew Morton wrote:
> On Tue, 09 Nov 2010 07:09:20 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > From: Jan Kara <jack@suse.cz>
> >
> > When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
> > usually set to LONG_MAX. The logic in wb_writeback() then calls
> > __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and thus
> > we easily end up with negative nr_to_write after the function returns.
>
> No, nr_to_write can only be negative if the filesystem wrote back more
> pages than requested.
Since some time, write_cache_pages() does not stop when nr_to_write
<= 0 in WB_SYNC_ALL mode as that is a possible data-integrity issue (we
could have written newly created pages but not the ones written before
sync was called). So nr_to_write gets negative rather easily in
WB_SYNC_ALL mode.
> > wb_writeback() then decides we need another round of writeback but this
> > is wrong in some cases! For example when a single large file is
> > continuously dirtied, we would never finish syncing it because each pass
> > would be able to write MAX_WRITEBACK_PAGES and inode dirty timestamp
> > never gets updated (as inode is never completely clean).
>
> Well we shouldn't have asked the function to write LONG_MAX pages then!
>
> The way this used to work was to try to write back N=(total dirty pages
> + total unstable pages + various fudge factors) to each superblock. So
> each superblock will get fully written back unless someone is madly
> writing to it. If that _is_ happening then we'll write a large amount
> of data to it and will then give up and move onto the next superblock.
>
> But the "large amount of data" is constrained to a sane upper limit:
> total amount of dirty memory plus fudge factors. Increasing that sane
> upper limit to an insane 2^63-1 pages will *of course* cause sync() to
> livelock.
>
> Why was that sane->insane change made?
Note that we are speaking about WB_SYNC_ALL mode and for above mentioned
data integrity reason any finite nr_to_write is just wrong... That's why we
do all that complex page tagging livelock avoidance thing in
write_cache_pages().
> > Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We
> > do not need nr_to_write in WB_SYNC_ALL mode anyway since livelock
> > avoidance is done differently for it.
>
> Here the changelog should spell out what "done differently" means.
> Because I really am unsure what is begin referred to.
>
> I don't really see how this patch changes anything. For WB_SYNC_ALL
> requests the code will still try to write out 2^63 pages, only it does
> it all in a single writeback_inodes_wb() call. What prevents that call
> itself from getting livelocked?
I'm referring to the livelock avoidance using page tagging. Fengguang
actually added a note about this into a comment in the code but it's not
in the changelog. And you're right it should be here.
> Perhaps the unmentioned problem here is that each call to
> writeback_inodes_wb(MAX_WRITEBACK_PAGES) will restart its walk across
> the inode lists. So instead of giving up on a being-written-to-file,
> we continuously revisit it again and again and again.
>
> Correct? If so, please add the description. If incorrect, please add
> the description as well ;)
Yes, that's the problem.
> Root cause time: it's those damn per-sb inode lists *again*. They're
> just awful. We need some data structure there which is more amenable
> to being iterated over. Something against which we can store cursors,
> for a start.
This would be definitely nice. But in this particular case, since we have
that page tagging livelock avoidance, we can just do all we need in a one
big sweep so we are OK.
Suggestion for the new changelog:
When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX. The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and
we easily end up with negative nr_to_write after the function returns.
This is because write_cache_pages() does not stop writing when
nr_to_write drops to zero in WB_SYNC_ALL mode.
When nr_to_write is <= 0 wb_writeback() decides we need another round of
writeback but this is wrong in some cases! For example when a single large
file is continuously dirtied, we would never finish syncing it because each
pass would be able to write MAX_WRITEBACK_PAGES and inode dirty timestamp
never gets updated (as inode is never completely clean). Thus
__writeback_inodes_sb() would write the redirtied inode again and again.
Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We
do not need nr_to_write in WB_SYNC_ALL mode anyway since
write_cache_pages() does livelock avoidance using page tagging in
WB_SYNC_ALL mode.
After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
longer able to stall sync forever.
-
Is this better?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2010-11-09 23:18 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-11-08 23:09 [PATCH 0/5] writeback livelock fixes Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-08 23:09 ` [PATCH 1/5] writeback: integrated background writeback work Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-08 23:09 ` [PATCH 2/5] writeback: trace wakeup event for background writeback Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-08 23:09 ` [PATCH 3/5] writeback: stop background/kupdate works from livelocking other works Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-09 21:13 ` Andrew Morton
2010-11-09 21:13 ` Andrew Morton
2010-11-09 22:28 ` Jan Kara
2010-11-09 22:28 ` Jan Kara
2010-11-09 23:00 ` Andrew Morton
2010-11-09 23:00 ` Andrew Morton
2010-11-09 23:56 ` Jan Kara
2010-11-09 23:56 ` Jan Kara
2010-11-10 23:37 ` Andrew Morton
2010-11-10 23:37 ` Andrew Morton
2010-11-11 0:40 ` Wu Fengguang
2010-11-11 0:40 ` Wu Fengguang
2010-11-11 13:32 ` Christoph Hellwig
2010-11-11 13:32 ` Christoph Hellwig
2010-11-11 16:44 ` Jan Kara
2010-11-11 16:44 ` Jan Kara
2010-11-08 23:09 ` [PATCH 4/5] writeback: avoid livelocking WB_SYNC_ALL writeback Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-09 22:43 ` Andrew Morton
2010-11-09 22:43 ` Andrew Morton
2010-11-09 23:18 ` Jan Kara [this message]
2010-11-09 23:18 ` Jan Kara
2010-11-10 2:26 ` Wu Fengguang
2010-11-10 2:26 ` Wu Fengguang
2010-11-08 23:09 ` [PATCH 5/5] writeback: check skipped pages on WB_SYNC_ALL Wu Fengguang
2010-11-08 23:09 ` Wu Fengguang
2010-11-09 22:47 ` Andrew Morton
2010-11-09 22:47 ` Andrew Morton
2010-11-09 23:16 ` Wu Fengguang
2010-11-09 23:16 ` Wu Fengguang
2010-11-08 23:23 ` [PATCH 0/5] writeback livelock fixes Wu Fengguang
-- strict thread matches above, loose matches on Subject: below --
2010-11-10 2:35 [PATCH 0/5] writeback livelock fixes v2 Wu Fengguang
2010-11-10 2:35 ` [PATCH 4/5] writeback: avoid livelocking WB_SYNC_ALL writeback Wu Fengguang
2010-11-10 2:35 ` Wu Fengguang
2010-11-10 2:35 ` Wu Fengguang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20101109231840.GC11214@quack.suse.cz \
--to=jack@suse.cz \
--cc=akpm@linux-foundation.org \
--cc=fengguang.wu@intel.com \
--cc=hannes@cmpxchg.org \
--cc=hch@lst.de \
--cc=jengelh@medozas.de \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.