* Re: what is the point of nr_pages information for the flusher thread?
2010-07-07 23:16 what is the point of nr_pages information for the flusher thread? Christoph Hellwig
@ 2010-07-07 23:37 ` Andrew Morton
2010-07-07 23:43 ` Christoph Hellwig
2010-07-10 14:58 ` Wu Fengguang
1 sibling, 1 reply; 5+ messages in thread
From: Andrew Morton @ 2010-07-07 23:37 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: fengguang.wu, mel, npiggin, linux-fsdevel, linux-mm
On Wed, 7 Jul 2010 19:16:11 -0400
Christoph Hellwig <hch@infradead.org> wrote:
> Currently there's three possible values we pass into the flusher thread
> for the nr_pages arguments:
I assume you're referring to wakeup_flusher_threads().
> - in sync_inodes_sb and bdi_start_background_writeback:
>
> LONG_MAX
>
> - in writeback_inodes_sb and wb_check_old_data_flush:
>
> global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS) +
> (inodes_stat.nr_inodes - inodes_stat.nr_unused)
>
> - in wakeup_flusher_threads and laptop_mode_timer_fn:
>
> global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS)
There's also free_more_memory() and do_try_to_free_pages().
You'd need to do some deep git archeology to work out what the thinking
was at those two callsites. My git machine is presently at the other
end of a slow link.
wakeup_flusher_threads() appears to have been borked. It passes
nr_pages into *each* bdi, and hence can write back far more than it was
asked to.
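To make the per-bdi point concrete, here is a minimal userspace sketch. It is not kernel code: struct toy_bdi and toy_wakeup_flushers() are hypothetical names invented for illustration, and the only thing modelled is that every device with dirty data receives the full nr_pages budget.

```c
#include <stddef.h>

/* Hypothetical stand-in for a backing device; not the kernel's struct. */
struct toy_bdi {
	long dirty_pages;	/* pages this device could write back */
	long written;		/* pages written by the last wakeup */
};

/*
 * Models the pattern described above: every device with dirty data is
 * handed the full nr_pages budget instead of a share of it.
 * Returns the total number of pages written across all devices.
 */
long toy_wakeup_flushers(struct toy_bdi *bdis, size_t n, long nr_pages)
{
	long total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		long todo;

		if (bdis[i].dirty_pages == 0)
			continue;	/* clean device, nothing to do */

		/* Each device may write up to the whole budget. */
		todo = bdis[i].dirty_pages < nr_pages ?
			bdis[i].dirty_pages : nr_pages;
		bdis[i].dirty_pages -= todo;
		bdis[i].written = todo;
		total += todo;
	}
	return total;
}
```

With two dirty devices and nr_pages = 100, the total in this model is 200 pages, twice what the caller asked for.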
> The LONG_MAX cases are trivially explained, as we ignore the nr_to_write
> value for data integrity writepage in the lowlevel writeback code, and
> the for_background in bdi_start_background_writeback has its own check
> for the background threshold. So far so good, and now it gets
> interesting.
>
> Why does writeback_inodes_sb add the number of used inodes into a value
> that is in units of pages? And why don't the other callers do this?
Again, git archeology is needed. The code's been like that for some
time. IIRC there was a bug long long ago wherein the system could have
lots of dirty inodes but zero dirty pages. The writeback code would
say "gee, no dirty pages" and would bail out, thus failing to write the
dirty inodes. Perhaps this hack was a "fix" for that behaviour. Or
perhaps not. Apparently it was so obvious that no code comment was
needed.
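For illustration only, the failure mode recalled above can be sketched in userspace C. Whether this is really why the inode count was added is not established here; struct toy_counters and the toy_* helpers are hypothetical and merely mirror the sums quoted earlier.

```c
/* Hypothetical snapshot of the global counters quoted above. */
struct toy_counters {
	long nr_file_dirty;
	long nr_unstable_nfs;
	long nr_inodes;
	long nr_unused;
};

/* Budget counted in pages only. */
long toy_budget_pages_only(const struct toy_counters *c)
{
	return c->nr_file_dirty + c->nr_unstable_nfs;
}

/* Budget with the number of in-use inodes mixed in. */
long toy_budget_with_inodes(const struct toy_counters *c)
{
	return toy_budget_pages_only(c) + (c->nr_inodes - c->nr_unused);
}

/* A caller that bails out on a zero budget never reaches dirty inodes. */
int toy_would_start_writeback(long budget)
{
	return budget > 0;
}
```

With zero dirty pages and 40 in-use inodes, the pages-only budget is 0 and the caller bails out, while the mixed budget is 40 and writeback starts.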
> But seriously, how is the _global_ number of dirty and unstable pages
> a good indicator for the amount of writeback per-bdi or superblock
> anyway?
It isn't. This appears to have been an attempt to transport the
wakeup_pdflush() functionality into the new wakeup_flusher_threads()
regime. Badly.
> Somehow I'd feel much better about doing this calculation all the way
> down in wb_writeback instead of the callers so we'll at least have
> one documented place for these insanities.
* Re: what is the point of nr_pages information for the flusher thread?
From: Wu Fengguang @ 2010-07-10 14:58 UTC (permalink / raw)
To: Christoph Hellwig
Cc: mel@csn.ul.ie, akpm@linux-foundation.org, npiggin@suse.de,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Hi Christoph,
Here are some of my findings.
On Thu, Jul 08, 2010 at 07:16:11AM +0800, Christoph Hellwig wrote:
> Currently there's three possible values we pass into the flusher thread
> for the nr_pages arguments:
The semantics of the current wb_writeback_work.nr_pages parameter are actually
quite different from those of the _min_pages argument in 2.6.30. The current
meaning is "max pages to write"; the old one was "min pages to write (until all
written)".
current wb_writeback():
	for (;;) {
		/*
		 * Stop writeback when nr_pages has been consumed
		 */
		if (work->nr_pages <= 0)
			break;
2.6.30 background_writeout(_min_pages):
		if (global_page_state(NR_FILE_DIRTY) +
			global_page_state(NR_UNSTABLE_NFS) < background_thresh
				&& min_pages <= 0)
			break;
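The difference between the two termination conditions can be seen in a small userspace model. This is only a sketch: struct toy_state, TOY_CHUNK and the toy_* functions are made-up names, and the chunked write loop is an assumption standing in for the real writeback path.

```c
/* Hypothetical state for a toy writeback loop; names are illustrative. */
struct toy_state {
	long dirty;			/* dirty pages outstanding */
	long background_thresh;		/* background dirty threshold */
};

#define TOY_CHUNK 10	/* pages written per iteration (assumed) */

/* Write at most 'limit' pages, and at most TOY_CHUNK, from the dirty set. */
long toy_write_chunk(struct toy_state *s, long limit)
{
	long n = s->dirty < TOY_CHUNK ? s->dirty : TOY_CHUNK;

	if (n > limit)
		n = limit;
	s->dirty -= n;
	return n;
}

/* Current semantic: nr_pages is an upper bound on the pages written. */
long toy_writeback_max(struct toy_state *s, long nr_pages)
{
	long written = 0;

	while (nr_pages > 0 && s->dirty > 0) {
		long n = toy_write_chunk(s, nr_pages);

		nr_pages -= n;
		written += n;
	}
	return written;
}

/*
 * 2.6.30-style semantic: stop only once min_pages has been consumed
 * AND the dirty count has dropped below the background threshold.
 */
long toy_writeback_min(struct toy_state *s, long min_pages)
{
	long written = 0;

	for (;;) {
		long n;

		if (s->dirty < s->background_thresh && min_pages <= 0)
			break;
		n = toy_write_chunk(s, TOY_CHUNK);
		if (n == 0)
			break;	/* nothing left to write */
		min_pages -= n;
		written += n;
	}
	return written;
}
```

Starting from 100 dirty pages with a threshold of 30 and a request of 20 pages, the "max" variant stops after 20 pages, while the "min" variant keeps going until the dirty count is below the threshold and writes 80.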
> - in sync_inodes_sb and bdi_start_background_writeback:
>
> LONG_MAX
>
> - in writeback_inodes_sb and wb_check_old_data_flush:
>
> global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS) +
> (inodes_stat.nr_inodes - inodes_stat.nr_unused)
>
> - in wakeup_flusher_threads and laptop_mode_timer_fn:
>
> global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS)
>
> The LONG_MAX cases are trivially explained, as we ignore the nr_to_write
> value for data integrity writepage in the lowlevel writeback code, and
> the for_background in bdi_start_background_writeback has its own check
> for the background threshold. So far so good, and now it gets
> interesting.
Yeah.
> Why does writeback_inodes_sb add the number of used inodes into a value
> that is in units of pages? And why don't the other callers do this?
The 2.6.30 sync_inodes_sb() has this comment:
	 * We add in the number of potentially dirty inodes, because each inode write
	 * can dirty pagecache in the underlying blockdev.
The periodic writeback also referenced it:
	nr_pages = global_page_state(NR_FILE_DIRTY) +
			global_page_state(NR_UNSTABLE_NFS) +
			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
	if (nr_pages) {
		struct wb_writeback_work work = {
			.nr_pages	= nr_pages,
Here it looks more sane to do
	if (wb_has_dirty_io(wb)) {
		struct wb_writeback_work work = {
			.nr_pages	= LONG_MAX,
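As a rough userspace illustration of that suggestion (not a patch): gate the work on the device's own dirty state and leave the page budget unbounded. struct toy_wb, struct toy_work and toy_old_data_flush() are hypothetical names, and has_dirty_io is an assumed flag standing in for wb_has_dirty_io().

```c
#include <limits.h>

/* Hypothetical per-device writeback state. */
struct toy_wb {
	int has_dirty_io;	/* stands in for wb_has_dirty_io(wb) */
};

/* Hypothetical work item; 'queued' records whether work was issued. */
struct toy_work {
	long nr_pages;
	int queued;
};

/*
 * Issue periodic writeback work only if this device has dirty data,
 * and do not derive the budget from any global counter.
 */
struct toy_work toy_old_data_flush(const struct toy_wb *wb)
{
	struct toy_work work = { 0, 0 };

	if (wb->has_dirty_io) {
		work.nr_pages = LONG_MAX;	/* no page cap */
		work.queued = 1;
	}
	return work;
}
```

The point of the sketch is only that the decision to write depends on per-device state; a device with no dirty I/O gets no work regardless of how many pages are dirty elsewhere.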
> But seriously, how is the _global_ number of dirty and unstable pages
> a good indicator for the amount of writeback per-bdi or superblock
> anyway?
Good point.
> Somehow I'd feel much better about doing this calculation all the way
> down in wb_writeback instead of the callers so we'll at least have
> one documented place for these insanities.
I guess the current "max pages to write" semantic serves as a poor
man's live-lock prevention guard. sync() wants that semantic (in this
sense the old "min pages to write" has never worked as expected). Once
proper live-lock prevention is in place, this guard will no longer be
necessary.
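A toy model of why a page cap acts as a live-lock guard, assuming a dirtier that re-dirties pages as fast as they are written. struct toy_race and toy_sync_with_budget() are invented for this sketch and do not correspond to kernel functions.

```c
/* Hypothetical toy model: one writer races one dirtier. */
struct toy_race {
	long dirty;		/* dirty pages outstanding */
	long redirty_per_pass;	/* pages dirtied again on every pass */
};

/*
 * Write up to 'chunk' pages per pass (chunk must be > 0) until no dirty
 * pages remain or the nr_pages budget is spent. Returns the number of
 * passes taken. Without the budget check, the loop would never end
 * when redirty_per_pass >= chunk.
 */
long toy_sync_with_budget(struct toy_race *r, long nr_pages, long chunk)
{
	long passes = 0;

	while (r->dirty > 0 && nr_pages > 0) {
		long n = r->dirty < chunk ? r->dirty : chunk;

		if (n > nr_pages)
			n = nr_pages;
		r->dirty -= n;
		nr_pages -= n;
		r->dirty += r->redirty_per_pass;	/* dirtier keeps up */
		passes++;
	}
	return passes;
}
```

With 100 dirty pages, 10 pages re-dirtied per pass, a chunk of 10 and a budget of 50, the loop ends after 5 passes even though the dirty count never drops.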
However, the current semantic is not suitable for other users. "To write at most
nr_pages until hitting background dirty threshold" is basically a no-op,
because the callers may as well let the normal background writeback do the job
for them.
For example, laptop_mode_timer_fn() actually wants to write the whole world, so
it wants nr_pages=LONG_MAX with the old "min pages to write" semantic.
There are other cases that try to write some pages:
	- free_more_memory()
	- do_try_to_free_pages()
	- ubifs shrink_liability()
	- ext4 ext4_nonda_switch()
They don't really know or care about the exact nr_pages to write.
The latter two functions even sync everything for simplicity.
Thanks,
Fengguang