* [PATCH 0/6] writeback fixes and trace events @ 2011-05-04 9:17 Wu Fengguang 2011-05-04 9:17 ` [PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang ` (6 more replies) 0 siblings, 7 replies; 22+ messages in thread From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw) To: Andrew Morton Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Wu Fengguang, LKML, linux-fsdevel Andrew, Here is one more collection of simple writeback patches. two already reviewed fixes [PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc [PATCH 2/6] writeback: skip balance_dirty_pages() for in-memory fs a straightforward write size optimization [PATCH 3/6] writeback: make nr_to_write a per-file limit two new trace events [PATCH 4/6] writeback: trace event writeback_single_inode [PATCH 5/6] writeback: trace event writeback_queue_io an RFC wbc_writeback_* trace format change (please drop it if it cannot get ACKs from others) [PATCH 6/6] writeback: convert to relative older_than_this in trace events Thanks, Fengguang ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc 2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang @ 2011-05-04 9:17 ` Wu Fengguang 2011-05-04 9:17 ` [PATCH 2/6] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang ` (5 subsequent siblings) 6 siblings, 0 replies; 22+ messages in thread From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw) To: Andrew Morton Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Dave Chinner, LKML, linux-fsdevel [-- Attachment #1: writeback-task_dirty_limit-comment.patch --] [-- Type: text/plain, Size: 1193 bytes --] Clarify the bdi_dirty_limit() comment. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- mm/page-writeback.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) --- linux-next.orig/mm/page-writeback.c 2011-04-19 10:18:16.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-04-19 10:18:32.000000000 +0800 @@ -437,10 +437,17 @@ void global_dirty_limits(unsigned long * *pdirty = dirty; } -/* +/** * bdi_dirty_limit - @bdi's share of dirty throttling threshold + * @bdi: the backing_dev_info to query + * @dirty: global dirty limit in pages + * + * Returns @bdi's dirty limit in pages. The term "dirty" in the context of + * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. + * And the "limit" in the name is not seriously taken as a hard limit in + * balance_dirty_pages(). * - * Allocate high/low dirty limits to fast/slow devices, in order to prevent + * It allocates high/low dirty limits to fast/slow devices, in order to prevent * - starving fast devices * - piling up dirty pages (that will take long time to sync) on slow devices * ^ permalink raw reply [flat|nested] 22+ messages in thread
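For readers who have not looked at the function this comment documents: its core is a proportional split. The following is a rough, simplified sketch of the contemporary mm/page-writeback.c logic, not the exact kernel code; the helper name is invented, and the bdi min/max ratio clamping is omitted:

	/*
	 * Sketch: each bdi receives a share of the global dirty limit
	 * proportional to its recent writeout completion fraction, so
	 * fast devices end up with higher limits than slow ones.
	 */
	static unsigned long bdi_dirty_limit_sketch(struct backing_dev_info *bdi,
						    unsigned long dirty)
	{
		u64 bdi_dirty = dirty;
		long numerator, denominator;

		/* fraction of recent writeout completions done by @bdi */
		bdi_writeout_fraction(bdi, &numerator, &denominator);

		bdi_dirty *= numerator;
		do_div(bdi_dirty, denominator);

		return bdi_dirty;	/* a soft target, not a hard cap */
	}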
* [PATCH 2/6] writeback: skip balance_dirty_pages() for in-memory fs 2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang 2011-05-04 9:17 ` [PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang @ 2011-05-04 9:17 ` Wu Fengguang 2011-05-04 9:17 ` [PATCH 3/6] writeback: make nr_to_write a per-file limit Wu Fengguang ` (4 subsequent siblings) 6 siblings, 0 replies; 22+ messages in thread From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw) To: Andrew Morton Cc: Jan Kara, Hugh Dickins, Rik van Riel, Peter Zijlstra, Wu Fengguang, Dave Chinner, LKML, linux-fsdevel [-- Attachment #1: writeback-trace-global-dirty-states-fix.patch --] [-- Type: text/plain, Size: 2964 bytes --] This avoids unnecessary checks and dirty throttling on tmpfs/ramfs. It can also prevent [ 388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 in the balance_dirty_pages tracepoint, which calls dev_name(mapping->backing_dev_info->dev) while shmem_backing_dev_info.dev is NULL. Summary notes about the tmpfs/ramfs behavior changes: In 2.6.36 and older kernels, tmpfs writes will sleep inside balance_dirty_pages() as long as we are over the (dirty+background)/2 global throttle threshold. This is because both the dirty pages and the threshold will be 0 for tmpfs/ramfs, hence this test will always evaluate to TRUE: dirty_exceeded = (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) || (nr_reclaimable + nr_writeback >= dirty_thresh); For 2.6.37, someone complained that this logic did not allow users to set vm.dirty_ratio=0, so commit 4cbec4c8b9 changed the test to dirty_exceeded = (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh) || (nr_reclaimable + nr_writeback > dirty_thresh); So 2.6.37 behaves differently for tmpfs/ramfs: they will never get throttled unless the global dirty threshold is exceeded (which is very unlikely to happen; but once it does, it will block many tasks). I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means that on a busy writing server, tmpfs write()s may get livelocked! The "inadvertent" throttling can hardly help any workload because of its "either no throttling, or throttled to death" property. So, relative to 2.6.37, this patch won't bring further noticeable changes. 
CC: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- mm/page-writeback.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) --- linux-next.orig/mm/page-writeback.c 2011-04-13 17:18:10.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-04-13 17:18:10.000000000 +0800 @@ -244,13 +244,8 @@ void task_dirty_inc(struct task_struct * static void bdi_writeout_fraction(struct backing_dev_info *bdi, long *numerator, long *denominator) { - if (bdi_cap_writeback_dirty(bdi)) { - prop_fraction_percpu(&vm_completions, &bdi->completions, + prop_fraction_percpu(&vm_completions, &bdi->completions, numerator, denominator); - } else { - *numerator = 0; - *denominator = 1; - } } static inline void task_dirties_fraction(struct task_struct *tsk, @@ -495,6 +490,9 @@ static void balance_dirty_pages(struct a bool dirty_exceeded = false; struct backing_dev_info *bdi = mapping->backing_dev_info; + if (!bdi_cap_account_dirty(bdi)) + return; + for (;;) { struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, ^ permalink raw reply [flat|nested] 22+ messages in thread
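For context on why the new early return fires for tmpfs/ramfs: the check keys off the bdi capability flags. A sketch using the flag names of that era follows (shmem additionally sets other capability bits not shown here, and this is illustrative rather than a verbatim copy of the kernel source):

	/* mm/shmem.c (sketch): tmpfs opts out of dirty accounting and
	 * writeback; note that .dev stays NULL, which is exactly what the
	 * balance_dirty_pages tracepoint dereferenced and crashed on */
	static struct backing_dev_info shmem_backing_dev_info = {
		.ra_pages	= 0,	/* no readahead */
		.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
	};

	/* include/linux/backing-dev.h (sketch): the predicate behind the
	 * early return added to balance_dirty_pages() */
	static inline bool bdi_cap_account_dirty(struct backing_dev_info *bdi)
	{
		return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
	}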
* [PATCH 3/6] writeback: make nr_to_write a per-file limit 2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang 2011-05-04 9:17 ` [PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang 2011-05-04 9:17 ` [PATCH 2/6] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang @ 2011-05-04 9:17 ` Wu Fengguang 2011-05-04 9:42 ` Christoph Hellwig 2011-05-04 9:17 ` [PATCH 4/6] writeback: trace event writeback_single_inode Wu Fengguang ` (3 subsequent siblings) 6 siblings, 1 reply; 22+ messages in thread From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw) To: Andrew Morton; +Cc: Jan Kara, Dave Chinner, Wu Fengguang, LKML, linux-fsdevel [-- Attachment #1: writeback-single-file-limit.patch --] [-- Type: text/plain, Size: 2001 bytes --] This ensures large dirty files can be written in the full 4MB writeback chunk size, rather than whatever quota remains in wbc->nr_to_write. CC: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/fs-writeback.c | 11 +++++++++++ include/linux/writeback.h | 1 + 2 files changed, 12 insertions(+) --- linux-next.orig/fs/fs-writeback.c 2011-05-04 16:00:45.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2011-05-04 16:01:00.000000000 +0800 @@ -346,6 +346,8 @@ writeback_single_inode(struct inode *ino struct writeback_control *wbc) { struct address_space *mapping = inode->i_mapping; + long per_file_limit = wbc->per_file_limit; + long uninitialized_var(nr_to_write); unsigned dirty; int ret; @@ -385,8 +387,16 @@ writeback_single_inode(struct inode *ino spin_unlock(&inode->i_lock); spin_unlock(&wb->list_lock); + if (per_file_limit) { + nr_to_write = wbc->nr_to_write; + wbc->nr_to_write = per_file_limit; + } + ret = do_writepages(mapping, wbc); + if (per_file_limit) + wbc->nr_to_write += nr_to_write - per_file_limit; + /* * Make sure to wait on the data before writing out the metadata. * This is important for filesystems that modify metadata on data @@ -710,6 +720,7 @@ static long wb_writeback(struct bdi_writ wbc.more_io = 0; wbc.nr_to_write = write_chunk; + wbc.per_file_limit = write_chunk; wbc.pages_skipped = 0; wbc.inodes_cleaned = 0; --- linux-next.orig/include/linux/writeback.h 2011-05-04 16:00:45.000000000 +0800 +++ linux-next/include/linux/writeback.h 2011-05-04 16:01:00.000000000 +0800 @@ -28,6 +28,7 @@ struct writeback_control { older than this */ long nr_to_write; /* Write this many pages, and decrement this for each page written */ + long per_file_limit; /* Write this many pages for one file */ long pages_skipped; /* Pages which were not written */ long inodes_cleaned; /* # of inodes cleaned */ ^ permalink raw reply [flat|nested] 22+ messages in thread
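A worked example of the save/override/restore arithmetic in this patch (the numbers are invented for illustration):

	/*
	 * Suppose the batch is nearly out of quota when a large dirty
	 * file comes up:
	 *
	 *   wbc->nr_to_write    == 100    quota left in this batch
	 *   wbc->per_file_limit == 1024   full writeback chunk
	 *
	 * writeback_single_inode() then does:
	 *
	 *   nr_to_write      = 100;          save the caller's quota
	 *   wbc->nr_to_write = 1024;         give this file a full chunk
	 *   do_writepages(mapping, wbc);     writes, say, 600 pages,
	 *                                    leaving wbc->nr_to_write == 424
	 *   wbc->nr_to_write += 100 - 1024;  i.e. 424 - 924 == -500
	 *
	 * All 600 pages are still charged against the caller's quota
	 * (100 - 600 == -500), but the file itself was not cut short
	 * at 100 pages.
	 */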
* Re: [PATCH 3/6] writeback: make nr_to_write a per-file limit 2011-05-04 9:17 ` [PATCH 3/6] writeback: make nr_to_write a per-file limit Wu Fengguang @ 2011-05-04 9:42 ` Christoph Hellwig 2011-05-04 11:52 ` Wu Fengguang 0 siblings, 1 reply; 22+ messages in thread From: Christoph Hellwig @ 2011-05-04 9:42 UTC (permalink / raw) To: Wu Fengguang; +Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML, linux-fsdevel On Wed, May 04, 2011 at 05:17:10PM +0800, Wu Fengguang wrote: > This ensures large dirty files can be written in the full 4MB writeback > chunk size, rather than whatever quota remains in wbc->nr_to_write. I like the high-level idea, but the implementation of overriding nr_to_write and then copying it back seems rather ugly. The basic problem seems to be that struct writeback_control is designed to control writeback of a single file, but we keep abusing it for writing multiple files in writeback_sb_inodes and its callers. It seems like we should only build the struct writeback_control from struct wb_writeback_work down in writeback_sb_inodes, even if that means passing some more information to it either in struct wb_writeback_work or on the stack. Then writeback_sb_inodes can do something like if (wbc.sync_mode == WB_SYNC_NONE) wbc.nr_to_write = min(MAX_WRITEBACK_PAGES, work->nr_pages); else wbc.nr_to_write = LONG_MAX; for each inode it writes. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/6] writeback: make nr_to_write a per-file limit 2011-05-04 9:42 ` Christoph Hellwig @ 2011-05-04 11:52 ` Wu Fengguang 2011-05-04 15:51 ` Wu Fengguang 0 siblings, 1 reply; 22+ messages in thread From: Wu Fengguang @ 2011-05-04 11:52 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML, linux-fsdevel@vger.kernel.org On Wed, May 04, 2011 at 05:42:21PM +0800, Christoph Hellwig wrote: > On Wed, May 04, 2011 at 05:17:10PM +0800, Wu Fengguang wrote: > > This ensures large dirty files can be written in the full 4MB writeback > > chunk size, rather than whatever quota remains in wbc->nr_to_write. > > I like the high-level idea, but the implementation of overriding > nr_to_write and then copying it back seems rather ugly. > > The basic problem seems to be that struct writeback_control is > designed to control writeback of a single file, but we keep abusing it > for writing multiple files in writeback_sb_inodes and its callers. > > It seems like we should only build the struct writeback_control from > struct wb_writeback_work down in writeback_sb_inodes, even if that > means passing some more information to it either in struct > wb_writeback_work or on the stack. Yes, it's very reasonable and possible according to your notes in another email. > Then writeback_sb_inodes can do something like > > if (wbc.sync_mode == WB_SYNC_NONE) > wbc.nr_to_write = min(MAX_WRITEBACK_PAGES, work->nr_pages); I like the min() idea. However it has the side effect of (very possibly) smallish IO from balance_dirty_pages(), which may call us with a small ->nr_pages. We may explicitly do "write_chunk = max(MAX_WRITEBACK_PAGES, write_chunk)" in balance_dirty_pages() to retain the old behavior. > else > wbc.nr_to_write = LONG_MAX; > > for each inode it writes. Thanks, Fengguang ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/6] writeback: make nr_to_write a per-file limit 2011-05-04 11:52 ` Wu Fengguang @ 2011-05-04 15:51 ` Wu Fengguang 2011-05-04 16:18 ` Christoph Hellwig 0 siblings, 1 reply; 22+ messages in thread From: Wu Fengguang @ 2011-05-04 15:51 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML, linux-fsdevel@vger.kernel.org > > Then writeback_sb_inodes can do something like > > > > if (wbc.sync_mode == WB_SYNC_NONE) > > wbc.nr_to_write = min(MAX_WRITEBACK_PAGES, work->nr_pages); > > I like the min() idea. However it has the side effect of (very possibly) > smallish IO from balance_dirty_pages(), which may call us with a small > ->nr_pages. > > We may explicitly do "write_chunk = max(MAX_WRITEBACK_PAGES, write_chunk)" > in balance_dirty_pages() to retain the old behavior. Sorry, I was confused; please ignore the above. At least the VFS won't enlarge the writeback request from balance_dirty_pages(). Here is the scratch patch. I'll need to double check it tomorrow, but it's already running fine in the dd+tar+sync test and gets pretty good performance numbers. # test-tar-dd.sh [...] 246.62 1941.38 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 11.4383 s, 91.7 MB/s dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 2.57s system 22% cpu 11.499 total tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.03s user 4.36s system 56% cpu 28.770 total 286.04 2204.50 elapsed: 263.23000000000025 Thanks, Fengguang --- Subject: writeback: make writeback_control.nr_to_write straight Date: Wed May 04 19:54:37 CST 2011 As suggested by Christoph: Pass struct wb_writeback_work all the way down to writeback_sb_inodes(), and initialize the struct writeback_control there. struct writeback_control is basically designed to control writeback of a single file, but we keep abusing it for writing multiple files in writeback_sb_inodes() and its callers. It immediately cleans things up; e.g. suddenly wbc.nr_to_write vs work->nr_pages starts to make sense, and instead of saving and restoring pages_skipped in writeback_sb_inodes, it can always start with a clean zero value. It also makes a neat IO pattern change: large dirty files are now written in the full 4MB writeback chunk size, rather than whatever quota remains in wbc->nr_to_write. 
change set: - move writeback_control from wb_writeback() down to writeback_sb_inodes() - change return value of writeback_sb_inodes()/__writeback_inodes_wb() to enum writeback_progress - move MAX_WRITEBACK_PAGES and wb_writeback_work definitions to writeback.h - add wb_writeback_work.older_than_this - remove writeback_control.inodes_cleaned - remove wbc_writeback_* trace events for now CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/fs-writeback.c | 187 +++++++++++++++--------------- include/linux/writeback.h | 29 +++++ mm/backing-dev.c | 6 - mm/page-writeback.c | 11 -- 4 files changed, 116 insertions(+), 117 deletions(-) --- linux-next.orig/fs/fs-writeback.c 2011-05-04 21:30:32.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2011-05-04 23:40:59.000000000 +0800 @@ -30,22 +30,6 @@ #include "internal.h" /* - * Passed into wb_writeback(), essentially a subset of writeback_control - */ -struct wb_writeback_work { - long nr_pages; - struct super_block *sb; - enum writeback_sync_modes sync_mode; - unsigned int tagged_sync:1; - unsigned int for_kupdate:1; - unsigned int range_cyclic:1; - unsigned int for_background:1; - - struct list_head list; /* pending work list */ - struct completion *done; /* set if the caller waits */ -}; - -/* * Include the creation of the trace points after defining the * wb_writeback_work structure so that the definition remains local to this * file. @@ -463,7 +447,6 @@ writeback_single_inode(struct inode *ino * No need to add it back to the LRU. */ list_del_init(&inode->i_wb_list); - wbc->inodes_cleaned++; } } inode_sync_complete(inode); @@ -502,19 +485,53 @@ static bool pin_sb_for_writeback(struct * If @only_this_sb is true, then find and write all such * inodes. Otherwise write only ones which go sequentially * in reverse order. - * - * Return 1, if the caller writeback routine should be - * interrupted. Otherwise return 0. */ -static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb, - struct writeback_control *wbc, bool only_this_sb) +enum writeback_progress { + WROTE_FULL_CHUNK, + WROTE_SOME_PAGES, + WROTE_SOME_INODES, + WROTE_NOTHING, +}; + +static int writeback_sb_inodes(struct super_block *sb, + struct bdi_writeback *wb, + struct wb_writeback_work *work) { + struct writeback_control wbc = { + .sync_mode = work->sync_mode, + .tagged_sync = work->tagged_sync, + .older_than_this = work->older_than_this, + .for_kupdate = work->for_kupdate, + .for_background = work->for_background, + .range_cyclic = work->range_cyclic, + .range_start = 0, + .range_end = LLONG_MAX, + }; + long write_chunk = MAX_WRITEBACK_PAGES; + long wrote = 0; + bool inode_cleaned = false; + + /* + * WB_SYNC_ALL mode does livelock avoidance by syncing dirty + * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX + * here avoids calling into writeback_inodes_wb() more than once. 
+ * + * The intended call sequence for WB_SYNC_ALL writeback is: + * + * wb_writeback() + * writeback_sb_inodes() <== called only once + * write_cache_pages() <== called once for each inode + * (quickly) tag currently dirty pages + * (maybe slowly) sync all tagged pages + */ + if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync) + write_chunk = LONG_MAX; + while (!list_empty(&wb->b_io)) { - long pages_skipped; struct inode *inode = wb_inode(wb->b_io.prev); if (inode->i_sb != sb) { - if (only_this_sb) { + if (work->sb) { /* * We only want to write back data for this * superblock, move all inodes not belonging @@ -529,7 +546,7 @@ static int writeback_sb_inodes(struct su * Bounce back to the caller to unpin this and * pin the next superblock. */ - return 0; + break; } /* @@ -543,34 +560,44 @@ static int writeback_sb_inodes(struct su requeue_io(inode, wb); continue; } - __iget(inode); + write_chunk = min(write_chunk, work->nr_pages); + wbc.nr_to_write = write_chunk; + wbc.pages_skipped = 0; + + writeback_single_inode(inode, wb, &wbc); - pages_skipped = wbc->pages_skipped; - writeback_single_inode(inode, wb, wbc); - if (wbc->pages_skipped != pages_skipped) { + work->nr_pages -= write_chunk - wbc.nr_to_write; + wrote += write_chunk - wbc.nr_to_write; + if (wbc.pages_skipped) { /* * writeback is not making progress due to locked * buffers. Skip this inode for now. */ redirty_tail(inode, wb); - } + } else if (!(inode->i_state & I_DIRTY)) + inode_cleaned = true; spin_unlock(&inode->i_lock); spin_unlock(&wb->list_lock); iput(inode); cond_resched(); spin_lock(&wb->list_lock); - if (wbc->nr_to_write <= 0) - return 1; + if (wrote >= MAX_WRITEBACK_PAGES) + return WROTE_FULL_CHUNK; + if (work->nr_pages <= 0) + return WROTE_FULL_CHUNK; } - /* b_io is empty */ - return 1; + if (wrote) + return WROTE_SOME_PAGES; + if (inode_cleaned) + return WROTE_SOME_INODES; + return WROTE_NOTHING; } -static void __writeback_inodes_wb(struct bdi_writeback *wb, - struct writeback_control *wbc) +static int __writeback_inodes_wb(struct bdi_writeback *wb, + struct wb_writeback_work *work) { - int ret = 0; + int ret = WROTE_NOTHING; while (!list_empty(&wb->b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); @@ -580,34 +607,26 @@ static void __writeback_inodes_wb(struct requeue_io(inode, wb); continue; } - ret = writeback_sb_inodes(sb, wb, wbc, false); + ret = writeback_sb_inodes(sb, wb, work); drop_super(sb); - if (ret) + if (ret == WROTE_FULL_CHUNK) break; } /* Leave any unwritten inodes on b_io */ + return ret; } void writeback_inodes_wb(struct bdi_writeback *wb, - struct writeback_control *wbc) + struct wb_writeback_work *work) { spin_lock(&wb->list_lock); if (list_empty(&wb->b_io)) - queue_io(wb, wbc->older_than_this); - __writeback_inodes_wb(wb, wbc); + queue_io(wb, work->older_than_this); + __writeback_inodes_wb(wb, work); spin_unlock(&wb->list_lock); } -/* - * The maximum number of pages to writeout in a single bdi flush/kupdate - * operation. We do this so we don't hold I_SYNC against an inode for - * enormous amounts of time, which would block a userspace task which has - * been forced to throttle against that inode. Also, the code reevaluates - * the dirty each time it has written this many pages. 
- */ -#define MAX_WRITEBACK_PAGES 1024 - static inline bool over_bground_thresh(void) { unsigned long background_thresh, dirty_thresh; @@ -636,42 +655,12 @@ static inline bool over_bground_thresh(v static long wb_writeback(struct bdi_writeback *wb, struct wb_writeback_work *work) { - struct writeback_control wbc = { - .sync_mode = work->sync_mode, - .tagged_sync = work->tagged_sync, - .older_than_this = NULL, - .for_kupdate = work->for_kupdate, - .for_background = work->for_background, - .range_cyclic = work->range_cyclic, - }; - unsigned long oldest_jif; - long wrote = 0; - long write_chunk = MAX_WRITEBACK_PAGES; + long nr_pages = work->nr_pages; + unsigned long oldest_jif = jiffies; struct inode *inode; + int progress; - if (!wbc.range_cyclic) { - wbc.range_start = 0; - wbc.range_end = LLONG_MAX; - } - - /* - * WB_SYNC_ALL mode does livelock avoidance by syncing dirty - * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX - * here avoids calling into writeback_inodes_wb() more than once. - * - * The intended call sequence for WB_SYNC_ALL writeback is: - * - * wb_writeback() - * writeback_sb_inodes() <== called only once - * write_cache_pages() <== called once for each inode - * (quickly) tag currently dirty pages - * (maybe slowly) sync all tagged pages - */ - if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync) { - write_chunk = LONG_MAX; - oldest_jif = jiffies; - wbc.older_than_this = &oldest_jif; - } + work->older_than_this = &oldest_jif; for (;;) { /* @@ -700,27 +689,18 @@ static long wb_writeback(struct bdi_writ if (work->for_kupdate || work->for_background) { oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10); - wbc.older_than_this = &oldest_jif; + work->older_than_this = &oldest_jif; } - wbc.nr_to_write = write_chunk; - wbc.pages_skipped = 0; - wbc.inodes_cleaned = 0; - retry: - trace_wbc_writeback_start(&wbc, wb->bdi); spin_lock(&wb->list_lock); if (list_empty(&wb->b_io)) - queue_io(wb, wbc->older_than_this); + queue_io(wb, work->older_than_this); if (work->sb) - writeback_sb_inodes(work->sb, wb, &wbc, true); + progress = writeback_sb_inodes(work->sb, wb, work); else - __writeback_inodes_wb(wb, &wbc); + progress = __writeback_inodes_wb(wb, work); spin_unlock(&wb->list_lock); - trace_wbc_writeback_written(&wbc, wb->bdi); - - work->nr_pages -= write_chunk - wbc.nr_to_write; - wrote += write_chunk - wbc.nr_to_write; /* * Did we write something? Try for more @@ -730,9 +710,7 @@ retry: * mean the overall work is done. So we keep looping as long * as made some progress on cleaning pages or inodes. */ - if (wbc.nr_to_write < write_chunk) - continue; - if (wbc.inodes_cleaned) + if (progress != WROTE_NOTHING) continue; /* * background writeback will start with expired inodes, and @@ -741,10 +719,10 @@ retry: * lists and cause trouble to the page reclaim. 
*/ if (work->for_background && - wbc.older_than_this && + work->older_than_this && list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) { - wbc.older_than_this = NULL; + work->older_than_this = NULL; goto retry; } /* @@ -760,7 +738,6 @@ retry: spin_lock(&wb->list_lock); if (!list_empty(&wb->b_more_io)) { inode = wb_inode(wb->b_more_io.prev); - trace_wbc_writeback_wait(&wbc, wb->bdi); spin_lock(&inode->i_lock); inode_wait_for_writeback(inode, wb); spin_unlock(&inode->i_lock); @@ -768,7 +745,7 @@ retry: spin_unlock(&wb->list_lock); } - return wrote; + return nr_pages - work->nr_pages; } /* --- linux-next.orig/mm/backing-dev.c 2011-05-04 21:30:22.000000000 +0800 +++ linux-next/mm/backing-dev.c 2011-05-04 21:49:07.000000000 +0800 @@ -262,14 +262,14 @@ int bdi_has_dirty_io(struct backing_dev_ static void bdi_flush_io(struct backing_dev_info *bdi) { - struct writeback_control wbc = { + struct wb_writeback_work work = { .sync_mode = WB_SYNC_NONE, .older_than_this = NULL, .range_cyclic = 1, - .nr_to_write = 1024, + .nr_pages = 1024, }; - writeback_inodes_wb(&bdi->wb, &wbc); + writeback_inodes_wb(&bdi->wb, &work); } /* --- linux-next.orig/mm/page-writeback.c 2011-05-04 21:30:22.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-05-04 21:57:25.000000000 +0800 @@ -494,10 +494,10 @@ static void balance_dirty_pages(struct a return; for (;;) { - struct writeback_control wbc = { + struct wb_writeback_work work = { .sync_mode = WB_SYNC_NONE, .older_than_this = NULL, - .nr_to_write = write_chunk, + .nr_pages = write_chunk, .range_cyclic = 1, }; @@ -562,15 +562,12 @@ static void balance_dirty_pages(struct a * threshold otherwise wait until the disk writes catch * up. */ - trace_wbc_balance_dirty_start(&wbc, bdi); if (bdi_nr_reclaimable > bdi_thresh) { - writeback_inodes_wb(&bdi->wb, &wbc); - pages_written += write_chunk - wbc.nr_to_write; - trace_wbc_balance_dirty_written(&wbc, bdi); + writeback_inodes_wb(&bdi->wb, &work); + pages_written += write_chunk - work.nr_pages; if (pages_written >= write_chunk) break; /* We've done our duty */ } - trace_wbc_balance_dirty_wait(&wbc, bdi); __set_current_state(TASK_UNINTERRUPTIBLE); io_schedule_timeout(pause); --- linux-next.orig/include/linux/writeback.h 2011-05-04 21:30:41.000000000 +0800 +++ linux-next/include/linux/writeback.h 2011-05-04 21:43:04.000000000 +0800 @@ -7,6 +7,15 @@ #include <linux/sched.h> #include <linux/fs.h> +/* + * The maximum number of pages to writeout in a single bdi flush/kupdate + * operation. We do this so we don't hold I_SYNC against an inode for + * enormous amounts of time, which would block a userspace task which has + * been forced to throttle against that inode. Also, the code reevaluates + * the dirty each time it has written this many pages. 
+ */ +#define MAX_WRITEBACK_PAGES 1024 + struct backing_dev_info; /* @@ -29,7 +38,6 @@ struct writeback_control { long nr_to_write; /* Write this many pages, and decrement this for each page written */ long pages_skipped; /* Pages which were not written */ - long inodes_cleaned; /* # of inodes cleaned */ /* * For a_ops->writepages(): is start or end are non-zero then this is @@ -49,6 +57,23 @@ struct writeback_control { }; /* + * Passed into wb_writeback(), essentially a subset of writeback_control + */ +struct wb_writeback_work { + long nr_pages; + struct super_block *sb; + enum writeback_sync_modes sync_mode; + unsigned long *older_than_this; + unsigned int tagged_sync:1; + unsigned int for_kupdate:1; + unsigned int range_cyclic:1; + unsigned int for_background:1; + + struct list_head list; /* pending work list */ + struct completion *done; /* set if the caller waits */ +}; + +/* * fs/fs-writeback.c */ struct bdi_writeback; @@ -59,7 +84,7 @@ int writeback_inodes_sb_if_idle(struct s int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr); void sync_inodes_sb(struct super_block *); void writeback_inodes_wb(struct bdi_writeback *wb, - struct writeback_control *wbc); + struct wb_writeback_work *work); long wb_do_writeback(struct bdi_writeback *wb, int force_wait); void wakeup_flusher_threads(long nr_pages); ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/6] writeback: make nr_to_write a per-file limit 2011-05-04 15:51 ` Wu Fengguang @ 2011-05-04 16:18 ` Christoph Hellwig 2011-05-05 10:47 ` Wu Fengguang 0 siblings, 1 reply; 22+ messages in thread From: Christoph Hellwig @ 2011-05-04 16:18 UTC (permalink / raw) To: Wu Fengguang Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML, linux-fsdevel@vger.kernel.org > - move MAX_WRITEBACK_PAGES and wb_writeback_work definitions to writeback.h I think it would be a good idea to keep this in fs/fs-writeback.c, which means we'd need a small writeback_inodes_wb wrapper for balance_dirty_pages and bdi_flush_io. But IIRC your tree already has __writeback_inodes_wb for use in wb_writeback, so writeback_inodes_wb could be that wrapper. > + long write_chunk = MAX_WRITEBACK_PAGES; > + long wrote = 0; > + bool inode_cleaned = false; > + > + /* > + * WB_SYNC_ALL mode does livelock avoidance by syncing dirty > + * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX > + * here avoids calling into writeback_inodes_wb() more than once. > + * > + * The intended call sequence for WB_SYNC_ALL writeback is: > + * > + * wb_writeback() > + * writeback_sb_inodes() <== called only once > + * write_cache_pages() <== called once for each inode > + * (quickly) tag currently dirty pages > + * (maybe slowly) sync all tagged pages > + */ > + if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync) > + write_chunk = LONG_MAX; I think this would be easier to read if kept as an if/else clause with the MAX_WRITEBACK_PAGES usage. > + write_chunk = min(write_chunk, work->nr_pages); Or in fact done here - for the WB_SYNC_ALL case LONG_MAX should always be larger than work->nr_pages, so the whole thing could be simplified to: if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync) write_chunk = LONG_MAX; else write_chunk = min(MAX_WRITEBACK_PAGES, work->nr_pages); Other notes: - older_than_this in writeback_control shouldn't be needed anymore - is the early return for the mis-matching sb in writeback_sb_inodes handled correctly? Before it had the special 0 return value, and I'm not quite sure how that fits into your new enum scheme. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/6] writeback: make nr_to_write a per-file limit 2011-05-04 16:18 ` Christoph Hellwig @ 2011-05-05 10:47 ` Wu Fengguang 0 siblings, 0 replies; 22+ messages in thread From: Wu Fengguang @ 2011-05-05 10:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML, linux-fsdevel@vger.kernel.org On Thu, May 05, 2011 at 12:18:16AM +0800, Christoph Hellwig wrote: > > - move MAX_WRITEBACK_PAGES and wb_writeback_work definitions to writeback.h > > I think it would be a good idea to keep this in fs/fs-writeback.c, which > means we'd need a small writeback_inodes_wb wrapper for > balance_dirty_pages and bdi_flush_io. But IIRC your tree already has > __writeback_inodes_wb for use in wb_writeback, so writeback_inodes_wb > could be that wrapper. No problem. We can later move bdi_flush_io() to fs-writeback.c and kill the wbc=>work=>wbc double conversions once balance_dirty_pages() switches to IO-less. > > + long write_chunk = MAX_WRITEBACK_PAGES; > > + long wrote = 0; > > + bool inode_cleaned = false; > > + > > + /* > > + * WB_SYNC_ALL mode does livelock avoidance by syncing dirty > > + * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX > > + * here avoids calling into writeback_inodes_wb() more than once. > > + * > > + * The intended call sequence for WB_SYNC_ALL writeback is: > > + * > > + * wb_writeback() > > + * writeback_sb_inodes() <== called only once > > + * write_cache_pages() <== called once for each inode > > + * (quickly) tag currently dirty pages > > + * (maybe slowly) sync all tagged pages > > + */ > > + if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync) > > + write_chunk = LONG_MAX; > > I think this would be easier to read if kept as an if/else clause > with the MAX_WRITEBACK_PAGES usage. > > > + write_chunk = min(write_chunk, work->nr_pages); > > Or in fact done here - for the WB_SYNC_ALL case LONG_MAX should > always be larger than work->nr_pages, so the whole thing could be > simplified to: > > if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync) > write_chunk = LONG_MAX; > else > write_chunk = min(MAX_WRITEBACK_PAGES, work->nr_pages); OK, I created a standalone function writeback_chunk_size() for that. Otherwise the comment block would look very weird deep inside the function and deeply indented. > > Other notes: > > - older_than_this in writeback_control shouldn't be needed anymore Yes. Good catch! > - is the early return for the mis-matching sb in writeback_sb_inodes > handled correctly? Before it had the special 0 return value, and > I'm not quite sure how that fits into your new enum scheme. Good question. This is the tricky point of the patch. The original 0/1 return values are to tell __writeback_inodes_wb() whether it's done with the current batch. Now __writeback_inodes_wb() does the test itself (see the v2 patch below). The v1 patch does have a tricky problem regarding the early return: if there are two super blocks that return WROTE_SOME_PAGES and WROTE_NOTHING in turn, __writeback_inodes_wb() will wrongly pick up the latter as the final return value. I updated the patch to accumulate the number of pages/inodes cleaned over multiple writeback_sb_inodes() invocations inside __writeback_inodes_wb(). This way it can reliably return to wb_writeback() to check the various termination conditions once enough progress has been made. 
Thanks, Fengguang --- Subject: writeback: make writeback_control.nr_to_write straight Date: Wed May 04 19:54:37 CST 2011 As suggested by Christoph: Pass struct wb_writeback_work all the way down to writeback_sb_inodes(), and initialize the struct writeback_control there. struct writeback_control is basically designed to control writeback of a single file, but we keep abusing it for writing multiple files in writeback_sb_inodes() and its callers. It immediately cleans things up; e.g. suddenly wbc.nr_to_write vs work->nr_pages starts to make sense, and instead of saving and restoring pages_skipped in writeback_sb_inodes, it can always start with a clean zero value. It also makes a neat IO pattern change: large dirty files are now written in the full 4MB writeback chunk size, rather than whatever quota remains in wbc->nr_to_write. change set: - move writeback_control from wb_writeback() down to writeback_sb_inodes() - change return value of writeback_sb_inodes()/__writeback_inodes_wb() to the number of pages and/or inodes written - move writeback_control.older_than_this to struct wb_writeback_work - remove writeback_control.inodes_cleaned - remove wbc_writeback_* trace events for now CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/btrfs/extent_io.c | 2 fs/fs-writeback.c | 184 +++++++++++++++-------------- include/linux/writeback.h | 3 include/trace/events/writeback.h | 6 mm/backing-dev.c | 1 mm/page-writeback.c | 1 6 files changed, 99 insertions(+), 98 deletions(-) --- linux-next.orig/fs/fs-writeback.c 2011-05-05 18:24:21.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2011-05-05 18:24:53.000000000 +0800 @@ -30,11 +30,21 @@ #include "internal.h" /* + * The maximum number of pages to writeout in a single bdi flush/kupdate + * operation. We do this so we don't hold I_SYNC against an inode for + * enormous amounts of time, which would block a userspace task which has + * been forced to throttle against that inode. Also, the code reevaluates + * the dirty each time it has written this many pages. + */ +#define MAX_WRITEBACK_PAGES 1024L + +/* * Passed into wb_writeback(), essentially a subset of writeback_control */ struct wb_writeback_work { long nr_pages; struct super_block *sb; + unsigned long *older_than_this; enum writeback_sync_modes sync_mode; unsigned int tagged_sync:1; unsigned int for_kupdate:1; @@ -463,7 +473,6 @@ writeback_single_inode(struct inode *ino * No need to add it back to the LRU. */ list_del_init(&inode->i_wb_list); - wbc->inodes_cleaned++; } } inode_sync_complete(inode); @@ -496,6 +505,31 @@ static bool pin_sb_for_writeback(struct return false; } +static long writeback_chunk_size(struct wb_writeback_work *work) +{ + long pages; + + /* + * WB_SYNC_ALL mode does livelock avoidance by syncing dirty + * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX + * here avoids calling into writeback_inodes_wb() more than once. + * + * The intended call sequence for WB_SYNC_ALL writeback is: + * + * wb_writeback() + * writeback_sb_inodes() <== called only once + * write_cache_pages() <== called once for each inode + * (quickly) tag currently dirty pages + * (maybe slowly) sync all tagged pages + */ + if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync) + pages = LONG_MAX; + else + pages = min(MAX_WRITEBACK_PAGES, work->nr_pages); + + return pages; +} + /* * Write a portion of b_io inodes which belong to @sb. * * If @only_this_sb is true, then find and write all such * inodes. 
Otherwise write only ones which go sequentially * in reverse order. * - * Return 1, if the caller writeback routine should be - * interrupted. Otherwise return 0. + * Return the number of pages and/or inodes cleaned. */ -static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb, - struct writeback_control *wbc, bool only_this_sb) +static long writeback_sb_inodes(struct super_block *sb, + struct bdi_writeback *wb, + struct wb_writeback_work *work) { + struct writeback_control wbc = { + .sync_mode = work->sync_mode, + .tagged_sync = work->tagged_sync, + .for_kupdate = work->for_kupdate, + .for_background = work->for_background, + .range_cyclic = work->range_cyclic, + .range_start = 0, + .range_end = LLONG_MAX, + }; + long write_chunk; + long wrote = 0; /* count both pages and inodes */ + while (!list_empty(&wb->b_io)) { - long pages_skipped; struct inode *inode = wb_inode(wb->b_io.prev); if (inode->i_sb != sb) { - if (only_this_sb) { + if (work->sb) { /* * We only want to write back data for this * superblock, move all inodes not belonging @@ -529,7 +574,7 @@ static int writeback_sb_inodes(struct su * Bounce back to the caller to unpin this and * pin the next superblock. */ - return 0; + break; } /* @@ -543,34 +588,40 @@ static int writeback_sb_inodes(struct su requeue_io(inode, wb); continue; } - __iget(inode); + write_chunk = writeback_chunk_size(work); + wbc.nr_to_write = write_chunk; + wbc.pages_skipped = 0; + + writeback_single_inode(inode, wb, &wbc); - pages_skipped = wbc->pages_skipped; - writeback_single_inode(inode, wb, wbc); - if (wbc->pages_skipped != pages_skipped) { + work->nr_pages -= write_chunk - wbc.nr_to_write; + wrote += write_chunk - wbc.nr_to_write; + if (wbc.pages_skipped) { /* * writeback is not making progress due to locked * buffers. Skip this inode for now. */ redirty_tail(inode, wb); - } + } else if (!(inode->i_state & I_DIRTY)) + wrote++; spin_unlock(&inode->i_lock); spin_unlock(&wb->list_lock); iput(inode); cond_resched(); spin_lock(&wb->list_lock); - if (wbc->nr_to_write <= 0) - return 1; + if (wrote >= MAX_WRITEBACK_PAGES) + break; + if (work->nr_pages <= 0) + break; } - /* b_io is empty */ - return 1; + return wrote; } -static void __writeback_inodes_wb(struct bdi_writeback *wb, - struct writeback_control *wbc) +static long __writeback_inodes_wb(struct bdi_writeback *wb, + struct wb_writeback_work *work) { - int ret = 0; + long wrote = 0; while (!list_empty(&wb->b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); @@ -580,33 +631,35 @@ static void __writeback_inodes_wb(struct requeue_io(inode, wb); continue; } - ret = writeback_sb_inodes(sb, wb, wbc, false); + wrote += writeback_sb_inodes(sb, wb, work); drop_super(sb); - if (ret) + if (wrote >= MAX_WRITEBACK_PAGES) + break; + if (work->nr_pages <= 0) break; } /* Leave any unwritten inodes on b_io */ + return wrote; } void writeback_inodes_wb(struct bdi_writeback *wb, - struct writeback_control *wbc) + struct writeback_control *wbc) { + struct wb_writeback_work work = { + .nr_pages = wbc->nr_to_write, + .sync_mode = wbc->sync_mode, + .range_cyclic = wbc->range_cyclic, + }; + spin_lock(&wb->list_lock); if (list_empty(&wb->b_io)) - queue_io(wb, wbc->older_than_this); - __writeback_inodes_wb(wb, wbc); + queue_io(wb, NULL); + __writeback_inodes_wb(wb, &work); spin_unlock(&wb->list_lock); -} -/* - * The maximum number of pages to writeout in a single bdi flush/kupdate - * operation. 
We do this so we don't hold I_SYNC against an inode for - * enormous amounts of time, which would block a userspace task which has - * been forced to throttle against that inode. Also, the code reevaluates - * the dirty each time it has written this many pages. - */ -#define MAX_WRITEBACK_PAGES 1024 + wbc->nr_to_write = work.nr_pages; +} static inline bool over_bground_thresh(void) { @@ -636,42 +689,13 @@ static inline bool over_bground_thresh(v static long wb_writeback(struct bdi_writeback *wb, struct wb_writeback_work *work) { - struct writeback_control wbc = { - .sync_mode = work->sync_mode, - .tagged_sync = work->tagged_sync, - .older_than_this = NULL, - .for_kupdate = work->for_kupdate, - .for_background = work->for_background, - .range_cyclic = work->range_cyclic, - }; + long nr_pages = work->nr_pages; unsigned long oldest_jif; - long wrote = 0; - long write_chunk = MAX_WRITEBACK_PAGES; struct inode *inode; - - if (!wbc.range_cyclic) { - wbc.range_start = 0; - wbc.range_end = LLONG_MAX; - } - - /* - * WB_SYNC_ALL mode does livelock avoidance by syncing dirty - * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX - * here avoids calling into writeback_inodes_wb() more than once. - * - * The intended call sequence for WB_SYNC_ALL writeback is: - * - * wb_writeback() - * writeback_sb_inodes() <== called only once - * write_cache_pages() <== called once for each inode - * (quickly) tag currently dirty pages - * (maybe slowly) sync all tagged pages - */ - if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync) - write_chunk = LONG_MAX; + long progress; oldest_jif = jiffies; - wbc.older_than_this = &oldest_jif; + work->older_than_this = &oldest_jif; for (;;) { /* @@ -700,27 +724,18 @@ static long wb_writeback(struct bdi_writ if (work->for_kupdate || work->for_background) { oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10); - wbc.older_than_this = &oldest_jif; + work->older_than_this = &oldest_jif; } - wbc.nr_to_write = write_chunk; - wbc.pages_skipped = 0; - wbc.inodes_cleaned = 0; - retry: - trace_wbc_writeback_start(&wbc, wb->bdi); spin_lock(&wb->list_lock); if (list_empty(&wb->b_io)) - queue_io(wb, wbc->older_than_this); + queue_io(wb, work->older_than_this); if (work->sb) - writeback_sb_inodes(work->sb, wb, &wbc, true); + progress = writeback_sb_inodes(work->sb, wb, work); else - __writeback_inodes_wb(wb, &wbc); + progress = __writeback_inodes_wb(wb, work); spin_unlock(&wb->list_lock); - trace_wbc_writeback_written(&wbc, wb->bdi); - - work->nr_pages -= write_chunk - wbc.nr_to_write; - wrote += write_chunk - wbc.nr_to_write; /* * Did we write something? Try for more @@ -730,9 +745,7 @@ retry: * mean the overall work is done. So we keep looping as long * as made some progress on cleaning pages or inodes. */ - if (wbc.nr_to_write < write_chunk) - continue; - if (wbc.inodes_cleaned) + if (progress) continue; /* * background writeback will start with expired inodes, and @@ -741,10 +754,10 @@ retry: * lists and cause trouble to the page reclaim. 
*/ if (work->for_background && - wbc.older_than_this && + work->older_than_this && list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) { - wbc.older_than_this = NULL; + work->older_than_this = NULL; goto retry; } /* @@ -760,7 +773,6 @@ retry: spin_lock(&wb->list_lock); if (!list_empty(&wb->b_more_io)) { inode = wb_inode(wb->b_more_io.prev); - trace_wbc_writeback_wait(&wbc, wb->bdi); spin_lock(&inode->i_lock); inode_wait_for_writeback(inode, wb); spin_unlock(&inode->i_lock); @@ -768,7 +780,7 @@ retry: spin_unlock(&wb->list_lock); } - return wrote; + return nr_pages - work->nr_pages; } /* --- linux-next.orig/include/linux/writeback.h 2011-05-05 18:24:21.000000000 +0800 +++ linux-next/include/linux/writeback.h 2011-05-05 18:24:53.000000000 +0800 @@ -24,12 +24,9 @@ enum writeback_sync_modes { */ struct writeback_control { enum writeback_sync_modes sync_mode; - unsigned long *older_than_this; /* If !NULL, only write back inodes - older than this */ long nr_to_write; /* Write this many pages, and decrement this for each page written */ long pages_skipped; /* Pages which were not written */ - long inodes_cleaned; /* # of inodes cleaned */ /* * For a_ops->writepages(): is start or end are non-zero then this is --- linux-next.orig/mm/backing-dev.c 2011-05-05 18:24:21.000000000 +0800 +++ linux-next/mm/backing-dev.c 2011-05-05 18:24:53.000000000 +0800 @@ -264,7 +264,6 @@ static void bdi_flush_io(struct backing_ { struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, - .older_than_this = NULL, .range_cyclic = 1, .nr_to_write = 1024, }; --- linux-next.orig/mm/page-writeback.c 2011-05-05 18:24:21.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-05-05 18:24:53.000000000 +0800 @@ -496,7 +496,6 @@ static void balance_dirty_pages(struct a for (;;) { struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, - .older_than_this = NULL, .nr_to_write = write_chunk, .range_cyclic = 1, }; --- linux-next.orig/include/trace/events/writeback.h 2011-05-05 18:24:21.000000000 +0800 +++ linux-next/include/trace/events/writeback.h 2011-05-05 18:24:53.000000000 +0800 @@ -101,7 +101,6 @@ DECLARE_EVENT_CLASS(wbc_class, __field(int, for_background) __field(int, for_reclaim) __field(int, range_cyclic) - __field(unsigned long, older_than_this) __field(long, range_start) __field(long, range_end) ), @@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class, __entry->for_background = wbc->for_background; __entry->for_reclaim = wbc->for_reclaim; __entry->range_cyclic = wbc->range_cyclic; - __entry->older_than_this = wbc->older_than_this ? 
- *wbc->older_than_this : 0; __entry->range_start = (long)wbc->range_start; __entry->range_end = (long)wbc->range_end; ), TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d " - "bgrd=%d reclm=%d cyclic=%d older=0x%lx " + "bgrd=%d reclm=%d cyclic=%d " "start=0x%lx end=0x%lx", __entry->name, __entry->nr_to_write, @@ -132,7 +129,6 @@ DECLARE_EVENT_CLASS(wbc_class, __entry->for_background, __entry->for_reclaim, __entry->range_cyclic, - __entry->older_than_this, __entry->range_start, __entry->range_end) ) --- linux-next.orig/fs/btrfs/extent_io.c 2011-05-05 18:24:21.000000000 +0800 +++ linux-next/fs/btrfs/extent_io.c 2011-05-05 18:24:53.000000000 +0800 @@ -2587,7 +2587,6 @@ int extent_write_full_page(struct extent }; struct writeback_control wbc_writepages = { .sync_mode = wbc->sync_mode, - .older_than_this = NULL, .nr_to_write = 64, .range_start = page_offset(page) + PAGE_CACHE_SIZE, .range_end = (loff_t)-1, @@ -2620,7 +2619,6 @@ int extent_write_locked_range(struct ext }; struct writeback_control wbc_writepages = { .sync_mode = mode, - .older_than_this = NULL, .nr_to_write = nr_pages * 2, .range_start = start, .range_end = end + 1, ^ permalink raw reply [flat|nested] 22+ messages in thread
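The v1 versus v2 return-value handling discussed above can be contrasted with a made-up two-superblock round (the page counts are invented for illustration):

	/*
	 * b_io holds inodes from sb A, then sb B, so one
	 * __writeback_inodes_wb() round calls writeback_sb_inodes() twice:
	 *
	 *   sb A: writes 500 pages          -> 500
	 *   sb B: all inodes locked/busy    -> 0
	 *
	 * v1 kept only the last status, so the round looked like
	 * WROTE_NOTHING and wb_writeback() could bail out too early.
	 * v2 sums the per-sb results (wrote == 500), so wb_writeback()
	 * sees progress and keeps looping.
	 */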
* [PATCH 4/6] writeback: trace event writeback_single_inode 2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang ` (2 preceding siblings ...) 2011-05-04 9:17 ` [PATCH 3/6] writeback: make nr_to_write a per-file limit Wu Fengguang @ 2011-05-04 9:17 ` Wu Fengguang 2011-05-04 9:17 ` [PATCH 5/6] writeback: trace event writeback_queue_io Wu Fengguang ` (2 subsequent siblings) 6 siblings, 0 replies; 22+ messages in thread From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw) To: Andrew Morton; +Cc: Jan Kara, Wu Fengguang, Dave Chinner, LKML, linux-fsdevel [-- Attachment #1: writeback-trace-writeback_single_inode.patch --] [-- Type: text/plain, Size: 3776 bytes --] It is valuable to know how the dirty inodes are iterated over and what IO size each of them gets. "writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 wrote=0 to_write=1024" - "state" reflects inode->i_state at the end of writeback_single_inode() - "index" reflects mapping->writeback_index after the ->writepages() call - "wrote" is the number of pages written in this writeback_single_inode() - "to_write" is the remaining wbc->nr_to_write Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/fs-writeback.c | 12 +++--- include/trace/events/writeback.h | 56 +++++++++++++++++++++++++++++ 2 files changed, 63 insertions(+), 5 deletions(-) --- linux-next.orig/fs/fs-writeback.c 2011-05-04 16:01:00.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2011-05-04 16:26:34.000000000 +0800 @@ -347,7 +347,7 @@ writeback_single_inode(struct inode *ino { struct address_space *mapping = inode->i_mapping; long per_file_limit = wbc->per_file_limit; - long uninitialized_var(nr_to_write); + long nr_to_write = wbc->nr_to_write; unsigned dirty; int ret; @@ -370,7 +370,8 @@ writeback_single_inode(struct inode *ino */ if (wbc->sync_mode != WB_SYNC_ALL) { requeue_io(inode, wb); - return 0; + ret = 0; + goto out; } /* @@ -387,10 +388,8 @@ writeback_single_inode(struct inode *ino spin_unlock(&inode->i_lock); spin_unlock(&wb->list_lock); - if (per_file_limit) { - nr_to_write = wbc->nr_to_write; + if (per_file_limit) wbc->nr_to_write = per_file_limit; - } ret = do_writepages(mapping, wbc); @@ -477,6 +476,9 @@ writeback_single_inode(struct inode *ino } } inode_sync_complete(inode); +out: + trace_writeback_single_inode(inode, wbc, + nr_to_write - wbc->nr_to_write); return ret; } --- linux-next.orig/include/trace/events/writeback.h 2011-05-04 15:59:23.000000000 +0800 +++ linux-next/include/trace/events/writeback.h 2011-05-04 16:27:18.000000000 +0800 @@ -8,6 +8,19 @@ #include <linux/device.h> #include <linux/writeback.h> +#define show_inode_state(state) \ + __print_flags(state, "|", \ + {I_DIRTY_SYNC, "I_DIRTY_SYNC"}, \ + {I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \ + {I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \ + {I_NEW, "I_NEW"}, \ + {I_WILL_FREE, "I_WILL_FREE"}, \ + {I_FREEING, "I_FREEING"}, \ + {I_CLEAR, "I_CLEAR"}, \ + {I_SYNC, "I_SYNC"}, \ + {I_REFERENCED, "I_REFERENCED"} \ + ) + struct wb_writeback_work; DECLARE_EVENT_CLASS(writeback_work_class, @@ -187,6 +200,49 @@ DEFINE_EVENT(writeback_congest_waited_te TP_ARGS(usec_timeout, usec_delayed) ); +TRACE_EVENT(writeback_single_inode, + + TP_PROTO(struct inode *inode, + struct writeback_control *wbc, + unsigned long wrote + ), + + TP_ARGS(inode, wbc, wrote), + + TP_STRUCT__entry( + __array(char, name, 32) + __field(unsigned long, ino) + __field(unsigned long, state) + __field(unsigned long, age) + __field(unsigned long, writeback_index) + __field(unsigned long, wrote) + __field(long, nr_to_write) + ), + + TP_fast_assign( + strncpy(__entry->name, + dev_name(inode->i_mapping->backing_dev_info->dev), 32); + __entry->ino = inode->i_ino; + __entry->state = inode->i_state; + __entry->age = (jiffies - inode->dirtied_when) * + 1000 / HZ; + __entry->writeback_index = inode->i_mapping->writeback_index; + __entry->wrote = wrote; + __entry->nr_to_write = wbc->nr_to_write; + ), + + TP_printk("bdi %s: ino=%lu state=%s age=%lu " + "index=%lu wrote=%lu to_write=%ld", + __entry->name, + __entry->ino, + show_inode_state(__entry->state), + __entry->age, + __entry->writeback_index, + __entry->wrote, + __entry->nr_to_write + ) +); + #endif /* _TRACE_WRITEBACK_H */ /* This part must be outside protection */ ^ permalink raw reply [flat|nested] 22+ messages in thread
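Assuming debugfs is mounted in the usual place, the new event can be exercised like any other tracepoint, e.g.:

	# cd /sys/kernel/debug/tracing
	# echo 1 > events/writeback/writeback_single_inode/enable
	# cat trace_pipe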
* [PATCH 5/6] writeback: trace event writeback_queue_io 2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang ` (3 preceding siblings ...) 2011-05-04 9:17 ` [PATCH 4/6] writeback: trace event writeback_single_inode Wu Fengguang @ 2011-05-04 9:17 ` Wu Fengguang 2011-05-05 16:37 ` [PATCH 5/6] writeback: trace event writeback_queue_io (v2) Wu Fengguang 2011-05-04 9:17 ` [PATCH 6/6] writeback: convert to relative older_than_this in trace events Wu Fengguang 2011-05-04 9:46 ` [PATCH 0/6] writeback fixes and " Christoph Hellwig 6 siblings, 1 reply; 22+ messages in thread From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw) To: Andrew Morton; +Cc: Jan Kara, Wu Fengguang, Dave Chinner, LKML, linux-fsdevel [-- Attachment #1: writeback-trace-queue_io.patch --] [-- Type: text/plain, Size: 3305 bytes --] Note that it adds a little overhead to account for the moved/enqueued inodes from b_dirty to b_io. The "moved" accounting may later be used to limit the number of inodes that can be moved in one shot, in order to keep spinlock hold time under control. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/fs-writeback.c | 16 +++++++++++----- include/trace/events/writeback.h | 23 +++++++++++++++++++++++ 2 files changed, 34 insertions(+), 5 deletions(-) --- linux-next.orig/fs/fs-writeback.c 2011-05-04 16:01:27.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2011-05-04 16:09:01.000000000 +0800 @@ -248,15 +248,16 @@ static bool inode_dirtied_after(struct i /* * Move expired dirty inodes from @delaying_queue to @dispatch_queue. */ -static void move_expired_inodes(struct list_head *delaying_queue, - struct list_head *dispatch_queue, - struct writeback_control *wbc) +static int move_expired_inodes(struct list_head *delaying_queue, + struct list_head *dispatch_queue, + struct writeback_control *wbc) { LIST_HEAD(tmp); struct list_head *pos, *node; struct super_block *sb = NULL; struct inode *inode; int do_sb_sort = 0; + int moved = 0; while (!list_empty(delaying_queue)) { inode = wb_inode(delaying_queue->prev); @@ -267,12 +268,13 @@ static void move_expired_inodes(struct l do_sb_sort = 1; sb = inode->i_sb; list_move(&inode->i_wb_list, &tmp); + moved++; } /* just one sb in list, splice to dispatch_queue and we're done */ if (!do_sb_sort) { list_splice(&tmp, dispatch_queue); - return; + goto out; } /* Move inodes from one superblock together */ @@ -284,6 +286,8 @@ static void move_expired_inodes(struct l list_move(&inode->i_wb_list, dispatch_queue); } } +out: + return moved; } /* @@ -299,9 +303,11 @@ static void move_expired_inodes(struct l */ static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc) { + int moved; assert_spin_locked(&wb->list_lock); list_splice_init(&wb->b_more_io, &wb->b_io); - move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc); + moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc); + trace_writeback_queue_io(wb, wbc, moved); } static int write_inode(struct inode *inode, struct writeback_control *wbc) --- linux-next.orig/include/trace/events/writeback.h 2011-05-04 16:06:50.000000000 +0800 +++ linux-next/include/trace/events/writeback.h 2011-05-04 16:08:58.000000000 +0800 @@ -165,6 +165,29 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_writt DEFINE_WBC_EVENT(wbc_balance_dirty_wait); DEFINE_WBC_EVENT(wbc_writepage); +TRACE_EVENT(writeback_queue_io, + TP_PROTO(struct bdi_writeback *wb, + struct writeback_control *wbc, + int moved), + TP_ARGS(wb, wbc, moved), + TP_STRUCT__entry( + __array(char, name, 32) + __field(int, older) + __field(int, moved) + ), 
+ TP_fast_assign( + strncpy(__entry->name, dev_name(wb->bdi->dev), 32); + __entry->older = wbc->older_than_this ? + (jiffies - *wbc->older_than_this) * 1000 / HZ + : -1; + __entry->moved = moved; + ), + TP_printk("bdi %s: older=%d enqueue=%d", + __entry->name, + __entry->older, + __entry->moved) +); + DECLARE_EVENT_CLASS(writeback_congest_waited_template, TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed), ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 5/6] writeback: trace event writeback_queue_io (v2) 2011-05-04 9:17 ` [PATCH 5/6] writeback: trace event writeback_queue_io Wu Fengguang @ 2011-05-05 16:37 ` Wu Fengguang 2011-05-05 17:26 ` Jan Kara 0 siblings, 1 reply; 22+ messages in thread From: Wu Fengguang @ 2011-05-05 16:37 UTC (permalink / raw) To: Andrew Morton; +Cc: Jan Kara, Dave Chinner, LKML, linux-fsdevel@vger.kernel.org [changes from v1: due to concerns about the relative older_than_this, I added the raw older_than_this to the trace output. Hope the added overhead is not a big concern.] Note that it adds a little overhead to account for the moved/enqueued inodes from b_dirty to b_io. The "moved" accounting may later be used to limit the number of inodes that can be moved in one shot, in order to keep spinlock hold time under control. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/fs-writeback.c | 14 ++++++++++---- include/trace/events/writeback.h | 25 +++++++++++++++++++++++++ 2 files changed, 35 insertions(+), 4 deletions(-) --- linux-next.orig/fs/fs-writeback.c 2011-05-05 23:53:33.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2011-05-05 23:53:34.000000000 +0800 @@ -258,15 +258,16 @@ static bool inode_dirtied_after(struct i /* * Move expired dirty inodes from @delaying_queue to @dispatch_queue. */ -static void move_expired_inodes(struct list_head *delaying_queue, +static int move_expired_inodes(struct list_head *delaying_queue, struct list_head *dispatch_queue, - unsigned long *older_than_this) + unsigned long *older_than_this) { LIST_HEAD(tmp); struct list_head *pos, *node; struct super_block *sb = NULL; struct inode *inode; int do_sb_sort = 0; + int moved = 0; while (!list_empty(delaying_queue)) { inode = wb_inode(delaying_queue->prev); @@ -277,12 +278,13 @@ static void move_expired_inodes(struct l do_sb_sort = 1; sb = inode->i_sb; list_move(&inode->i_wb_list, &tmp); + moved++; } /* just one sb in list, splice to dispatch_queue and we're done */ if (!do_sb_sort) { list_splice(&tmp, dispatch_queue); - return; + goto out; } /* Move inodes from one superblock together */ @@ -294,6 +296,8 @@ static void move_expired_inodes(struct l list_move(&inode->i_wb_list, dispatch_queue); } } +out: + return moved; } /* @@ -309,9 +313,11 @@ static void move_expired_inodes(struct l */ static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this) { + int moved; assert_spin_locked(&wb->list_lock); list_splice_init(&wb->b_more_io, &wb->b_io); - move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this); + moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this); + trace_writeback_queue_io(wb, older_than_this, moved); } static int write_inode(struct inode *inode, struct writeback_control *wbc) --- linux-next.orig/include/trace/events/writeback.h 2011-05-05 23:53:33.000000000 +0800 +++ linux-next/include/trace/events/writeback.h 2011-05-06 00:14:09.000000000 +0800 @@ -158,6 +158,31 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_writt DEFINE_WBC_EVENT(wbc_balance_dirty_wait); DEFINE_WBC_EVENT(wbc_writepage); +TRACE_EVENT(writeback_queue_io, + TP_PROTO(struct bdi_writeback *wb, + unsigned long *older_than_this, + int moved), + TP_ARGS(wb, older_than_this, moved), + TP_STRUCT__entry( + __array(char, name, 32) + __field(unsigned long, older) + __field(long, age) + __field(int, moved) + ), + TP_fast_assign( + strncpy(__entry->name, dev_name(wb->bdi->dev), 32); + __entry->older = older_than_this ? *older_than_this : 0; + __entry->age = older_than_this ? 
+ (jiffies - *older_than_this) * 1000 / HZ : -1; + __entry->moved = moved; + ), + TP_printk("bdi %s: older=%lu age=%ld enqueue=%d", + __entry->name, + __entry->older, /* older_than_this in jiffies */ + __entry->age, /* older_than_this in relative milliseconds */ + __entry->moved) +); + DECLARE_EVENT_CLASS(writeback_congest_waited_template, TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed), ^ permalink raw reply [flat|nested] 22+ messages in thread
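A note for readers post-processing this event: "older" is the raw older_than_this stamp in jiffies, while "age" is precomputed as (jiffies - *older_than_this) * 1000 / HZ. The following minimal user-space sketch shows the same conversion; it is illustrative only (not part of the patch), and the HZ value plus all sample values are assumptions that must match the traced kernel.

    #include <stdio.h>

    #define HZ 1000 /* assumption: must match the traced kernel's CONFIG_HZ */

    /*
     * Convert a raw "older" jiffies stamp into an age in milliseconds,
     * given the jiffies value "now" at the time of the event. The
     * unsigned subtraction stays correct across jiffies wraparound,
     * just like the kernel's time_after() helpers.
     */
    static long older_to_age_ms(unsigned long now, unsigned long older)
    {
            return (long)((now - older) * 1000 / HZ);
    }

    int main(void)
    {
            unsigned long now = 4294940000UL;    /* invented jiffies sample */
            unsigned long older = now - 30 * HZ; /* expired 30s before "now" */

            printf("age=%ldms\n", older_to_age_ms(now, older)); /* age=30000ms */
            return 0;
    }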
* Re: [PATCH 5/6] writeback: trace event writeback_queue_io (v2) 2011-05-05 16:37 ` [PATCH 5/6] writeback: trace event writeback_queue_io (v2) Wu Fengguang @ 2011-05-05 17:26 ` Jan Kara 0 siblings, 0 replies; 22+ messages in thread From: Jan Kara @ 2011-05-05 17:26 UTC (permalink / raw) To: Wu Fengguang Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML, linux-fsdevel@vger.kernel.org On Fri 06-05-11 00:37:30, Wu Fengguang wrote: > [changes from v1: due to concerns about the relative older_than_this, > I added the raw older_than_this to the trace output. Hope the added > overhead is not a big concern.] > > Note that it adds a little overhead to account for the moved/enqueued > inodes from b_dirty to b_io. The "moved" accounting may be later used to > limit the number of inodes that can be moved in one shot, in order to > keep spinlock hold time under control. This is fine by me. Honza > --- > fs/fs-writeback.c | 14 ++++++++++---- > include/trace/events/writeback.h | 25 +++++++++++++++++++++++++ > 2 files changed, 35 insertions(+), 4 deletions(-) > > --- linux-next.orig/fs/fs-writeback.c 2011-05-05 23:53:33.000000000 +0800 > +++ linux-next/fs/fs-writeback.c 2011-05-05 23:53:34.000000000 +0800 > @@ -258,15 +258,16 @@ static bool inode_dirtied_after(struct i > /* > * Move expired dirty inodes from @delaying_queue to @dispatch_queue. > */ > -static void move_expired_inodes(struct list_head *delaying_queue, > +static int move_expired_inodes(struct list_head *delaying_queue, > struct list_head *dispatch_queue, > - unsigned long *older_than_this) > + unsigned long *older_than_this) > { > LIST_HEAD(tmp); > struct list_head *pos, *node; > struct super_block *sb = NULL; > struct inode *inode; > int do_sb_sort = 0; > + int moved = 0; > > while (!list_empty(delaying_queue)) { > inode = wb_inode(delaying_queue->prev); > @@ -277,12 +278,13 @@ static void move_expired_inodes(struct l > do_sb_sort = 1; > sb = inode->i_sb; > list_move(&inode->i_wb_list, &tmp); > + moved++; > } > > /* just one sb in list, splice to dispatch_queue and we're done */ > if (!do_sb_sort) { > list_splice(&tmp, dispatch_queue); > - return; > + goto out; > } > > /* Move inodes from one superblock together */ > @@ -294,6 +296,8 @@ static void move_expired_inodes(struct l > list_move(&inode->i_wb_list, dispatch_queue); > } > } > +out: > + return moved; > } > > /* > @@ -309,9 +313,11 @@ static void move_expired_inodes(struct l > */ > static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this) > { > + int moved; > assert_spin_locked(&wb->list_lock); > list_splice_init(&wb->b_more_io, &wb->b_io); > - move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this); > + moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this); > + trace_writeback_queue_io(wb, older_than_this, moved); > } > > static int write_inode(struct inode *inode, struct writeback_control *wbc) > --- linux-next.orig/include/trace/events/writeback.h 2011-05-05 23:53:33.000000000 +0800 > +++ linux-next/include/trace/events/writeback.h 2011-05-06 00:14:09.000000000 +0800 > @@ -158,6 +158,31 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_writt > DEFINE_WBC_EVENT(wbc_balance_dirty_wait); > DEFINE_WBC_EVENT(wbc_writepage); > > +TRACE_EVENT(writeback_queue_io, > + TP_PROTO(struct bdi_writeback *wb, > + unsigned long *older_than_this, > + int moved), > + TP_ARGS(wb, older_than_this, moved), > + TP_STRUCT__entry( > + __array(char, name, 32) > + __field(unsigned long, older) > + __field(long, age) > + __field(int, moved) > + ), > + TP_fast_assign( > +
strncpy(__entry->name, dev_name(wb->bdi->dev), 32); > + __entry->older = older_than_this ? *older_than_this : 0; > + __entry->age = older_than_this ? > + (jiffies - *older_than_this) * 1000 / HZ : -1; > + __entry->moved = moved; > + ), > + TP_printk("bdi %s: older=%lu age=%ld enqueue=%d", > + __entry->name, > + __entry->older, /* older_than_this in jiffies */ > + __entry->age, /* older_than_this in relative milliseconds */ > + __entry->moved) > +); > + > DECLARE_EVENT_CLASS(writeback_congest_waited_template, > > TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed), -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
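The changelog's idea of later using the "moved" count to cap how many inodes are moved in one shot could look roughly like the sketch below. This is purely hypothetical: the limit parameter and the break policy are invented here, and the per-superblock sorting of the real move_expired_inodes() is omitted for brevity.

    /*
     * Hypothetical capped variant of move_expired_inodes(); not part of
     * this series. Stops after "limit" inodes so that wb->list_lock is
     * not held for too long on a huge b_dirty list.
     */
    static int move_expired_inodes_capped(struct list_head *delaying_queue,
                                          struct list_head *dispatch_queue,
                                          unsigned long *older_than_this,
                                          int limit)
    {
            struct inode *inode;
            int moved = 0;

            while (!list_empty(delaying_queue) && moved < limit) {
                    inode = wb_inode(delaying_queue->prev);
                    if (older_than_this &&
                        inode_dirtied_after(inode, *older_than_this))
                            break;  /* remaining inodes are not expired yet */
                    list_move(&inode->i_wb_list, dispatch_queue);
                    moved++;
            }
            return moved;
    }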
* [PATCH 6/6] writeback: convert to relative older_than_this in trace events 2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang ` (4 preceding siblings ...) 2011-05-04 9:17 ` [PATCH 5/6] writeback: trace event writeback_queue_io Wu Fengguang @ 2011-05-04 9:17 ` Wu Fengguang 2011-05-04 22:23 ` Jan Kara 2011-05-04 9:46 ` [PATCH 0/6] writeback fixes and " Christoph Hellwig 6 siblings, 1 reply; 22+ messages in thread From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw) To: Andrew Morton; +Cc: Jan Kara, Dave Chinner, Wu Fengguang, LKML, linux-fsdevel [-- Attachment #1: writeback-trace-older.patch --] [-- Type: text/plain, Size: 1897 bytes --] This is a format change that could break established scripts. But hopefully it's more human friendly to show the relative values like "30000" (30s old) or "0" (all fresh ones) than long absolute jiffy numbers. Note that the initial 30000 or 0 may increase over time as they are now relative numbers. CC: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- include/trace/events/writeback.h | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) --- linux-next.orig/include/trace/events/writeback.h 2011-05-04 16:09:59.000000000 +0800 +++ linux-next/include/trace/events/writeback.h 2011-05-04 16:10:57.000000000 +0800 @@ -115,7 +115,7 @@ DECLARE_EVENT_CLASS(wbc_class, __field(int, for_reclaim) __field(int, range_cyclic) __field(int, more_io) - __field(unsigned long, older_than_this) + __field(int, older) __field(long, range_start) __field(long, range_end) ), @@ -130,14 +130,15 @@ DECLARE_EVENT_CLASS(wbc_class, __entry->for_reclaim = wbc->for_reclaim; __entry->range_cyclic = wbc->range_cyclic; __entry->more_io = wbc->more_io; - __entry->older_than_this = wbc->older_than_this ? - *wbc->older_than_this : 0; + __entry->older = wbc->older_than_this ? + (jiffies - *wbc->older_than_this) * 1000 / HZ + : -1; __entry->range_start = (long)wbc->range_start; __entry->range_end = (long)wbc->range_end; ), TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d " - "bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx " + "bgrd=%d reclm=%d cyclic=%d more=%d older=%d " "start=0x%lx end=0x%lx", __entry->name, __entry->nr_to_write, @@ -148,7 +149,7 @@ DECLARE_EVENT_CLASS(wbc_class, __entry->for_reclaim, __entry->range_cyclic, __entry->more_io, - __entry->older_than_this, + __entry->older, __entry->range_start, __entry->range_end) ) ^ permalink raw reply [flat|nested] 22+ messages in thread
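For illustration, with invented field values, a wbc_* line printed by this class would change roughly as follows (the field layout is taken from the TP_printk format strings above):

    before: bdi 8:0: towrt=1024 skip=0 mode=0 kupd=1 bgrd=0 reclm=0 cyclic=1 more=0 older=0x34e75 start=0x0 end=0x0
    after:  bdi 8:0: towrt=1024 skip=0 mode=0 kupd=1 bgrd=0 reclm=0 cyclic=1 more=0 older=30000 start=0x0 end=0x0

That is, older goes from an absolute jiffies stamp in hex to a relative age in milliseconds, or -1 when wbc->older_than_this is NULL.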
* Re: [PATCH 6/6] writeback: convert to relative older_than_this in trace events 2011-05-04 9:17 ` [PATCH 6/6] writeback: convert to relative older_than_this in trace events Wu Fengguang @ 2011-05-04 22:23 ` Jan Kara 2011-05-05 12:33 ` Wu Fengguang 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2011-05-04 22:23 UTC (permalink / raw) To: Wu Fengguang; +Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML, linux-fsdevel On Wed 04-05-11 17:17:13, Wu Fengguang wrote: > This is a format change that could break established scripts. > > But hopefully it's more human friendly to show the relative values like > "30000" (30s old) or "0" (all fresh ones) than long absolute jiffy > numbers. It's possibly more human friendly but probably also harder to handle in scripts. I especially don't like the fact that the number will be changing at different tracepoints - you'll then have to do some comparisons to find out whether the timestamp itself has changed or whether time has simply passed. It's not that hard to write a script doing absolute->relative conversion (since you have time stamps in the traces) for those cases where relative time is really what interests you. So do you have any particular use case in mind for the relative value? Honza > > Note that the initial 30000 or 0 may increase over time as they are now > relative numbers. > > CC: Jan Kara <jack@suse.cz> > CC: Dave Chinner <david@fromorbit.com> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > include/trace/events/writeback.h | 11 ++++++----- > 1 file changed, 6 insertions(+), 5 deletions(-) > > --- linux-next.orig/include/trace/events/writeback.h 2011-05-04 16:09:59.000000000 +0800 > +++ linux-next/include/trace/events/writeback.h 2011-05-04 16:10:57.000000000 +0800 > @@ -115,7 +115,7 @@ DECLARE_EVENT_CLASS(wbc_class, > __field(int, for_reclaim) > __field(int, range_cyclic) > __field(int, more_io) > - __field(unsigned long, older_than_this) > + __field(int, older) > __field(long, range_start) > __field(long, range_end) > ), > @@ -130,14 +130,15 @@ DECLARE_EVENT_CLASS(wbc_class, > __entry->for_reclaim = wbc->for_reclaim; > __entry->range_cyclic = wbc->range_cyclic; > __entry->more_io = wbc->more_io; > - __entry->older_than_this = wbc->older_than_this ? > - *wbc->older_than_this : 0; > + __entry->older = wbc->older_than_this ? > + (jiffies - *wbc->older_than_this) * 1000 / HZ > + : -1; > __entry->range_start = (long)wbc->range_start; > __entry->range_end = (long)wbc->range_end; > ), > > TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d " > - "bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx " > + "bgrd=%d reclm=%d cyclic=%d more=%d older=%d " > "start=0x%lx end=0x%lx", > __entry->name, > __entry->nr_to_write, > @@ -148,7 +149,7 @@ DECLARE_EVENT_CLASS(wbc_class, > __entry->for_reclaim, > __entry->range_cyclic, > __entry->more_io, > - __entry->older_than_this, > + __entry->older, > __entry->range_start, > __entry->range_end) > ) > > -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 6/6] writeback: convert to relative older_than_this in trace events 2011-05-04 22:23 ` Jan Kara @ 2011-05-05 12:33 ` Wu Fengguang 0 siblings, 0 replies; 22+ messages in thread From: Wu Fengguang @ 2011-05-05 12:33 UTC (permalink / raw) To: Jan Kara; +Cc: Andrew Morton, Dave Chinner, LKML, linux-fsdevel@vger.kernel.org On Thu, May 05, 2011 at 06:23:33AM +0800, Jan Kara wrote: > On Wed 04-05-11 17:17:13, Wu Fengguang wrote: > > This is a format change that could break established scripts. > > > > But hopefully it's more human friendly to show the relative values like > > "30000" (30s old) or "0" (all fresh ones) than long absolute jiffy > > numbers. > It's possibly more human friendly but probably also harder to handle in > scripts. I especially don't like the fact that the number will be changing > at different tracepoints - you'll then have to do some comparisons to find > out whether the timestamp itself has changed or whether time has > simply passed. It's not that hard to write a script doing absolute->relative > conversion (since you have time stamps in the traces) for those cases where > relative time is really what interests you. OK. > So do you have any particular use case in mind for the relative > value? I normally just look at the raw traces, in which case the raw jiffies mean nothing to me. Or sometimes I plot the relative dirty ages with scripts. Thanks, Fengguang > > Note that the initial 30000 or 0 may increase over time as they are now > > relative numbers. > > > > CC: Jan Kara <jack@suse.cz> > > CC: Dave Chinner <david@fromorbit.com> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > > --- > > include/trace/events/writeback.h | 11 ++++++----- > > 1 file changed, 6 insertions(+), 5 deletions(-) > > > > --- linux-next.orig/include/trace/events/writeback.h 2011-05-04 16:09:59.000000000 +0800 > > +++ linux-next/include/trace/events/writeback.h 2011-05-04 16:10:57.000000000 +0800 > > @@ -115,7 +115,7 @@ DECLARE_EVENT_CLASS(wbc_class, > > __field(int, for_reclaim) > > __field(int, range_cyclic) > > __field(int, more_io) > > - __field(unsigned long, older_than_this) > > + __field(int, older) > > __field(long, range_start) > > __field(long, range_end) > > ), > > @@ -130,14 +130,15 @@ DECLARE_EVENT_CLASS(wbc_class, > > __entry->for_reclaim = wbc->for_reclaim; > > __entry->range_cyclic = wbc->range_cyclic; > > __entry->more_io = wbc->more_io; > > - __entry->older_than_this = wbc->older_than_this ? > > - *wbc->older_than_this : 0; > > + __entry->older = wbc->older_than_this ? > > + (jiffies - *wbc->older_than_this) * 1000 / HZ > > + : -1; > > __entry->range_start = (long)wbc->range_start; > > __entry->range_end = (long)wbc->range_end; > > ), > > > > TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d " > > - "bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx " > > + "bgrd=%d reclm=%d cyclic=%d more=%d older=%d " > > "start=0x%lx end=0x%lx", > > __entry->name, > > __entry->nr_to_write, > > @@ -148,7 +149,7 @@ DECLARE_EVENT_CLASS(wbc_class, > > __entry->for_reclaim, > > __entry->range_cyclic, > > __entry->more_io, > > - __entry->older_than_this, > > + __entry->older, > > __entry->range_start, > > __entry->range_end) > > ) > > > > > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
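The absolute-to-relative post-processing Jan suggests can be sketched as below. Everything here is illustrative: HZ must match the traced kernel's CONFIG_HZ, the samples are invented, and parsing of real ftrace lines is omitted. The point is that age deltas between two events follow from the trace timestamps alone, which also answers whether the stamp moved or time merely passed:

    #include <stdio.h>

    #define HZ 1000 /* assumption: must match the traced kernel's CONFIG_HZ */

    struct sample {
            double ts;            /* ftrace timestamp, in seconds */
            unsigned long older;  /* absolute older_than_this, in jiffies */
    };

    /*
     * age(b) - age(a) = (jiffies(b) - jiffies(a)) - (b.older - a.older)
     *                 ~ (b.ts - a.ts) * HZ - (b.older - a.older)
     * A result of zero means older merely tracked the clock; a result
     * equal to the elapsed time means the stamp did not move at all.
     */
    static long age_delta_ms(const struct sample *a, const struct sample *b)
    {
            long dolder = (long)(b->older - a->older); /* signed jiffies delta */
            double delta_j = (b->ts - a->ts) * HZ - dolder;

            return (long)(delta_j * 1000.0 / HZ);
    }

    int main(void)
    {
            /* invented samples 5s apart with an unchanged older stamp */
            struct sample a = { 100.000000, 70000 };
            struct sample b = { 105.000000, 70000 };

            printf("age grew by %ld ms\n", age_delta_ms(&a, &b)); /* 5000 */
            return 0;
    }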
* Re: [PATCH 0/6] writeback fixes and trace events 2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang ` (5 preceding siblings ...) 2011-05-04 9:17 ` [PATCH 6/6] writeback: convert to relative older_than_this in trace events Wu Fengguang @ 2011-05-04 9:46 ` Christoph Hellwig 2011-05-04 9:56 ` Wu Fengguang 6 siblings, 1 reply; 22+ messages in thread From: Christoph Hellwig @ 2011-05-04 9:46 UTC (permalink / raw) To: Wu Fengguang Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig, LKML, linux-fsdevel On Wed, May 04, 2011 at 05:17:07PM +0800, Wu Fengguang wrote: > Andrew, > > Here are one more collection of simple writeback patches. How does this relate to the other floating writeback patches? We really need a proper writeback tree to make any sense of the patch queues.. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/6] writeback fixes and trace events 2011-05-04 9:46 ` [PATCH 0/6] writeback fixes and " Christoph Hellwig @ 2011-05-04 9:56 ` Wu Fengguang 2011-05-04 10:06 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Wu Fengguang @ 2011-05-04 9:56 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML, linux-fsdevel@vger.kernel.org On Wed, May 04, 2011 at 05:46:54PM +0800, Christoph Hellwig wrote: > On Wed, May 04, 2011 at 05:17:07PM +0800, Wu Fengguang wrote: > > Andrew, > > > > Here are one more collection of simple writeback patches. > > How does this relate to the other floating writeback patches? We > really need a proper writeback tree to make any sense of the patch > queues.. These patches are based on the writeback patches already in -mm. Sorry, I should have mentioned it. Thanks, Fengguang ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/6] writeback fixes and trace events 2011-05-04 9:56 ` Wu Fengguang @ 2011-05-04 10:06 ` Dave Chinner 2011-05-04 11:18 ` Christoph Hellwig 2011-05-04 13:12 ` Theodore Tso 0 siblings, 2 replies; 22+ messages in thread From: Dave Chinner @ 2011-05-04 10:06 UTC (permalink / raw) To: Wu Fengguang Cc: Christoph Hellwig, Andrew Morton, Jan Kara, LKML, linux-fsdevel@vger.kernel.org On Wed, May 04, 2011 at 05:56:37PM +0800, Wu Fengguang wrote: > On Wed, May 04, 2011 at 05:46:54PM +0800, Christoph Hellwig wrote: > > On Wed, May 04, 2011 at 05:17:07PM +0800, Wu Fengguang wrote: > > > Andrew, > > > > > > Here are one more collection of simple writeback patches. > > > > How does this relate to the other floating writeback patches? We > > really need a proper writeback tree to make any sense of the patch > > queues.. > > These patches are based on the writeback patches already in -mm. > Sorry, I should have mentioned it. Does anyone actually testing filesystems use the -mm tree? I'm pretty sure that no-one in the XFS world does, and I don't think that any ext4 or btrfs folk do, either.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/6] writeback fixes and trace events 2011-05-04 10:06 ` Dave Chinner @ 2011-05-04 11:18 ` Christoph Hellwig 0 siblings, 0 replies; 22+ messages in thread From: Christoph Hellwig @ 2011-05-04 11:18 UTC (permalink / raw) To: Dave Chinner Cc: Wu Fengguang, Christoph Hellwig, Andrew Morton, Jan Kara, LKML, linux-fsdevel@vger.kernel.org On Wed, May 04, 2011 at 08:06:09PM +1000, Dave Chinner wrote: > > These patches are based on the writeback patches already in -mm. > > Sorry, I should have mentioned it. > > Does anyone actually testing filesystems use the -mm tree? I'm > pretty sure that no-one in the XFS world does, and I don't think > that any ext4 or btrfs folk do, either.... I doubt it, since -mm still hasn't made it into linux-next. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/6] writeback fixes and trace events 2011-05-04 10:06 ` Dave Chinner 2011-05-04 11:18 ` Christoph Hellwig @ 2011-05-04 13:12 ` Theodore Tso 2011-05-04 22:31 ` Jan Kara 1 sibling, 1 reply; 22+ messages in thread From: Theodore Tso @ 2011-05-04 13:12 UTC (permalink / raw) To: Dave Chinner Cc: Wu Fengguang, Christoph Hellwig, Andrew Morton, Jan Kara, LKML, linux-fsdevel@vger.kernel.org On May 4, 2011, at 6:06 AM, Dave Chinner wrote: > Does anyone actually testing filesystems use the -mm tree? I'm > pretty sure that no-one in the XFS world does, and I don't think > that any ext4 or btrfs folk do, either.... I don't. I agree that it really would be great if there were a separate git tree for the writeback changes, since that way it becomes possible to test just the writeback changes, and not worry about other potential stability problems introduced by changes in the -mm tree.... -- Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/6] writeback fixes and trace events 2011-05-04 13:12 ` Theodore Tso @ 2011-05-04 22:31 ` Jan Kara 0 siblings, 0 replies; 22+ messages in thread From: Jan Kara @ 2011-05-04 22:31 UTC (permalink / raw) To: Theodore Tso Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Andrew Morton, Jan Kara, LKML, linux-fsdevel@vger.kernel.org On Wed 04-05-11 09:12:54, Ted Tso wrote: > On May 4, 2011, at 6:06 AM, Dave Chinner wrote: > > > Does anyone actually testing filesystems use the -mm tree? I'm > > pretty sure that no-one in the XFS world does, and I don't think > > that any ext4 or btrfs folk do, either.... > > I don't. I agree that it really would be great if there were a separate > git tree for the writeback changes, since that way it becomes possible to > test just the writeback changes, and not worry about other potential > stability problems introduced by changes in the -mm tree.... OK. We'd still push changes to Linus via Andrew but have them also accumulated in that tree for testing. So we'd have a branch with changes sitting in -mm (likely to go to Linus in the near future) and then possibly other branches with things brewing for people to try out. Does that make sense? I can set up that tree and maintain it. Or Fengguang, do you want to do it? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread