* [PATCH 0/6] writeback fixes and trace events
@ 2011-05-04 9:17 Wu Fengguang
2011-05-04 9:17 ` [PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
` (6 more replies)
0 siblings, 7 replies; 22+ messages in thread
From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Wu Fengguang, LKML,
linux-fsdevel
Andrew,
Here is one more collection of simple writeback patches.
two already reviewed fixes
[PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc
[PATCH 2/6] writeback: skip balance_dirty_pages() for in-memory fs
a straightforward write size optimization
[PATCH 3/6] writeback: make nr_to_write a per-file limit
two new trace events
[PATCH 4/6] writeback: trace event writeback_single_inode
[PATCH 5/6] writeback: trace event writeback_queue_io
an RFC wbc_writeback_* trace format change
(please drop it if it cannot get ACKs from others)
[PATCH 6/6] writeback: convert to relative older_than_this in trace events
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc
2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang
@ 2011-05-04 9:17 ` Wu Fengguang
2011-05-04 9:17 ` [PATCH 2/6] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
` (5 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Dave Chinner, LKML,
linux-fsdevel
[-- Attachment #1: writeback-task_dirty_limit-comment.patch --]
[-- Type: text/plain, Size: 1193 bytes --]
Clarify the bdi_dirty_limit() comment.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2011-04-19 10:18:16.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-04-19 10:18:32.000000000 +0800
@@ -437,10 +437,17 @@ void global_dirty_limits(unsigned long *
*pdirty = dirty;
}
-/*
+/**
* bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ * @bdi: the backing_dev_info to query
+ * @dirty: global dirty limit in pages
+ *
+ * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * Note that the "limit" in the name is not taken as a hard limit by
+ * balance_dirty_pages().
*
- * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * It allocates high/low dirty limits to fast/slow devices, in order to prevent
* - starving fast devices
* - piling up dirty pages (that will take long time to sync) on slow devices
*
* [PATCH 2/6] writeback: skip balance_dirty_pages() for in-memory fs
2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang
2011-05-04 9:17 ` [PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
@ 2011-05-04 9:17 ` Wu Fengguang
2011-05-04 9:17 ` [PATCH 3/6] writeback: make nr_to_write a per-file limit Wu Fengguang
` (4 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Hugh Dickins, Rik van Riel, Peter Zijlstra,
Wu Fengguang, Dave Chinner, LKML, linux-fsdevel
[-- Attachment #1: writeback-trace-global-dirty-states-fix.patch --]
[-- Type: text/plain, Size: 2964 bytes --]
This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
It can also prevent
[ 388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
in the balance_dirty_pages tracepoint, which will call
dev_name(mapping->backing_dev_info->dev)
but shmem_backing_dev_info.dev is NULL.
Summary notes about the tmpfs/ramfs behavior changes:
In 2.6.36 and older kernels, tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold. This is because both the dirty pages and the
threshold will be 0 for tmpfs/ramfs, hence this test always evaluates
to TRUE:
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
|| (nr_reclaimable + nr_writeback >= dirty_thresh);
For 2.6.37, someone complained that the logic did not allow users to
set vm.dirty_ratio=0, so commit 4cbec4c8b9 changed the test to
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
|| (nr_reclaimable + nr_writeback > dirty_thresh);
So 2.6.37 behaves differently for tmpfs/ramfs: it will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen; once it does, it will block many tasks).
I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
that on a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling can hardly help any workload because of its
"either no throttling, or get throttled to death" property.
So, based on 2.6.37, this patch won't bring any further noticeable changes.
CC: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2011-04-13 17:18:10.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-04-13 17:18:10.000000000 +0800
@@ -244,13 +244,8 @@ void task_dirty_inc(struct task_struct *
static void bdi_writeout_fraction(struct backing_dev_info *bdi,
long *numerator, long *denominator)
{
- if (bdi_cap_writeback_dirty(bdi)) {
- prop_fraction_percpu(&vm_completions, &bdi->completions,
+ prop_fraction_percpu(&vm_completions, &bdi->completions,
numerator, denominator);
- } else {
- *numerator = 0;
- *denominator = 1;
- }
}
static inline void task_dirties_fraction(struct task_struct *tsk,
@@ -495,6 +490,9 @@ static void balance_dirty_pages(struct a
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ if (!bdi_cap_account_dirty(bdi))
+ return;
+
for (;;) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
* [PATCH 3/6] writeback: make nr_to_write a per-file limit
2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang
2011-05-04 9:17 ` [PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
2011-05-04 9:17 ` [PATCH 2/6] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
@ 2011-05-04 9:17 ` Wu Fengguang
2011-05-04 9:42 ` Christoph Hellwig
2011-05-04 9:17 ` [PATCH 4/6] writeback: trace event writeback_single_inode Wu Fengguang
` (3 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jan Kara, Dave Chinner, Wu Fengguang, LKML, linux-fsdevel
[-- Attachment #1: writeback-single-file-limit.patch --]
[-- Type: text/plain, Size: 2001 bytes --]
This ensures large dirty files can be written in the full 4MB writeback
chunk size, rather than in whatever quota remains in wbc->nr_to_write.
CC: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 11 +++++++++++
include/linux/writeback.h | 1 +
2 files changed, 12 insertions(+)
--- linux-next.orig/fs/fs-writeback.c 2011-05-04 16:00:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-04 16:01:00.000000000 +0800
@@ -346,6 +346,8 @@ writeback_single_inode(struct inode *ino
struct writeback_control *wbc)
{
struct address_space *mapping = inode->i_mapping;
+ long per_file_limit = wbc->per_file_limit;
+ long uninitialized_var(nr_to_write);
unsigned dirty;
int ret;
@@ -385,8 +387,16 @@ writeback_single_inode(struct inode *ino
spin_unlock(&inode->i_lock);
spin_unlock(&wb->list_lock);
+ if (per_file_limit) {
+ nr_to_write = wbc->nr_to_write;
+ wbc->nr_to_write = per_file_limit;
+ }
+
ret = do_writepages(mapping, wbc);
+ if (per_file_limit)
+ wbc->nr_to_write += nr_to_write - per_file_limit;
+
/*
* Make sure to wait on the data before writing out the metadata.
* This is important for filesystems that modify metadata on data
@@ -710,6 +720,7 @@ static long wb_writeback(struct bdi_writ
wbc.more_io = 0;
wbc.nr_to_write = write_chunk;
+ wbc.per_file_limit = write_chunk;
wbc.pages_skipped = 0;
wbc.inodes_cleaned = 0;
--- linux-next.orig/include/linux/writeback.h 2011-05-04 16:00:45.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-05-04 16:01:00.000000000 +0800
@@ -28,6 +28,7 @@ struct writeback_control {
older than this */
long nr_to_write; /* Write this many pages, and decrement
this for each page written */
+ long per_file_limit; /* Write this many pages for one file */
long pages_skipped; /* Pages which were not written */
long inodes_cleaned; /* # of inodes cleaned */
* [PATCH 4/6] writeback: trace event writeback_single_inode
2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang
` (2 preceding siblings ...)
2011-05-04 9:17 ` [PATCH 3/6] writeback: make nr_to_write a per-file limit Wu Fengguang
@ 2011-05-04 9:17 ` Wu Fengguang
2011-05-04 9:17 ` [PATCH 5/6] writeback: trace event writeback_queue_io Wu Fengguang
` (2 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jan Kara, Wu Fengguang, Dave Chinner, LKML, linux-fsdevel
[-- Attachment #1: writeback-trace-writeback_single_inode.patch --]
[-- Type: text/plain, Size: 3776 bytes --]
It is valuable to know how dirty inodes are iterated over and what their IO sizes are.
"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 wrote=0 to_write=1024"
- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "wrote" is the number of pages written in this writeback_single_inode()
- "to_write" is the remaining wbc->nr_to_write
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 12 +++---
include/trace/events/writeback.h | 56 +++++++++++++++++++++++++++++
2 files changed, 63 insertions(+), 5 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-05-04 16:01:00.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-04 16:26:34.000000000 +0800
@@ -347,7 +347,7 @@ writeback_single_inode(struct inode *ino
{
struct address_space *mapping = inode->i_mapping;
long per_file_limit = wbc->per_file_limit;
- long uninitialized_var(nr_to_write);
+ long nr_to_write = wbc->nr_to_write;
unsigned dirty;
int ret;
@@ -370,7 +370,8 @@ writeback_single_inode(struct inode *ino
*/
if (wbc->sync_mode != WB_SYNC_ALL) {
requeue_io(inode, wb);
- return 0;
+ ret = 0;
+ goto out;
}
/*
@@ -387,10 +388,8 @@ writeback_single_inode(struct inode *ino
spin_unlock(&inode->i_lock);
spin_unlock(&wb->list_lock);
- if (per_file_limit) {
- nr_to_write = wbc->nr_to_write;
+ if (per_file_limit)
wbc->nr_to_write = per_file_limit;
- }
ret = do_writepages(mapping, wbc);
@@ -477,6 +476,9 @@ writeback_single_inode(struct inode *ino
}
}
inode_sync_complete(inode);
+out:
+ trace_writeback_single_inode(inode, wbc,
+ nr_to_write - wbc->nr_to_write);
return ret;
}
--- linux-next.orig/include/trace/events/writeback.h 2011-05-04 15:59:23.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2011-05-04 16:27:18.000000000 +0800
@@ -8,6 +8,19 @@
#include <linux/device.h>
#include <linux/writeback.h>
+#define show_inode_state(state) \
+ __print_flags(state, "|", \
+ {I_DIRTY_SYNC, "I_DIRTY_SYNC"}, \
+ {I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
+ {I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
+ {I_NEW, "I_NEW"}, \
+ {I_WILL_FREE, "I_WILL_FREE"}, \
+ {I_FREEING, "I_FREEING"}, \
+ {I_CLEAR, "I_CLEAR"}, \
+ {I_SYNC, "I_SYNC"}, \
+ {I_REFERENCED, "I_REFERENCED"} \
+ )
+
struct wb_writeback_work;
DECLARE_EVENT_CLASS(writeback_work_class,
@@ -187,6 +200,49 @@ DEFINE_EVENT(writeback_congest_waited_te
TP_ARGS(usec_timeout, usec_delayed)
);
+TRACE_EVENT(writeback_single_inode,
+
+ TP_PROTO(struct inode *inode,
+ struct writeback_control *wbc,
+ unsigned long wrote
+ ),
+
+ TP_ARGS(inode, wbc, wrote),
+
+ TP_STRUCT__entry(
+ __array(char, name, 32)
+ __field(unsigned long, ino)
+ __field(unsigned long, state)
+ __field(unsigned long, age)
+ __field(unsigned long, writeback_index)
+ __field(unsigned long, wrote)
+ __field(long, nr_to_write)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name,
+ dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+ __entry->ino = inode->i_ino;
+ __entry->state = inode->i_state;
+ __entry->age = (jiffies - inode->dirtied_when) *
+ 1000 / HZ;
+ __entry->writeback_index = inode->i_mapping->writeback_index;
+ __entry->wrote = wrote;
+ __entry->nr_to_write = wbc->nr_to_write;
+ ),
+
+ TP_printk("bdi %s: ino=%lu state=%s age=%lu "
+ "index=%lu wrote=%lu to_write=%ld",
+ __entry->name,
+ __entry->ino,
+ show_inode_state(__entry->state),
+ __entry->age,
+ __entry->writeback_index,
+ __entry->wrote,
+ __entry->nr_to_write
+ )
+);
+
#endif /* _TRACE_WRITEBACK_H */
/* This part must be outside protection */
* [PATCH 5/6] writeback: trace event writeback_queue_io
2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang
` (3 preceding siblings ...)
2011-05-04 9:17 ` [PATCH 4/6] writeback: trace event writeback_single_inode Wu Fengguang
@ 2011-05-04 9:17 ` Wu Fengguang
2011-05-05 16:37 ` [PATCH 5/6] writeback: trace event writeback_queue_io (v2) Wu Fengguang
2011-05-04 9:17 ` [PATCH 6/6] writeback: convert to relative older_than_this in trace events Wu Fengguang
2011-05-04 9:46 ` [PATCH 0/6] writeback fixes and " Christoph Hellwig
6 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jan Kara, Wu Fengguang, Dave Chinner, LKML, linux-fsdevel
[-- Attachment #1: writeback-trace-queue_io.patch --]
[-- Type: text/plain, Size: 3305 bytes --]
Note that this adds a little overhead to account for the inodes
moved/enqueued from b_dirty to b_io. The "moved" accounting may later be
used to limit the number of inodes that can be moved in one shot, in
order to keep the spinlock hold time under control.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 16 +++++++++++-----
include/trace/events/writeback.h | 23 +++++++++++++++++++++++
2 files changed, 34 insertions(+), 5 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-05-04 16:01:27.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-04 16:09:01.000000000 +0800
@@ -248,15 +248,16 @@ static bool inode_dirtied_after(struct i
/*
* Move expired dirty inodes from @delaying_queue to @dispatch_queue.
*/
-static void move_expired_inodes(struct list_head *delaying_queue,
- struct list_head *dispatch_queue,
- struct writeback_control *wbc)
+static int move_expired_inodes(struct list_head *delaying_queue,
+ struct list_head *dispatch_queue,
+ struct writeback_control *wbc)
{
LIST_HEAD(tmp);
struct list_head *pos, *node;
struct super_block *sb = NULL;
struct inode *inode;
int do_sb_sort = 0;
+ int moved = 0;
while (!list_empty(delaying_queue)) {
inode = wb_inode(delaying_queue->prev);
@@ -267,12 +268,13 @@ static void move_expired_inodes(struct l
do_sb_sort = 1;
sb = inode->i_sb;
list_move(&inode->i_wb_list, &tmp);
+ moved++;
}
/* just one sb in list, splice to dispatch_queue and we're done */
if (!do_sb_sort) {
list_splice(&tmp, dispatch_queue);
- return;
+ goto out;
}
/* Move inodes from one superblock together */
@@ -284,6 +286,8 @@ static void move_expired_inodes(struct l
list_move(&inode->i_wb_list, dispatch_queue);
}
}
+out:
+ return moved;
}
/*
@@ -299,9 +303,11 @@ static void move_expired_inodes(struct l
*/
static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
{
+ int moved;
assert_spin_locked(&wb->list_lock);
list_splice_init(&wb->b_more_io, &wb->b_io);
- move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
+ moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
+ trace_writeback_queue_io(wb, wbc, moved);
}
static int write_inode(struct inode *inode, struct writeback_control *wbc)
--- linux-next.orig/include/trace/events/writeback.h 2011-05-04 16:06:50.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2011-05-04 16:08:58.000000000 +0800
@@ -165,6 +165,29 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_writt
DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
DEFINE_WBC_EVENT(wbc_writepage);
+TRACE_EVENT(writeback_queue_io,
+ TP_PROTO(struct bdi_writeback *wb,
+ struct writeback_control *wbc,
+ int moved),
+ TP_ARGS(wb, wbc, moved),
+ TP_STRUCT__entry(
+ __array(char, name, 32)
+ __field(int, older)
+ __field(int, moved)
+ ),
+ TP_fast_assign(
+ strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
+ __entry->older = wbc->older_than_this ?
+ (jiffies - *wbc->older_than_this) * 1000 / HZ
+ : -1;
+ __entry->moved = moved;
+ ),
+ TP_printk("bdi %s: older=%d enqueue=%d",
+ __entry->name,
+ __entry->older,
+ __entry->moved)
+);
+
DECLARE_EVENT_CLASS(writeback_congest_waited_template,
TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
* [PATCH 6/6] writeback: convert to relative older_than_this in trace events
2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang
` (4 preceding siblings ...)
2011-05-04 9:17 ` [PATCH 5/6] writeback: trace event writeback_queue_io Wu Fengguang
@ 2011-05-04 9:17 ` Wu Fengguang
2011-05-04 22:23 ` Jan Kara
2011-05-04 9:46 ` [PATCH 0/6] writeback fixes and " Christoph Hellwig
6 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2011-05-04 9:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jan Kara, Dave Chinner, Wu Fengguang, LKML, linux-fsdevel
[-- Attachment #1: writeback-trace-older.patch --]
[-- Type: text/plain, Size: 1897 bytes --]
This is a format change that could break established scripts.
But hopefully it's more human-friendly to show relative values like
"30000" (30s old) or "0" (all fresh) than long absolute jiffy
numbers.
Note that the initial 30000 or 0 may increase over time as they are now
relative numbers.
CC: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/writeback.h | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
--- linux-next.orig/include/trace/events/writeback.h 2011-05-04 16:09:59.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2011-05-04 16:10:57.000000000 +0800
@@ -115,7 +115,7 @@ DECLARE_EVENT_CLASS(wbc_class,
__field(int, for_reclaim)
__field(int, range_cyclic)
__field(int, more_io)
- __field(unsigned long, older_than_this)
+ __field(int, older)
__field(long, range_start)
__field(long, range_end)
),
@@ -130,14 +130,15 @@ DECLARE_EVENT_CLASS(wbc_class,
__entry->for_reclaim = wbc->for_reclaim;
__entry->range_cyclic = wbc->range_cyclic;
__entry->more_io = wbc->more_io;
- __entry->older_than_this = wbc->older_than_this ?
- *wbc->older_than_this : 0;
+ __entry->older = wbc->older_than_this ?
+ (jiffies - *wbc->older_than_this) * 1000 / HZ
+ : -1;
__entry->range_start = (long)wbc->range_start;
__entry->range_end = (long)wbc->range_end;
),
TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
- "bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+ "bgrd=%d reclm=%d cyclic=%d more=%d older=%d "
"start=0x%lx end=0x%lx",
__entry->name,
__entry->nr_to_write,
@@ -148,7 +149,7 @@ DECLARE_EVENT_CLASS(wbc_class,
__entry->for_reclaim,
__entry->range_cyclic,
__entry->more_io,
- __entry->older_than_this,
+ __entry->older,
__entry->range_start,
__entry->range_end)
)
* Re: [PATCH 3/6] writeback: make nr_to_write a per-file limit
2011-05-04 9:17 ` [PATCH 3/6] writeback: make nr_to_write a per-file limit Wu Fengguang
@ 2011-05-04 9:42 ` Christoph Hellwig
2011-05-04 11:52 ` Wu Fengguang
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2011-05-04 9:42 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML, linux-fsdevel
On Wed, May 04, 2011 at 05:17:10PM +0800, Wu Fengguang wrote:
> This ensures large dirty files can be written in the full 4MB writeback
> chunk size, rather than in whatever quota remains in wbc->nr_to_write.
I like the high-level idea, but the implementation of overriding
nr_to_write and then copying it back seems rather ugly.
The basic problem seems to be that struct writeback_control is
designed to control writeback of a single file, but we keep abusing it
for writing multiple files in writeback_sb_inodes and its callers.
It seems like we should only build the struct writeback_control from
struct wb_writeback_work down in writeback_sb_inodes, even if that
means passing some more information to it either in struct
wb_writeback_work or on the stack.
Then writeback_sb_inodes can do something like
if (wbc.sync_mode == WB_SYNC_NONE)
wbc.nr_to_write = min(MAX_WRITEBACK_PAGES, work->nr_pages);
else
wbc.nr_to_write = LONG_MAX;
for each inode it writes.
* Re: [PATCH 0/6] writeback fixes and trace events
2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang
` (5 preceding siblings ...)
2011-05-04 9:17 ` [PATCH 6/6] writeback: convert to relative older_than_this in trace events Wu Fengguang
@ 2011-05-04 9:46 ` Christoph Hellwig
2011-05-04 9:56 ` Wu Fengguang
6 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2011-05-04 9:46 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig, LKML,
linux-fsdevel
On Wed, May 04, 2011 at 05:17:07PM +0800, Wu Fengguang wrote:
> Andrew,
>
> Here is one more collection of simple writeback patches.
How does this relate to the other floating writeback patches? We
really need a proper writeback tree to make any sense of the patch
queues..
* Re: [PATCH 0/6] writeback fixes and trace events
2011-05-04 9:46 ` [PATCH 0/6] writeback fixes and " Christoph Hellwig
@ 2011-05-04 9:56 ` Wu Fengguang
2011-05-04 10:06 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2011-05-04 9:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML,
linux-fsdevel@vger.kernel.org
On Wed, May 04, 2011 at 05:46:54PM +0800, Christoph Hellwig wrote:
> On Wed, May 04, 2011 at 05:17:07PM +0800, Wu Fengguang wrote:
> > Andrew,
> >
> > Here is one more collection of simple writeback patches.
>
> How does this relate to the other floating writeback patches? We
> really need a proper writeback tree to make any sense of the patch
> queues..
These patches are based on the writeback patches already in -mm.
Sorry I should have mentioned it.
Thanks,
Fengguang
* Re: [PATCH 0/6] writeback fixes and trace events
2011-05-04 9:56 ` Wu Fengguang
@ 2011-05-04 10:06 ` Dave Chinner
2011-05-04 11:18 ` Christoph Hellwig
2011-05-04 13:12 ` Theodore Tso
0 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2011-05-04 10:06 UTC (permalink / raw)
To: Wu Fengguang
Cc: Christoph Hellwig, Andrew Morton, Jan Kara, LKML,
linux-fsdevel@vger.kernel.org
On Wed, May 04, 2011 at 05:56:37PM +0800, Wu Fengguang wrote:
> On Wed, May 04, 2011 at 05:46:54PM +0800, Christoph Hellwig wrote:
> > On Wed, May 04, 2011 at 05:17:07PM +0800, Wu Fengguang wrote:
> > > Andrew,
> > >
> > > Here is one more collection of simple writeback patches.
> >
> > How does this relate to the other floating writeback patches? We
> > really need a proper writeback tree to make any sense of the patch
> > queues..
>
> These patches are based on the writeback patches already in -mm.
> Sorry I should have mentioned it.
Does anyone actually testing filesystems use the -mm tree? I'm
pretty sure that no-one in the XFS world does, and I don't think
that any ext4 or btrfs folk do, either....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 0/6] writeback fixes and trace events
2011-05-04 10:06 ` Dave Chinner
@ 2011-05-04 11:18 ` Christoph Hellwig
2011-05-04 13:12 ` Theodore Tso
1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2011-05-04 11:18 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Andrew Morton, Jan Kara, LKML,
linux-fsdevel@vger.kernel.org
On Wed, May 04, 2011 at 08:06:09PM +1000, Dave Chinner wrote:
> > These patches are based on the writeback patches already in -mm.
> > Sorry I should have mentioned it.
>
> Does anyone actually testing filesystems use the -mm tree? I'm
> pretty sure that no-one in the XFS world does, and I don't think
> that any ext4 or btrfs folk do, either....
I doubt it since -mm still hasn't made it into linux-next.
* Re: [PATCH 3/6] writeback: make nr_to_write a per-file limit
2011-05-04 9:42 ` Christoph Hellwig
@ 2011-05-04 11:52 ` Wu Fengguang
2011-05-04 15:51 ` Wu Fengguang
0 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2011-05-04 11:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML,
linux-fsdevel@vger.kernel.org
On Wed, May 04, 2011 at 05:42:21PM +0800, Christoph Hellwig wrote:
> On Wed, May 04, 2011 at 05:17:10PM +0800, Wu Fengguang wrote:
> > This ensures large dirty files can be written in the full 4MB writeback
> > chunk size, rather than in whatever quota remains in wbc->nr_to_write.
>
> I like the high-level idea, but the implementation of overriding
> nr_to_write and then copying it back seems rather ugly.
>
> The basic problem seems to be that struct writeback_control is
> designed to control writeback of a single file, but we keep abusing it
> for writing multiple files in writeback_sb_inodes and its callers.
>
> It seems like we should only build the struct writeback_control from
> struct wb_writeback_work down in writeback_sb_inodes, even if that
> means passing some more information to it either in struct
> wb_writeback_work or on the stack.
Yes it's very reasonable and possible according to your notes in
another email.
> Then writeback_sb_inodes can do something like
>
> if (wbc.sync_mode == WB_SYNC_NONE)
> wbc.nr_to_write = min(MAX_WRITEBACK_PAGES, work->nr_pages);
I like the min() idea. However it has the side effect of (very possibly)
smallish IO from balance_dirty_pages(), which may call us with a small
->nr_pages.
We may explicitly do "write_chunk = max(MAX_WRITEBACK_PAGES, write_chunk)"
in balance_dirty_pages() to retain the old behavior.
> else
> wbc.nr_to_write = LONG_MAX;
>
> for each inode it writes.
Thanks,
Fengguang
* Re: [PATCH 0/6] writeback fixes and trace events
2011-05-04 10:06 ` Dave Chinner
2011-05-04 11:18 ` Christoph Hellwig
@ 2011-05-04 13:12 ` Theodore Tso
2011-05-04 22:31 ` Jan Kara
1 sibling, 1 reply; 22+ messages in thread
From: Theodore Tso @ 2011-05-04 13:12 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Andrew Morton, Jan Kara, LKML,
linux-fsdevel@vger.kernel.org
On May 4, 2011, at 6:06 AM, Dave Chinner wrote:
> Does anyone actually testing filesystems use the -mm tree? I'm
> pretty sure that no-one in the XFS world does, and I don't think
> that any ext4 or btrfs folk do, either....
I don't. I agree that it really would be great if there was a separate git tree for the writeback changes, since that way it becomes possible to test just the writeback changes, and not worry about other potential stability problems introduced by changes in the -mm tree....
-- Ted
* Re: [PATCH 3/6] writeback: make nr_to_write a per-file limit
2011-05-04 11:52 ` Wu Fengguang
@ 2011-05-04 15:51 ` Wu Fengguang
2011-05-04 16:18 ` Christoph Hellwig
0 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2011-05-04 15:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML,
linux-fsdevel@vger.kernel.org
> > Then writeback_sb_inodes can do something like
> >
> > if (wbc.sync_mode == WB_SYNC_NONE)
> > wbc.nr_to_write = min(MAX_WRITEBACK_PAGES, work->nr_pages);
>
> I like the min() idea. However it has the side effect of (very possibly)
> smallish IO from balance_dirty_pages(), which may call us with a small
> ->nr_pages.
>
> We may explicitly do "write_chunk = max(MAX_WRITEBACK_PAGES, write_chunk)"
> in balance_dirty_pages() to retain the old behavior.
Sorry, I was confused; please ignore the above. At least the VFS won't
enlarge the writeback request from balance_dirty_pages()..
Here is the scratch patch. I'll need to double-check it tomorrow, but
it's already running fine in the dd+tar+sync test and gets pretty good
performance numbers.
# test-tar-dd.sh
[...]
246.62 1941.38
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 11.4383 s, 91.7 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 2.57s system 22% cpu 11.499 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.03s user 4.36s system 56% cpu 28.770 total
286.04 2204.50
elapsed: 263.23000000000025
Thanks,
Fengguang
---
Subject: writeback: make writeback_control.nr_to_write straight
Date: Wed May 04 19:54:37 CST 2011
As suggested by Christoph:
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.
struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.
It immediately cleans things up; e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped, writeback_sb_inodes() can always start with a clean
zero value.
It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than in whatever
quota remains in wbc->nr_to_write.
change set:
- move writeback_control from wb_writeback() down to writeback_sb_inodes()
- change return value of writeback_sb_inodes()/__writeback_inodes_wb()
to enum writeback_progress
- move MAX_WRITEBACK_PAGES and wb_writeback_work definitions to writeback.h
- add wb_writeback_work.older_than_this
- remove writeback_control.inodes_cleaned
- remove wbc_writeback_* trace events for now
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 187 +++++++++++++++---------------------
include/linux/writeback.h | 29 +++++
mm/backing-dev.c | 6 -
mm/page-writeback.c | 11 --
4 files changed, 116 insertions(+), 117 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-05-04 21:30:32.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-04 23:40:59.000000000 +0800
@@ -30,22 +30,6 @@
#include "internal.h"
/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
- long nr_pages;
- struct super_block *sb;
- enum writeback_sync_modes sync_mode;
- unsigned int tagged_sync:1;
- unsigned int for_kupdate:1;
- unsigned int range_cyclic:1;
- unsigned int for_background:1;
-
- struct list_head list; /* pending work list */
- struct completion *done; /* set if the caller waits */
-};
-
-/*
* Include the creation of the trace points after defining the
* wb_writeback_work structure so that the definition remains local to this
* file.
@@ -463,7 +447,6 @@ writeback_single_inode(struct inode *ino
* No need to add it back to the LRU.
*/
list_del_init(&inode->i_wb_list);
- wbc->inodes_cleaned++;
}
}
inode_sync_complete(inode);
@@ -502,19 +485,53 @@ static bool pin_sb_for_writeback(struct
* If @only_this_sb is true, then find and write all such
* inodes. Otherwise write only ones which go sequentially
* in reverse order.
- *
- * Return 1, if the caller writeback routine should be
- * interrupted. Otherwise return 0.
*/
-static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
- struct writeback_control *wbc, bool only_this_sb)
+enum writeback_progress {
+ WROTE_FULL_CHUNK,
+ WROTE_SOME_PAGES,
+ WROTE_SOME_INODES,
+ WROTE_NOTHING,
+};
+
+static int writeback_sb_inodes(struct super_block *sb,
+ struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
{
+ struct writeback_control wbc = {
+ .sync_mode = work->sync_mode,
+ .tagged_sync = work->tagged_sync,
+ .older_than_this = work->older_than_this,
+ .for_kupdate = work->for_kupdate,
+ .for_background = work->for_background,
+ .range_cyclic = work->range_cyclic,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ };
+ long write_chunk = MAX_WRITEBACK_PAGES;
+ long wrote = 0;
+ bool inode_cleaned = false;
+
+ /*
+ * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+ * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+ * here avoids calling into writeback_inodes_wb() more than once.
+ *
+ * The intended call sequence for WB_SYNC_ALL writeback is:
+ *
+ * wb_writeback()
+ * writeback_sb_inodes() <== called only once
+ * write_cache_pages() <== called once for each inode
+ * (quickly) tag currently dirty pages
+ * (maybe slowly) sync all tagged pages
+ */
+ if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync)
+ write_chunk = LONG_MAX;
+
while (!list_empty(&wb->b_io)) {
- long pages_skipped;
struct inode *inode = wb_inode(wb->b_io.prev);
if (inode->i_sb != sb) {
- if (only_this_sb) {
+ if (work->sb) {
/*
* We only want to write back data for this
* superblock, move all inodes not belonging
@@ -529,7 +546,7 @@ static int writeback_sb_inodes(struct su
* Bounce back to the caller to unpin this and
* pin the next superblock.
*/
- return 0;
+ break;
}
/*
@@ -543,34 +560,44 @@ static int writeback_sb_inodes(struct su
requeue_io(inode, wb);
continue;
}
-
__iget(inode);
+ write_chunk = min(write_chunk, work->nr_pages);
+ wbc.nr_to_write = write_chunk;
+ wbc.pages_skipped = 0;
+
+ writeback_single_inode(inode, wb, &wbc);
- pages_skipped = wbc->pages_skipped;
- writeback_single_inode(inode, wb, wbc);
- if (wbc->pages_skipped != pages_skipped) {
+ work->nr_pages -= write_chunk - wbc.nr_to_write;
+ wrote += write_chunk - wbc.nr_to_write;
+ if (wbc.pages_skipped) {
/*
* writeback is not making progress due to locked
* buffers. Skip this inode for now.
*/
redirty_tail(inode, wb);
- }
+ } else if (!(inode->i_state & I_DIRTY))
+ inode_cleaned = true;
spin_unlock(&inode->i_lock);
spin_unlock(&wb->list_lock);
iput(inode);
cond_resched();
spin_lock(&wb->list_lock);
- if (wbc->nr_to_write <= 0)
- return 1;
+ if (wrote >= MAX_WRITEBACK_PAGES)
+ return WROTE_FULL_CHUNK;
+ if (work->nr_pages <= 0)
+ return WROTE_FULL_CHUNK;
}
- /* b_io is empty */
- return 1;
+ if (wrote)
+ return WROTE_SOME_PAGES;
+ if (inode_cleaned)
+ return WROTE_SOME_INODES;
+ return WROTE_NOTHING;
}
-static void __writeback_inodes_wb(struct bdi_writeback *wb,
- struct writeback_control *wbc)
+static int __writeback_inodes_wb(struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
{
- int ret = 0;
+ int ret = WROTE_NOTHING;
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -580,34 +607,26 @@ static void __writeback_inodes_wb(struct
requeue_io(inode, wb);
continue;
}
- ret = writeback_sb_inodes(sb, wb, wbc, false);
+ ret = writeback_sb_inodes(sb, wb, work);
drop_super(sb);
- if (ret)
+ if (ret == WROTE_FULL_CHUNK)
break;
}
/* Leave any unwritten inodes on b_io */
+ return ret;
}
void writeback_inodes_wb(struct bdi_writeback *wb,
- struct writeback_control *wbc)
+ struct wb_writeback_work *work)
{
spin_lock(&wb->list_lock);
if (list_empty(&wb->b_io))
- queue_io(wb, wbc->older_than_this);
- __writeback_inodes_wb(wb, wbc);
+ queue_io(wb, work->older_than_this);
+ __writeback_inodes_wb(wb, work);
spin_unlock(&wb->list_lock);
}
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation. We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode. Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES 1024
-
static inline bool over_bground_thresh(void)
{
unsigned long background_thresh, dirty_thresh;
@@ -636,42 +655,12 @@ static inline bool over_bground_thresh(v
static long wb_writeback(struct bdi_writeback *wb,
struct wb_writeback_work *work)
{
- struct writeback_control wbc = {
- .sync_mode = work->sync_mode,
- .tagged_sync = work->tagged_sync,
- .older_than_this = NULL,
- .for_kupdate = work->for_kupdate,
- .for_background = work->for_background,
- .range_cyclic = work->range_cyclic,
- };
- unsigned long oldest_jif;
- long wrote = 0;
- long write_chunk = MAX_WRITEBACK_PAGES;
+ long nr_pages = work->nr_pages;
+ unsigned long oldest_jif = jiffies;
struct inode *inode;
+ int progress;
- if (!wbc.range_cyclic) {
- wbc.range_start = 0;
- wbc.range_end = LLONG_MAX;
- }
-
- /*
- * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
- * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
- * here avoids calling into writeback_inodes_wb() more than once.
- *
- * The intended call sequence for WB_SYNC_ALL writeback is:
- *
- * wb_writeback()
- * writeback_sb_inodes() <== called only once
- * write_cache_pages() <== called once for each inode
- * (quickly) tag currently dirty pages
- * (maybe slowly) sync all tagged pages
- */
- if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync) {
- write_chunk = LONG_MAX;
- oldest_jif = jiffies;
- wbc.older_than_this = &oldest_jif;
- }
+ work->older_than_this = &oldest_jif;
for (;;) {
/*
@@ -700,27 +689,18 @@ static long wb_writeback(struct bdi_writ
if (work->for_kupdate || work->for_background) {
oldest_jif = jiffies -
msecs_to_jiffies(dirty_expire_interval * 10);
- wbc.older_than_this = &oldest_jif;
+ work->older_than_this = &oldest_jif;
}
- wbc.nr_to_write = write_chunk;
- wbc.pages_skipped = 0;
- wbc.inodes_cleaned = 0;
-
retry:
- trace_wbc_writeback_start(&wbc, wb->bdi);
spin_lock(&wb->list_lock);
if (list_empty(&wb->b_io))
- queue_io(wb, wbc->older_than_this);
+ queue_io(wb, work->older_than_this);
if (work->sb)
- writeback_sb_inodes(work->sb, wb, &wbc, true);
+ progress = writeback_sb_inodes(work->sb, wb, work);
else
- __writeback_inodes_wb(wb, &wbc);
+ progress = __writeback_inodes_wb(wb, work);
spin_unlock(&wb->list_lock);
- trace_wbc_writeback_written(&wbc, wb->bdi);
-
- work->nr_pages -= write_chunk - wbc.nr_to_write;
- wrote += write_chunk - wbc.nr_to_write;
/*
* Did we write something? Try for more
@@ -730,9 +710,7 @@ retry:
* mean the overall work is done. So we keep looping as long
* as made some progress on cleaning pages or inodes.
*/
- if (wbc.nr_to_write < write_chunk)
- continue;
- if (wbc.inodes_cleaned)
+ if (progress != WROTE_NOTHING)
continue;
/*
* background writeback will start with expired inodes, and
@@ -741,10 +719,10 @@ retry:
* lists and cause trouble to the page reclaim.
*/
if (work->for_background &&
- wbc.older_than_this &&
+ work->older_than_this &&
list_empty(&wb->b_io) &&
list_empty(&wb->b_more_io)) {
- wbc.older_than_this = NULL;
+ work->older_than_this = NULL;
goto retry;
}
/*
@@ -760,7 +738,6 @@ retry:
spin_lock(&wb->list_lock);
if (!list_empty(&wb->b_more_io)) {
inode = wb_inode(wb->b_more_io.prev);
- trace_wbc_writeback_wait(&wbc, wb->bdi);
spin_lock(&inode->i_lock);
inode_wait_for_writeback(inode, wb);
spin_unlock(&inode->i_lock);
@@ -768,7 +745,7 @@ retry:
spin_unlock(&wb->list_lock);
}
- return wrote;
+ return nr_pages - work->nr_pages;
}
/*
--- linux-next.orig/mm/backing-dev.c 2011-05-04 21:30:22.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-05-04 21:49:07.000000000 +0800
@@ -262,14 +262,14 @@ int bdi_has_dirty_io(struct backing_dev_
static void bdi_flush_io(struct backing_dev_info *bdi)
{
- struct writeback_control wbc = {
+ struct wb_writeback_work work = {
.sync_mode = WB_SYNC_NONE,
.older_than_this = NULL,
.range_cyclic = 1,
- .nr_to_write = 1024,
+ .nr_pages = 1024,
};
- writeback_inodes_wb(&bdi->wb, &wbc);
+ writeback_inodes_wb(&bdi->wb, &work);
}
/*
--- linux-next.orig/mm/page-writeback.c 2011-05-04 21:30:22.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-05-04 21:57:25.000000000 +0800
@@ -494,10 +494,10 @@ static void balance_dirty_pages(struct a
return;
for (;;) {
- struct writeback_control wbc = {
+ struct wb_writeback_work work = {
.sync_mode = WB_SYNC_NONE,
.older_than_this = NULL,
- .nr_to_write = write_chunk,
+ .nr_pages = write_chunk,
.range_cyclic = 1,
};
@@ -562,15 +562,12 @@ static void balance_dirty_pages(struct a
* threshold otherwise wait until the disk writes catch
* up.
*/
- trace_wbc_balance_dirty_start(&wbc, bdi);
if (bdi_nr_reclaimable > bdi_thresh) {
- writeback_inodes_wb(&bdi->wb, &wbc);
- pages_written += write_chunk - wbc.nr_to_write;
- trace_wbc_balance_dirty_written(&wbc, bdi);
+ writeback_inodes_wb(&bdi->wb, &work);
+ pages_written += write_chunk - work.nr_pages;
if (pages_written >= write_chunk)
break; /* We've done our duty */
}
- trace_wbc_balance_dirty_wait(&wbc, bdi);
__set_current_state(TASK_UNINTERRUPTIBLE);
io_schedule_timeout(pause);
--- linux-next.orig/include/linux/writeback.h 2011-05-04 21:30:41.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-05-04 21:43:04.000000000 +0800
@@ -7,6 +7,15 @@
#include <linux/sched.h>
#include <linux/fs.h>
+/*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation. We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode. Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES 1024
+
struct backing_dev_info;
/*
@@ -29,7 +38,6 @@ struct writeback_control {
long nr_to_write; /* Write this many pages, and decrement
this for each page written */
long pages_skipped; /* Pages which were not written */
- long inodes_cleaned; /* # of inodes cleaned */
/*
* For a_ops->writepages(): is start or end are non-zero then this is
@@ -49,6 +57,23 @@ struct writeback_control {
};
/*
+ * Passed into wb_writeback(), essentially a subset of writeback_control
+ */
+struct wb_writeback_work {
+ long nr_pages;
+ struct super_block *sb;
+ enum writeback_sync_modes sync_mode;
+ unsigned long *older_than_this;
+ unsigned int tagged_sync:1;
+ unsigned int for_kupdate:1;
+ unsigned int range_cyclic:1;
+ unsigned int for_background:1;
+
+ struct list_head list; /* pending work list */
+ struct completion *done; /* set if the caller waits */
+};
+
+/*
* fs/fs-writeback.c
*/
struct bdi_writeback;
@@ -59,7 +84,7 @@ int writeback_inodes_sb_if_idle(struct s
int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
void sync_inodes_sb(struct super_block *);
void writeback_inodes_wb(struct bdi_writeback *wb,
- struct writeback_control *wbc);
+ struct wb_writeback_work *work);
long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
void wakeup_flusher_threads(long nr_pages);
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/6] writeback: make nr_to_write a per-file limit
2011-05-04 15:51 ` Wu Fengguang
@ 2011-05-04 16:18 ` Christoph Hellwig
2011-05-05 10:47 ` Wu Fengguang
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2011-05-04 16:18 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML,
linux-fsdevel@vger.kernel.org
> - move MAX_WRITEBACK_PAGES and wb_writeback_work definitions to writeback.h
I think it would be a good idea to keep this in fs/fs-writeback.c, which
means we'd need a small writeback_inodes_wb wrapper for
balance_dirty_pages and bdi_flush_io. But IIRC your tree already has
__writeback_inodes_wb for use in wb_writeback, so writeback_inodes_wb
could be that wrapper.
> + long write_chunk = MAX_WRITEBACK_PAGES;
> + long wrote = 0;
> + bool inode_cleaned = false;
> +
> + /*
> + * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
> + * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
> + * here avoids calling into writeback_inodes_wb() more than once.
> + *
> + * The intended call sequence for WB_SYNC_ALL writeback is:
> + *
> + * wb_writeback()
> + * writeback_sb_inodes() <== called only once
> + * write_cache_pages() <== called once for each inode
> + * (quickly) tag currently dirty pages
> + * (maybe slowly) sync all tagged pages
> + */
> + if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync)
> + write_chunk = LONG_MAX;
I think this would be easier to read if kept as an if / else clause
with the MAX_WRITEBACK_PAGES usage.
> + write_chunk = min(write_chunk, work->nr_pages);
Or in fact done here - for the WB_SYNC_ALL case LONG_MAX should
always be larger than work->nr_pages, so the whole thing could be
simplified to:
if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync)
write_chunk = LONG_MAX;
else
write_chunk = min(MAX_WRITEBACK_PAGES, work->nr_pages);
Other notes:
- older_than_this in writeback_control shouldn't be needed anymore
- is the early return for the mis-matching sb in writeback_sb_inodes
handled correctly? Before it had the special 0 return value, and
I'm not quite sure how that fits into your new enum scheme.
* Re: [PATCH 6/6] writeback: convert to relative older_than_this in trace events
2011-05-04 9:17 ` [PATCH 6/6] writeback: convert to relative older_than_this in trace events Wu Fengguang
@ 2011-05-04 22:23 ` Jan Kara
2011-05-05 12:33 ` Wu Fengguang
0 siblings, 1 reply; 22+ messages in thread
From: Jan Kara @ 2011-05-04 22:23 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML, linux-fsdevel
On Wed 04-05-11 17:17:13, Wu Fengguang wrote:
> This is a format change that could break established scripts.
>
> But hopefully it's more human friendly to show the relative values like
> "30000" (30s old) or "0" (all fresh ones) than long absolute jiffy
> numbers.
It's possibly more human friendly but probably also harder to handle in
scripts. I especially don't like the fact that the number will be changing
at different tracepoints - you'll then have to do some comparisons to find
out whether the cutoff itself changed or whether time simply passed.
It's not that hard to write a script doing absolute->relative
conversion (since you have time stamps in the traces) for those cases where
relative time is really what interests you. So do you have any particular
usecase in mind for the relative value?
Honza
>
> Note that the initial 30000 or 0 may increase over time as they are now
> relative numbers.
>
> CC: Jan Kara <jack@suse.cz>
> CC: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> include/trace/events/writeback.h | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> --- linux-next.orig/include/trace/events/writeback.h 2011-05-04 16:09:59.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h 2011-05-04 16:10:57.000000000 +0800
> @@ -115,7 +115,7 @@ DECLARE_EVENT_CLASS(wbc_class,
> __field(int, for_reclaim)
> __field(int, range_cyclic)
> __field(int, more_io)
> - __field(unsigned long, older_than_this)
> + __field(int, older)
> __field(long, range_start)
> __field(long, range_end)
> ),
> @@ -130,14 +130,15 @@ DECLARE_EVENT_CLASS(wbc_class,
> __entry->for_reclaim = wbc->for_reclaim;
> __entry->range_cyclic = wbc->range_cyclic;
> __entry->more_io = wbc->more_io;
> - __entry->older_than_this = wbc->older_than_this ?
> - *wbc->older_than_this : 0;
> + __entry->older = wbc->older_than_this ?
> + (jiffies - *wbc->older_than_this) * 1000 / HZ
> + : -1;
> __entry->range_start = (long)wbc->range_start;
> __entry->range_end = (long)wbc->range_end;
> ),
>
> TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> - "bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
> + "bgrd=%d reclm=%d cyclic=%d more=%d older=%d "
> "start=0x%lx end=0x%lx",
> __entry->name,
> __entry->nr_to_write,
> @@ -148,7 +149,7 @@ DECLARE_EVENT_CLASS(wbc_class,
> __entry->for_reclaim,
> __entry->range_cyclic,
> __entry->more_io,
> - __entry->older_than_this,
> + __entry->older,
> __entry->range_start,
> __entry->range_end)
> )
>
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 0/6] writeback fixes and trace events
2011-05-04 13:12 ` Theodore Tso
@ 2011-05-04 22:31 ` Jan Kara
0 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2011-05-04 22:31 UTC (permalink / raw)
To: Theodore Tso
Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Andrew Morton,
Jan Kara, LKML, linux-fsdevel@vger.kernel.org
On Wed 04-05-11 09:12:54, Ted Tso wrote:
> On May 4, 2011, at 6:06 AM, Dave Chinner wrote:
>
> > Does anyone actually testing filesystems use the -mm tree? I'm
> > pretty sure that no-one in the XFS world does, and I don't think
> > that any ext4 or btrfs folk do, either....
>
> I don't. I agree that it really would be great if there was a separate
> git tree for the writeback changes, since that way it becomes possible to
> test just the writeback changes, and not worry about other potential
> stability problems introduced by changes in the -mm tree....
OK. We'd still push changes to Linus via Andrew but have them also
accumulated in that tree for testing. So we'd have a branch with changes
sitting in -mm (likely to go to Linus in near future) and then possibly
other branches with things brewing for people to try out. Does that make
sense? I can setup that tree and maintain it. Or Fengguang, do you want
to do it?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 3/6] writeback: make nr_to_write a per-file limit
2011-05-04 16:18 ` Christoph Hellwig
@ 2011-05-05 10:47 ` Wu Fengguang
0 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2011-05-05 10:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML,
linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 12:18:16AM +0800, Christoph Hellwig wrote:
> > - move MAX_WRITEBACK_PAGES and wb_writeback_work definitions to writeback.h
>
> I think it would be a good idea to keep this in fs/fs-writeback.c, which
> means we'd need a small writeback_inodes_wb wrapper for
> balance_dirty_pages and bdi_flush_io. But IIRC your tree already has
> __writeback_inodes_wb for use in wb_writeback, so writeback_inodes_wb
> could be that wrapper.
No problem. We can later move bdi_flush_io() to fs-writeback.c and
kill the wbc=>work=>wbc double conversions once balance_dirty_pages()
switches to IO-less.
> > + long write_chunk = MAX_WRITEBACK_PAGES;
> > + long wrote = 0;
> > + bool inode_cleaned = false;
> > +
> > + /*
> > + * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
> > + * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
> > + * here avoids calling into writeback_inodes_wb() more than once.
> > + *
> > + * The intended call sequence for WB_SYNC_ALL writeback is:
> > + *
> > + * wb_writeback()
> > + * writeback_sb_inodes() <== called only once
> > + * write_cache_pages() <== called once for each inode
> > + * (quickly) tag currently dirty pages
> > + * (maybe slowly) sync all tagged pages
> > + */
> > + if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync)
> > + write_chunk = LONG_MAX;
>
> I think this would be easier to read if kept as an if / else clause
> with the MAX_WRITEBACK_PAGES usage.
>
> > + write_chunk = min(write_chunk, work->nr_pages);
>
> Or in fact done here - for the WB_SYNC_ALL case LONG_MAX should
> always be larger than work->nr_pages, so the whole thing could be
> simplified to:
>
> if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync)
> write_chunk = LONG_MAX;
> else
> write_chunk = min(MAX_WRITEBACK_PAGES, work->nr_pages);
OK, I created a standalone function writeback_chunk_size() for that.
Otherwise the comment block would look very weird deep inside the
function, deeply indented.
>
> Other notes:
>
> - older_than_this in writeback_control shouldn't be needed anymore
Yes. Good catch!
> - is the early return for the mis-matching sb in writeback_sb_inodes
> handled correctly? Before it had the special 0 return value, and
> I'm not quite sure how that fits into your new enum scheme.
Good question. This is the tricky point of the patch.
The original 0/1 return values are to tell __writeback_inodes_wb()
whether it's done with the current batch. Now __writeback_inodes_wb()
will do the test on itself (the below v2 patch).
The v1 patch does have a tricky problem regarding the early return:
if there are two super blocks that return WROTE_SOME_PAGES and
WROTE_NOTHING in turn, __writeback_inodes_wb() will wrongly pick up
the latter as the final return value.
I updated the patch to accumulate the number of pages/inodes cleaned
over multiple writeback_sb_inodes() invocations inside
__writeback_inodes_wb(). This way it can reliably return to
wb_writeback() to check the various termination conditions once enough
progress has been made.
Thanks,
Fengguang
---
Subject: writeback: make writeback_control.nr_to_write straight
Date: Wed May 04 19:54:37 CST 2011
As suggested by Christoph:
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.
struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.
It immediately cleans things up: suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes() it can always start with a clean
zero value.
It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
quota remains in wbc->nr_to_write.
change set:
- move writeback_control from wb_writeback() down to writeback_sb_inodes()
- change return value of writeback_sb_inodes()/__writeback_inodes_wb()
to the number of pages and/or inodes written
- move writeback_control.older_than_this to struct wb_writeback_work
- remove writeback_control.inodes_cleaned
- remove wbc_writeback_* trace events for now
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/btrfs/extent_io.c | 2
fs/fs-writeback.c | 184 +++++++++++++++--------------
include/linux/writeback.h | 3
include/trace/events/writeback.h | 6
mm/backing-dev.c | 1
mm/page-writeback.c | 1
6 files changed, 99 insertions(+), 98 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-05-05 18:24:21.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-05 18:24:53.000000000 +0800
@@ -30,11 +30,21 @@
#include "internal.h"
/*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation. We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode. Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES 1024L
+
+/*
* Passed into wb_writeback(), essentially a subset of writeback_control
*/
struct wb_writeback_work {
long nr_pages;
struct super_block *sb;
+ unsigned long *older_than_this;
enum writeback_sync_modes sync_mode;
unsigned int tagged_sync:1;
unsigned int for_kupdate:1;
@@ -463,7 +473,6 @@ writeback_single_inode(struct inode *ino
* No need to add it back to the LRU.
*/
list_del_init(&inode->i_wb_list);
- wbc->inodes_cleaned++;
}
}
inode_sync_complete(inode);
@@ -496,6 +505,31 @@ static bool pin_sb_for_writeback(struct
return false;
}
+static long writeback_chunk_size(struct wb_writeback_work *work)
+{
+ long pages;
+
+ /*
+ * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+ * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+ * here avoids calling into writeback_inodes_wb() more than once.
+ *
+ * The intended call sequence for WB_SYNC_ALL writeback is:
+ *
+ * wb_writeback()
+ * writeback_sb_inodes() <== called only once
+ * write_cache_pages() <== called once for each inode
+ * (quickly) tag currently dirty pages
+ * (maybe slowly) sync all tagged pages
+ */
+ if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync)
+ pages = LONG_MAX;
+ else
+ pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+
+ return pages;
+}
+
/*
* Write a portion of b_io inodes which belong to @sb.
*
@@ -503,18 +537,29 @@ static bool pin_sb_for_writeback(struct
* inodes. Otherwise write only ones which go sequentially
* in reverse order.
*
- * Return 1, if the caller writeback routine should be
- * interrupted. Otherwise return 0.
+ * Return the number of pages and/or inodes cleaned.
*/
-static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
- struct writeback_control *wbc, bool only_this_sb)
+static long writeback_sb_inodes(struct super_block *sb,
+ struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
{
+ struct writeback_control wbc = {
+ .sync_mode = work->sync_mode,
+ .tagged_sync = work->tagged_sync,
+ .for_kupdate = work->for_kupdate,
+ .for_background = work->for_background,
+ .range_cyclic = work->range_cyclic,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ };
+ long write_chunk;
+ long wrote = 0; /* count both pages and inodes */
+
while (!list_empty(&wb->b_io)) {
- long pages_skipped;
struct inode *inode = wb_inode(wb->b_io.prev);
if (inode->i_sb != sb) {
- if (only_this_sb) {
+ if (work->sb) {
/*
* We only want to write back data for this
* superblock, move all inodes not belonging
@@ -529,7 +574,7 @@ static int writeback_sb_inodes(struct su
* Bounce back to the caller to unpin this and
* pin the next superblock.
*/
- return 0;
+ break;
}
/*
@@ -543,34 +588,40 @@ static int writeback_sb_inodes(struct su
requeue_io(inode, wb);
continue;
}
-
__iget(inode);
+ write_chunk = writeback_chunk_size(work);
+ wbc.nr_to_write = write_chunk;
+ wbc.pages_skipped = 0;
+
+ writeback_single_inode(inode, wb, &wbc);
- pages_skipped = wbc->pages_skipped;
- writeback_single_inode(inode, wb, wbc);
- if (wbc->pages_skipped != pages_skipped) {
+ work->nr_pages -= write_chunk - wbc.nr_to_write;
+ wrote += write_chunk - wbc.nr_to_write;
+ if (wbc.pages_skipped) {
/*
* writeback is not making progress due to locked
* buffers. Skip this inode for now.
*/
redirty_tail(inode, wb);
- }
+ } else if (!(inode->i_state & I_DIRTY))
+ wrote++;
spin_unlock(&inode->i_lock);
spin_unlock(&wb->list_lock);
iput(inode);
cond_resched();
spin_lock(&wb->list_lock);
- if (wbc->nr_to_write <= 0)
- return 1;
+ if (wrote >= MAX_WRITEBACK_PAGES)
+ break;
+ if (work->nr_pages <= 0)
+ break;
}
- /* b_io is empty */
- return 1;
+ return wrote;
}
-static void __writeback_inodes_wb(struct bdi_writeback *wb,
- struct writeback_control *wbc)
+static long __writeback_inodes_wb(struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
{
- int ret = 0;
+ long wrote = 0;
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -580,33 +631,35 @@ static void __writeback_inodes_wb(struct
requeue_io(inode, wb);
continue;
}
- ret = writeback_sb_inodes(sb, wb, wbc, false);
+ wrote += writeback_sb_inodes(sb, wb, work);
drop_super(sb);
- if (ret)
+ if (wrote >= MAX_WRITEBACK_PAGES)
+ break;
+ if (work->nr_pages <= 0)
break;
}
/* Leave any unwritten inodes on b_io */
+ return wrote;
}
void writeback_inodes_wb(struct bdi_writeback *wb,
- struct writeback_control *wbc)
+ struct writeback_control *wbc)
{
+ struct wb_writeback_work work = {
+ .nr_pages = wbc->nr_to_write,
+ .sync_mode = wbc->sync_mode,
+ .range_cyclic = wbc->range_cyclic,
+ };
+
spin_lock(&wb->list_lock);
if (list_empty(&wb->b_io))
- queue_io(wb, wbc->older_than_this);
- __writeback_inodes_wb(wb, wbc);
+ queue_io(wb, NULL);
+ __writeback_inodes_wb(wb, &work);
spin_unlock(&wb->list_lock);
-}
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation. We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode. Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES 1024
+ wbc->nr_to_write = work.nr_pages;
+}
static inline bool over_bground_thresh(void)
{
@@ -636,42 +689,13 @@ static inline bool over_bground_thresh(v
static long wb_writeback(struct bdi_writeback *wb,
struct wb_writeback_work *work)
{
- struct writeback_control wbc = {
- .sync_mode = work->sync_mode,
- .tagged_sync = work->tagged_sync,
- .older_than_this = NULL,
- .for_kupdate = work->for_kupdate,
- .for_background = work->for_background,
- .range_cyclic = work->range_cyclic,
- };
+ long nr_pages = work->nr_pages;
unsigned long oldest_jif;
- long wrote = 0;
- long write_chunk = MAX_WRITEBACK_PAGES;
struct inode *inode;
-
- if (!wbc.range_cyclic) {
- wbc.range_start = 0;
- wbc.range_end = LLONG_MAX;
- }
-
- /*
- * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
- * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
- * here avoids calling into writeback_inodes_wb() more than once.
- *
- * The intended call sequence for WB_SYNC_ALL writeback is:
- *
- * wb_writeback()
- * writeback_sb_inodes() <== called only once
- * write_cache_pages() <== called once for each inode
- * (quickly) tag currently dirty pages
- * (maybe slowly) sync all tagged pages
- */
- if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
- write_chunk = LONG_MAX;
+ long progress;
oldest_jif = jiffies;
- wbc.older_than_this = &oldest_jif;
+ work->older_than_this = &oldest_jif;
for (;;) {
/*
@@ -700,27 +724,18 @@ static long wb_writeback(struct bdi_writ
if (work->for_kupdate || work->for_background) {
oldest_jif = jiffies -
msecs_to_jiffies(dirty_expire_interval * 10);
- wbc.older_than_this = &oldest_jif;
+ work->older_than_this = &oldest_jif;
}
- wbc.nr_to_write = write_chunk;
- wbc.pages_skipped = 0;
- wbc.inodes_cleaned = 0;
-
retry:
- trace_wbc_writeback_start(&wbc, wb->bdi);
spin_lock(&wb->list_lock);
if (list_empty(&wb->b_io))
- queue_io(wb, wbc->older_than_this);
+ queue_io(wb, work->older_than_this);
if (work->sb)
- writeback_sb_inodes(work->sb, wb, &wbc, true);
+ progress = writeback_sb_inodes(work->sb, wb, work);
else
- __writeback_inodes_wb(wb, &wbc);
+ progress = __writeback_inodes_wb(wb, work);
spin_unlock(&wb->list_lock);
- trace_wbc_writeback_written(&wbc, wb->bdi);
-
- work->nr_pages -= write_chunk - wbc.nr_to_write;
- wrote += write_chunk - wbc.nr_to_write;
/*
* Did we write something? Try for more
@@ -730,9 +745,7 @@ retry:
* mean the overall work is done. So we keep looping as long
* as made some progress on cleaning pages or inodes.
*/
- if (wbc.nr_to_write < write_chunk)
- continue;
- if (wbc.inodes_cleaned)
+ if (progress)
continue;
/*
* background writeback will start with expired inodes, and
@@ -741,10 +754,10 @@ retry:
* lists and cause trouble to the page reclaim.
*/
if (work->for_background &&
- wbc.older_than_this &&
+ work->older_than_this &&
list_empty(&wb->b_io) &&
list_empty(&wb->b_more_io)) {
- wbc.older_than_this = NULL;
+ work->older_than_this = NULL;
goto retry;
}
/*
@@ -760,7 +773,6 @@ retry:
spin_lock(&wb->list_lock);
if (!list_empty(&wb->b_more_io)) {
inode = wb_inode(wb->b_more_io.prev);
- trace_wbc_writeback_wait(&wbc, wb->bdi);
spin_lock(&inode->i_lock);
inode_wait_for_writeback(inode, wb);
spin_unlock(&inode->i_lock);
@@ -768,7 +780,7 @@ retry:
spin_unlock(&wb->list_lock);
}
- return wrote;
+ return nr_pages - work->nr_pages;
}
/*
--- linux-next.orig/include/linux/writeback.h 2011-05-05 18:24:21.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-05-05 18:24:53.000000000 +0800
@@ -24,12 +24,9 @@ enum writeback_sync_modes {
*/
struct writeback_control {
enum writeback_sync_modes sync_mode;
- unsigned long *older_than_this; /* If !NULL, only write back inodes
- older than this */
long nr_to_write; /* Write this many pages, and decrement
this for each page written */
long pages_skipped; /* Pages which were not written */
- long inodes_cleaned; /* # of inodes cleaned */
/*
* For a_ops->writepages(): is start or end are non-zero then this is
--- linux-next.orig/mm/backing-dev.c 2011-05-05 18:24:21.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-05-05 18:24:53.000000000 +0800
@@ -264,7 +264,6 @@ static void bdi_flush_io(struct backing_
{
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
- .older_than_this = NULL,
.range_cyclic = 1,
.nr_to_write = 1024,
};
--- linux-next.orig/mm/page-writeback.c 2011-05-05 18:24:21.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-05-05 18:24:53.000000000 +0800
@@ -496,7 +496,6 @@ static void balance_dirty_pages(struct a
for (;;) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
- .older_than_this = NULL,
.nr_to_write = write_chunk,
.range_cyclic = 1,
};
--- linux-next.orig/include/trace/events/writeback.h 2011-05-05 18:24:21.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2011-05-05 18:24:53.000000000 +0800
@@ -101,7 +101,6 @@ DECLARE_EVENT_CLASS(wbc_class,
__field(int, for_background)
__field(int, for_reclaim)
__field(int, range_cyclic)
- __field(unsigned long, older_than_this)
__field(long, range_start)
__field(long, range_end)
),
@@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class,
__entry->for_background = wbc->for_background;
__entry->for_reclaim = wbc->for_reclaim;
__entry->range_cyclic = wbc->range_cyclic;
- __entry->older_than_this = wbc->older_than_this ?
- *wbc->older_than_this : 0;
__entry->range_start = (long)wbc->range_start;
__entry->range_end = (long)wbc->range_end;
),
TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
- "bgrd=%d reclm=%d cyclic=%d older=0x%lx "
+ "bgrd=%d reclm=%d cyclic=%d "
"start=0x%lx end=0x%lx",
__entry->name,
__entry->nr_to_write,
@@ -132,7 +129,6 @@ DECLARE_EVENT_CLASS(wbc_class,
__entry->for_background,
__entry->for_reclaim,
__entry->range_cyclic,
- __entry->older_than_this,
__entry->range_start,
__entry->range_end)
)
--- linux-next.orig/fs/btrfs/extent_io.c 2011-05-05 18:24:21.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c 2011-05-05 18:24:53.000000000 +0800
@@ -2587,7 +2587,6 @@ int extent_write_full_page(struct extent
};
struct writeback_control wbc_writepages = {
.sync_mode = wbc->sync_mode,
- .older_than_this = NULL,
.nr_to_write = 64,
.range_start = page_offset(page) + PAGE_CACHE_SIZE,
.range_end = (loff_t)-1,
@@ -2620,7 +2619,6 @@ int extent_write_locked_range(struct ext
};
struct writeback_control wbc_writepages = {
.sync_mode = mode,
- .older_than_this = NULL,
.nr_to_write = nr_pages * 2,
.range_start = start,
.range_end = end + 1,
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 6/6] writeback: convert to relative older_than_this in trace events
2011-05-04 22:23 ` Jan Kara
@ 2011-05-05 12:33 ` Wu Fengguang
0 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2011-05-05 12:33 UTC (permalink / raw)
To: Jan Kara; +Cc: Andrew Morton, Dave Chinner, LKML, linux-fsdevel@vger.kernel.org
On Thu, May 05, 2011 at 06:23:33AM +0800, Jan Kara wrote:
> On Wed 04-05-11 17:17:13, Wu Fengguang wrote:
> > This is a format change that could break established scripts.
> >
> > But hopefully it's more human friendly to show the relative values like
> > "30000" (30s old) or "0" (all fresh ones) than long absolute jiffy
> > numbers.
> It's possibly more human friendly, but probably also harder to handle in
> scripts. I especially don't like that the number will be changing at
> different tracepoints - you'll then have to do some comparisons to find
> out whether just the timestamp has changed or whether time has simply
> passed. It's not that hard to write a script doing absolute->relative
> conversion (since you have timestamps in the traces) for those cases where
> relative time is really what interests you.
OK.
> So do you have any particular usecase in mind for the relative
> value?
I normally just look at the raw traces in which case the raw jiffies
mean nothing to me. Or sometimes plot the relative dirty ages with
scripts.
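As a rough illustration of the postprocessing Jan describes, here is a
minimal sketch that appends a relative age next to the absolute older=
jiffies value. The trace line format, the HZ value, and the way
now_jiffies is obtained are all assumptions for illustration, not the
actual tracepoint output:

```python
import re

HZ = 250  # assumed; must match the traced kernel's CONFIG_HZ

# hypothetical wbc_writeback_* line carrying an absolute older= field
OLDER = re.compile(r"older=0x(?P<older>[0-9a-fA-F]+)")

def append_age(line, now_jiffies):
    """Append the age in ms after the absolute older= value.

    now_jiffies, the jiffies counter at the line's own timestamp, would
    in a real script be derived from the per-line trace timestamps.
    """
    m = OLDER.search(line)
    if m is None:
        return line
    age_ms = (now_jiffies - int(m.group("older"), 16)) * 1000 // HZ
    return "%s age=%dms" % (line, age_ms)

# 0x2710 jiffies, queried 30 seconds' worth of jiffies later
line = "bdi 8:0: towrt=1024 skip=0 older=0x2710"
print(append_age(line, 0x2710 + 30 * HZ))  # appends age=30000ms
```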
Thanks,
Fengguang
> > Note that the initial 30000 or 0 may increase over time as they are now
> > relative numbers.
> >
> > CC: Jan Kara <jack@suse.cz>
> > CC: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > include/trace/events/writeback.h | 11 ++++++-----
> > 1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > --- linux-next.orig/include/trace/events/writeback.h 2011-05-04 16:09:59.000000000 +0800
> > +++ linux-next/include/trace/events/writeback.h 2011-05-04 16:10:57.000000000 +0800
> > @@ -115,7 +115,7 @@ DECLARE_EVENT_CLASS(wbc_class,
> > __field(int, for_reclaim)
> > __field(int, range_cyclic)
> > __field(int, more_io)
> > - __field(unsigned long, older_than_this)
> > + __field(int, older)
> > __field(long, range_start)
> > __field(long, range_end)
> > ),
> > @@ -130,14 +130,15 @@ DECLARE_EVENT_CLASS(wbc_class,
> > __entry->for_reclaim = wbc->for_reclaim;
> > __entry->range_cyclic = wbc->range_cyclic;
> > __entry->more_io = wbc->more_io;
> > - __entry->older_than_this = wbc->older_than_this ?
> > - *wbc->older_than_this : 0;
> > + __entry->older = wbc->older_than_this ?
> > + (jiffies - *wbc->older_than_this) * 1000 / HZ
> > + : -1;
> > __entry->range_start = (long)wbc->range_start;
> > __entry->range_end = (long)wbc->range_end;
> > ),
> >
> > TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> > - "bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
> > + "bgrd=%d reclm=%d cyclic=%d more=%d older=%d "
> > "start=0x%lx end=0x%lx",
> > __entry->name,
> > __entry->nr_to_write,
> > @@ -148,7 +149,7 @@ DECLARE_EVENT_CLASS(wbc_class,
> > __entry->for_reclaim,
> > __entry->range_cyclic,
> > __entry->more_io,
> > - __entry->older_than_this,
> > + __entry->older,
> > __entry->range_start,
> > __entry->range_end)
> > )
> >
> >
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
* [PATCH 5/6] writeback: trace event writeback_queue_io (v2)
2011-05-04 9:17 ` [PATCH 5/6] writeback: trace event writeback_queue_io Wu Fengguang
@ 2011-05-05 16:37 ` Wu Fengguang
2011-05-05 17:26 ` Jan Kara
0 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2011-05-05 16:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jan Kara, Dave Chinner, LKML, linux-fsdevel@vger.kernel.org
[changes from v1: due to concerns about the relative older_than_this,
I added the raw older_than_this to the trace output. Hope the added
overhead is not a big concern.]
Note that it adds a little overhead to account for the moved/enqueued
inodes from b_dirty to b_io. The "moved" accounting may later be used to
limit the number of inodes that can be moved in one shot, in order to
keep the spinlock hold time under control.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 14 ++++++++++----
include/trace/events/writeback.h | 25 +++++++++++++++++++++++++
2 files changed, 35 insertions(+), 4 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-05-05 23:53:33.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-05 23:53:34.000000000 +0800
@@ -258,15 +258,16 @@ static bool inode_dirtied_after(struct i
/*
* Move expired dirty inodes from @delaying_queue to @dispatch_queue.
*/
-static void move_expired_inodes(struct list_head *delaying_queue,
+static int move_expired_inodes(struct list_head *delaying_queue,
struct list_head *dispatch_queue,
- unsigned long *older_than_this)
+ unsigned long *older_than_this)
{
LIST_HEAD(tmp);
struct list_head *pos, *node;
struct super_block *sb = NULL;
struct inode *inode;
int do_sb_sort = 0;
+ int moved = 0;
while (!list_empty(delaying_queue)) {
inode = wb_inode(delaying_queue->prev);
@@ -277,12 +278,13 @@ static void move_expired_inodes(struct l
do_sb_sort = 1;
sb = inode->i_sb;
list_move(&inode->i_wb_list, &tmp);
+ moved++;
}
/* just one sb in list, splice to dispatch_queue and we're done */
if (!do_sb_sort) {
list_splice(&tmp, dispatch_queue);
- return;
+ goto out;
}
/* Move inodes from one superblock together */
@@ -294,6 +296,8 @@ static void move_expired_inodes(struct l
list_move(&inode->i_wb_list, dispatch_queue);
}
}
+out:
+ return moved;
}
/*
@@ -309,9 +313,11 @@ static void move_expired_inodes(struct l
*/
static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
{
+ int moved;
assert_spin_locked(&wb->list_lock);
list_splice_init(&wb->b_more_io, &wb->b_io);
- move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+ moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+ trace_writeback_queue_io(wb, older_than_this, moved);
}
static int write_inode(struct inode *inode, struct writeback_control *wbc)
--- linux-next.orig/include/trace/events/writeback.h 2011-05-05 23:53:33.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2011-05-06 00:14:09.000000000 +0800
@@ -158,6 +158,31 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_writt
DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
DEFINE_WBC_EVENT(wbc_writepage);
+TRACE_EVENT(writeback_queue_io,
+ TP_PROTO(struct bdi_writeback *wb,
+ unsigned long *older_than_this,
+ int moved),
+ TP_ARGS(wb, older_than_this, moved),
+ TP_STRUCT__entry(
+ __array(char, name, 32)
+ __field(unsigned long, older)
+ __field(long, age)
+ __field(int, moved)
+ ),
+ TP_fast_assign(
+ strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
+ __entry->older = older_than_this ? *older_than_this : 0;
+ __entry->age = older_than_this ?
+ (jiffies - *older_than_this) * 1000 / HZ : -1;
+ __entry->moved = moved;
+ ),
+ TP_printk("bdi %s: older=%lu age=%ld enqueue=%d",
+ __entry->name,
+ __entry->older, /* older_than_this in jiffies */
+ __entry->age, /* older_than_this in relative milliseconds */
+ __entry->moved)
+);
+
DECLARE_EVENT_CLASS(writeback_congest_waited_template,
TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
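The age field emitted by the new writeback_queue_io event comes from the
jiffies-to-milliseconds conversion in TP_fast_assign above. A minimal
userspace sketch of the same arithmetic (HZ=250 is an assumed config
value here; a NULL older_than_this maps to -1, matching the patch):

```python
HZ = 250  # assumed CONFIG_HZ; the real value depends on the kernel build

def queue_io_age_ms(now_jiffies, older_than_this):
    """Mirror of the trace event's age computation: relative age in ms,
    or -1 when no older_than_this threshold was given (NULL pointer)."""
    if older_than_this is None:
        return -1
    return (now_jiffies - older_than_this) * 1000 // HZ

# a threshold set 30 seconds' worth of jiffies ago -> 30000 ms
print(queue_io_age_ms(100000, 100000 - 30 * HZ))  # → 30000
print(queue_io_age_ms(100000, None))              # → -1
```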
* Re: [PATCH 5/6] writeback: trace event writeback_queue_io (v2)
2011-05-05 16:37 ` [PATCH 5/6] writeback: trace event writeback_queue_io (v2) Wu Fengguang
@ 2011-05-05 17:26 ` Jan Kara
0 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2011-05-05 17:26 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Dave Chinner, LKML,
linux-fsdevel@vger.kernel.org
On Fri 06-05-11 00:37:30, Wu Fengguang wrote:
> [changes from v1: due to concerns about the relative older_than_this,
> I added the raw older_than_this to the trace output. Hope the added
> overhead is not a big concern.]
>
> Note that it adds a little overhead to account for the moved/enqueued
> inodes from b_dirty to b_io. The "moved" accounting may later be used to
> limit the number of inodes that can be moved in one shot, in order to
> keep the spinlock hold time under control.
This is fine by me.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
end of thread, other threads: [~2011-05-05 17:26 UTC | newest]
Thread overview: 22+ messages
2011-05-04 9:17 [PATCH 0/6] writeback fixes and trace events Wu Fengguang
2011-05-04 9:17 ` [PATCH 1/6] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
2011-05-04 9:17 ` [PATCH 2/6] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
2011-05-04 9:17 ` [PATCH 3/6] writeback: make nr_to_write a per-file limit Wu Fengguang
2011-05-04 9:42 ` Christoph Hellwig
2011-05-04 11:52 ` Wu Fengguang
2011-05-04 15:51 ` Wu Fengguang
2011-05-04 16:18 ` Christoph Hellwig
2011-05-05 10:47 ` Wu Fengguang
2011-05-04 9:17 ` [PATCH 4/6] writeback: trace event writeback_single_inode Wu Fengguang
2011-05-04 9:17 ` [PATCH 5/6] writeback: trace event writeback_queue_io Wu Fengguang
2011-05-05 16:37 ` [PATCH 5/6] writeback: trace event writeback_queue_io (v2) Wu Fengguang
2011-05-05 17:26 ` Jan Kara
2011-05-04 9:17 ` [PATCH 6/6] writeback: convert to relative older_than_this in trace events Wu Fengguang
2011-05-04 22:23 ` Jan Kara
2011-05-05 12:33 ` Wu Fengguang
2011-05-04 9:46 ` [PATCH 0/6] writeback fixes and " Christoph Hellwig
2011-05-04 9:56 ` Wu Fengguang
2011-05-04 10:06 ` Dave Chinner
2011-05-04 11:18 ` Christoph Hellwig
2011-05-04 13:12 ` Theodore Tso
2011-05-04 22:31 ` Jan Kara