From: Dave Chinner <david@fromorbit.com>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
Mel Gorman <mel@linux.vnet.ibm.com>, Mel Gorman <mel@csn.ul.ie>,
Itaru Kitayama <kitayama@cl.bb4u.ne.jp>,
Minchan Kim <minchan.kim@gmail.com>,
Linux Memory Management List <linux-mm@kvack.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
"Li, Shaohua" <shaohua.li@intel.com>
Subject: Re: [RFC][PATCH v2] writeback: limit number of moved inodes in queue_io()
Date: Sat, 7 May 2011 09:06:19 +1000
Message-ID: <20110506230619.GG26837@dastard>
In-Reply-To: <20110506100648.GA3435@localhost>
On Fri, May 06, 2011 at 06:06:48PM +0800, Wu Fengguang wrote:
> On Fri, May 06, 2011 at 04:42:38PM +0800, Wu Fengguang wrote:
> > > patched trace-tar-dd-ext4-2.6.39-rc3+
> >
> > > flush-8:0-3048 [004] 1929.981734: writeback_queue_io: bdi 8:0: older=4296600898 age=2 enqueue=13227
> >
> > > vanilla trace-tar-dd-ext4-2.6.39-rc3
> >
> > > flush-8:0-2911 [004] 77.158312: writeback_queue_io: bdi 8:0: older=0 age=-1 enqueue=18938
> >
> > > flush-8:0-2911 [000] 82.461064: writeback_queue_io: bdi 8:0: older=0 age=-1 enqueue=6957
> >
> > It looks like too much to move 13227 and 18938 inodes at once. So I
> > tried arbitrarily limiting the maximum number moved to 1000, and it
> > helps reduce the lock hold time and contention a lot.
>
> Oh, it seems 1000 is too small, at least for this workload; it hurts
> the dd+tar+sync total elapsed time.
>
> no limit:
> avg 167.486
> stddev 8.996
> limit=1000:
> avg 171.222
> stddev 5.588
> limit=3000:
> avg 165.335
> stddev 5.503
>
> So use 3000 as the new limit.
I don't think that's even enough. The number is going to be workload
dependent, and while a limit might be a good idea, I don't think it
can be chosen from just one simple benchmark. e.g. what does it do to
the performance of workloads that create tens of thousands of small
dirty files a second?
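For reference, the change being debated boils down to something like
this (an untested sketch against the 2.6.39-era move_expired_inodes()
loop in fs/fs-writeback.c; MAX_MOVE_INODES is a made-up name for
whatever cap gets chosen):

#define MAX_MOVE_INODES	3000	/* hypothetical cap under discussion */

	while (!list_empty(delaying_queue)) {
		inode = wb_inode(delaying_queue->prev);
		if (older_than_this &&
		    inode_dirtied_after(inode, *older_than_this))
			break;	/* remaining inodes are too young */
		if (moved >= MAX_MOVE_INODES)
			break;	/* bound the time the list lock is held */
		list_move(&inode->i_wb_list, &tmp);
		moved++;
	}

The smaller the cap, the shorter each lock hold, but the more passes
queue_io() has to make to drain the same backlog.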
....
> class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
> vanilla 2.6.39-rc3:
> inode_wb_list_lock:  2063  2065  0.12  2648.66  5948.99  27475  943778  0.09  2704.76  498340.24
I wouldn't consider this a contended lock at all on this workload:
that's roughly 2000 contentions over nearly a million acquisitions, a
contention rate of about 0.2%.
FWIW, my profiles on sustained 8-way small file creation workloads
on ext4 over tens of millions of inodes show a 0.1% contention rate
for the inode_wb_list_lock. That compares to a 2% contention rate
for the inode_lru_lock, a 4% contention rate on the
inode_sb_list_lock and a 6% contention rate on the inode_hash_lock.
So really, the inode_wb_list_lock is not the lock we need to spend
effort on optimising to the nth degree right now...
......
> limit=1000:
>
> dd+tar+sync total elapsed time (10 runs):
> avg 171.222
> stddev 5.588
>
> &(&wb->list_lock)->rlock:  842  842  0.14  101.10  1013.34  20489  970892  0.09  234.11  509829.79
.....
> limit=3000:
>
> dd+tar+sync total elapsed time (10 runs):
> avg 165.335
> stddev 5.503
>
> &(&wb->list_lock)->rlock:  1088  1092  0.11  245.08  3268.75  21124  1718636  0.09  384.53  849827.20
So, from this, acquisitions have nearly doubled and the total lock
hold time has almost doubled as well. That looks like a fair bit of
introduced inefficiency. What does it do to the CPU time consumed by
queue_io()? (perf top is your friend.)
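e.g. something like this, run while the test is going (assuming your
perf supports system-wide call-graph recording):

	perf record -g -a -- sleep 30	# profile everything during the workload
	perf report --stdio | grep queue_io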
FYI, queue_io() is already a _massive_ CPU hog. See commit dcd79a1
("xfs: don't use vfs writeback for pure metadata modifications") for
how XFS tries to avoid putting dirty inodes on the list if at all
possible:
Under heavy multi-way parallel create workloads, the VFS
struggles to write back all the inodes that have been changed in
age order. The bdi flusher thread becomes CPU bound, spending
85% of its time in the VFS code, mostly traversing the
superblock dirty inode list to separate dirty inodes old enough
to flush.
We already keep an index of all metadata changes in age order -
in the AIL - and continued log pressure will do age ordered
writeback without any extra overhead at all. If there is no
pressure on the log, the xfssyncd will periodically write back
metadata in ascending disk address offset order so will be very
efficient.
.....
We're moving towards only tracking inodes with dirty pages in the
b_dirty list for XFS because this time-based expiry is so
inefficient. So anything that reduces the efficiency of
queue_io()....
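To make the idea concrete, the dispatch we're heading for looks
roughly like this conceptual sketch (not the actual XFS code;
fs_metadata_only_change() and fs_log_inode() are hypothetical
helpers):

static void fs_dirty_inode(struct inode *inode)
{
	if (fs_metadata_only_change(inode)) {
		/*
		 * Pure metadata change: track it in the filesystem's
		 * own age-ordered log (the AIL, for XFS) so the inode
		 * never lands on the VFS b_dirty list and queue_io()
		 * never has to walk past it.
		 */
		fs_log_inode(inode);
		return;
	}
	/* inodes with dirty pages still use the normal VFS tracking */
	mark_inode_dirty(inode);
}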
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com