From: Dave Chinner <david@fromorbit.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>, Jan Kara <jack@suse.cz>,
	Christoph Hellwig <hch@lst.de>, Theodore Ts'o <tytso@mit.edu>,
	Chris Mason <chris.mason@oracle.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Mel Gorman <mel@csn.ul.ie>, Rik van Riel <riel@redhat.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	linux-mm <linux-mm@kvack.org>,
	linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 00/13] IO-less dirty throttling v2
Date: Thu, 18 Nov 2010 14:21:41 +1100
Message-ID: <20101118032141.GP13830@dastard>
In-Reply-To: <20101117180912.38541ca4.akpm@linux-foundation.org>

On Wed, Nov 17, 2010 at 06:09:12PM -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 13:06:40 +1100 Dave Chinner <david@fromorbit.com> wrote:
>
> > On Wed, Nov 17, 2010 at 03:03:30PM -0800, Andrew Morton wrote:
> > > On Wed, 17 Nov 2010 12:27:20 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >
> > > > On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
> > > > improves IO throughput from 38MB/s to 42MB/s.
> > >
> > > The changes in CPU consumption are remarkable. I've looked through the
> > > changelogs but cannot find mention of where all that time was being
> > > spent?
> >
> > In the writeback path, mostly because every CPU is trying to run
> > writeback at the same time and causing contention on locks and
> > shared structures in the writeback path. That no longer happens
> > because writeback is only happening from one thread instead of from
> > all CPUs at once.
>
> It'd be nice to see this quantified. Partly because handing things
> over to kernel threads incurs extra overhead - scheduling cost and CPU
> cache footprint.

Sure, but in this case, the scheduling cost is much lower than
actually doing writeback of 1500 pages. The CPU cache footprint of
the syscall is also greatly reduced because we don't go down
the writeback path. That shows up in the fact that the "app
overhead" measured by fs_mark goes down significantly with this
patch series (30-50% reduction) - it's doing the same work, but it's
taking much less wall time....

And if you are after lock contention numbers, I have quantified it,
though I do not have saved lock_stat numbers at hand. Running the
current inode_lock breakup patchset and the fs_mark workload (8-way
parallel create of 1 byte files), lock_stat shows the
inode_wb_list_lock as the hottest lock in the system (more trafficked
and much more contended than the dcache_lock), along with the
inode->i_lock being the most trafficked.

Running `perf top -p <pid of bdi-flusher>` showed it spending 30-40%
of its time in __ticket_spin_lock. I saw the same thing with every
fs_mark process, each also showing 30-40% of its time in
__ticket_spin_lock. Every process also showed a good chunk of time
in the writeback path. Overall, the fs_mark processes showed a CPU
consumption of ~620% CPU, with the bdi-flusher at 80% of a CPU and
kswapd at 80% of a CPU.

With the patchset, all that spin lock time is gone from the profiles
(down to about 2%), as is the writeback path (except for the
bdi-flusher, which is all writeback path). Overall, we have fs_mark
processes showing 250% CPU, the bdi-flusher at 80% of a CPU, and
kswapd at about 20% of a CPU, with over 400% idle time.

IOWs, we've traded off 3-4 CPUs' worth of spinlock contention and a
flusher thread running at 80% CPU for a flusher thread that runs at
80% CPU doing the same amount of work. To me, that says the cost of
scheduling is orders of magnitude lower than the cost of the current
code...
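
As a rough sanity check on the CPU figures quoted above, the budget can be tallied as in the sketch below. The percentages are taken from the message; the 800% total is an assumption (an 8-CPU machine, inferred from the 8-way workload and the "over 400% idle" figure, not stated outright), and the names `before`, `after` and `summarize` are purely illustrative.

```python
# CPU usage quoted in the message, in percent of one CPU.
before = {"fs_mark": 620, "bdi-flusher": 80, "kswapd": 80}
after = {"fs_mark": 250, "bdi-flusher": 80, "kswapd": 20}

# ASSUMPTION: 8 CPUs (800% total). The message only says "8-way".
TOTAL_CPU_PERCENT = 800


def summarize(usage, total=TOTAL_CPU_PERCENT):
    """Return total busy and idle CPU percentages for one measurement."""
    busy = sum(usage.values())
    return {"busy": busy, "idle": total - busy}


b = summarize(before)
a = summarize(after)

# CPU time no longer consumed once the patchset is applied.
saved = b["busy"] - a["busy"]
# The part of that saving that comes from the fs_mark processes alone.
saved_fs_mark = before["fs_mark"] - after["fs_mark"]

print(f"before: busy {b['busy']}%, idle {b['idle']}%")
print(f"after:  busy {a['busy']}%, idle {a['idle']}%")
print(f"saved:  {saved}% total, {saved_fs_mark}% from fs_mark")
```

Under that assumption, the idle time with the patchset comes out at 450%, consistent with "over 400% idle", and the fs_mark processes alone give back 370%, which lines up with the "3-4 CPUs' worth" estimate while the flusher stays at 80% in both cases.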
> But mainly because we're taking the work accounting away from the user
> who caused it and crediting it to the kernel thread instead, and that's
> an actively *bad* thing to do.

The current foreground writeback is doing work on behalf of the
system (i.e. doing background writeback), yet is crediting it
to the user process. That seems wrong to me; it's hiding the
overhead of system tasks in user processes.

IMO, time spent doing background writeback should not be credited
to user processes - writeback caching is a function of the OS and
its overhead should be accounted as such. Indeed, nobody has
realised (until now) just how inefficient it really is because
the overhead is mostly hidden in user process system
time.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com