From: Jan Kara <jack@suse.cz>
To: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tejun Heo <tj@kernel.org>, Jan Kara <jack@suse.cz>,
vgoyal@redhat.com, Jens Axboe <axboe@kernel.dk>,
linux-mm@kvack.org, sjayaraman@suse.com, andrea@betterlinux.com,
jmoyer@redhat.com, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com,
lizefan@huawei.com, containers@lists.linux-foundation.org,
cgroups@vger.kernel.org, ctalbott@google.com, rni@google.com,
lsf@lists.linux-foundation.org
Subject: Re: [RFC] writeback and cgroup
Date: Thu, 19 Apr 2012 22:26:35 +0200 [thread overview]
Message-ID: <20120419202635.GA4795@quack.suse.cz> (raw)
In-Reply-To: <20120419142343.GA12684@localhost>
On Thu 19-04-12 22:23:43, Wu Fengguang wrote:
> For one instance, splitting the request queues will give rise to
> PG_writeback pages. Those pages have been the biggest source of
> latency issues in the various parts of the system.
Well, if we allow more requests to be in flight in total then yes, number
of PG_Writeback pages can be higher as well.
> It's not uncommon for me to see filesystems sleep on PG_writeback
> pages during heavy writeback, within some lock or transaction, which in
> turn stall many tasks that try to do IO or merely dirty some page in
> memory. Random writes are especially susceptible to such stalls. The
> stable page feature also vastly increase the chances of stalls by
> locking the writeback pages.
>
> Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> the case of direct reclaim, it means blocking random tasks that are
> allocating memory in the system.
>
> PG_writeback pages are much worse than PG_dirty pages in that they are
> not movable. This makes a big difference for high-order page allocations.
> To make room for a 2MB huge page, vmscan has the option to migrate
> PG_dirty pages, but for PG_writeback it has no better choices than to
> wait for IO completion.
>
> The difficulty of THP allocation goes up *exponentially* with the
> number of PG_writeback pages. Assume PG_writeback pages are randomly
> distributed in the physical memory space. Then we have formula
>
> P(reclaimable for THP) = 1 - P(hit PG_writeback)^256
Well, this implicitely assumes that PG_Writeback pages are scattered
across memory uniformly at random. I'm not sure to which extent this is
true... Also as a nitpick, this isn't really an exponential growth since
the exponent is fixed (256 - actually it should be 512, right?). It's just
a polynomial with a big exponent. But sure, growth in number of PG_Writeback
pages will cause relatively steep drop in the number of available huge
pages.
...
> It's worth to note that running multiple flusher threads per bdi means
> not only disk seeks for spin disks, smaller IO size for SSD, but also
> lock contentions and cache bouncing for metadata heavy workloads and
> fast storage.
Well, this heavily depends on particular implementation (and chosen
data structures). But yes, we should have that in mind.
...
> > > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > > It's always there doing 1:1 proportional throttling. Then you try to
> > > kick in to add *double* throttling in block/cfq layer. Now the low
> > > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > > from its balanced state, leading to large fluctuations and program
> > > stalls.
> >
> > Just do the same 1:1 inside each cgroup.
>
> Sure. But the ratio mismatch I'm talking about is inter-cgroup.
> For example there are only 2 dd tasks doing buffered writes in the
> system. Now consider the mismatch that cfq is dispatching their IO
> requests at 10:1 weights, while balance_dirty_pages() is throttling
> the dd tasks at 1:1 equal split because it's not aware of the cgroup
> weights.
>
> What will happen in the end? The 1:1 ratio imposed by
> balance_dirty_pages() will take effect and the dd tasks will progress
> at the same pace. The cfq weights will be defeated because the async
> queue for the second dd (and cgroup) constantly runs empty.
Yup. This just shows that you have to have per-cgroup dirty limits. Once
you have those, things start working again.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2012-04-19 20:27 UTC|newest]
Thread overview: 81+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-04-03 18:36 [RFC] writeback and cgroup Tejun Heo
2012-04-04 14:51 ` Vivek Goyal
2012-04-04 15:36 ` [Lsf] " Steve French
2012-04-04 18:56 ` Tejun Heo
2012-04-04 19:19 ` Vivek Goyal
2012-04-25 8:47 ` Suresh Jayaraman
2012-04-04 18:49 ` Tejun Heo
2012-04-04 19:23 ` [Lsf] " Steve French
2012-04-14 12:15 ` Peter Zijlstra
2012-04-04 20:32 ` Vivek Goyal
2012-04-04 23:02 ` Tejun Heo
2012-04-05 16:38 ` Tejun Heo
2012-04-05 17:13 ` Vivek Goyal
2012-04-14 11:53 ` [Lsf] " Peter Zijlstra
2012-04-07 8:00 ` Jan Kara
2012-04-10 16:23 ` [Lsf] " Steve French
2012-04-10 18:16 ` Vivek Goyal
2012-04-10 18:06 ` Vivek Goyal
2012-04-10 21:05 ` Jan Kara
2012-04-10 21:20 ` Vivek Goyal
2012-04-10 22:24 ` Jan Kara
2012-04-11 15:40 ` Vivek Goyal
2012-04-11 15:45 ` Vivek Goyal
2012-04-11 17:05 ` Jan Kara
2012-04-11 17:23 ` Vivek Goyal
2012-04-11 19:44 ` Jan Kara
2012-04-17 21:48 ` Tejun Heo
2012-04-18 18:18 ` Vivek Goyal
2012-04-11 19:22 ` Jan Kara
2012-04-12 20:37 ` Vivek Goyal
2012-04-12 20:51 ` Tejun Heo
2012-04-14 14:36 ` Fengguang Wu
2012-04-16 14:57 ` Vivek Goyal
2012-04-24 11:33 ` Fengguang Wu
2012-04-24 14:56 ` Jan Kara
2012-04-24 15:58 ` Vivek Goyal
2012-04-25 2:42 ` Fengguang Wu
2012-04-25 3:16 ` Fengguang Wu
2012-04-25 9:01 ` Jan Kara
2012-04-25 12:05 ` Fengguang Wu
2012-04-15 11:37 ` [Lsf] " Peter Zijlstra
2012-04-17 22:01 ` Tejun Heo
2012-04-18 6:30 ` Jan Kara
2012-04-14 12:25 ` [Lsf] " Peter Zijlstra
2012-04-16 12:54 ` Vivek Goyal
2012-04-16 13:07 ` Fengguang Wu
2012-04-16 14:19 ` Fengguang Wu
2012-04-16 15:52 ` Vivek Goyal
2012-04-17 2:14 ` Fengguang Wu
2012-04-04 17:51 ` Fengguang Wu
2012-04-04 18:35 ` Vivek Goyal
2012-04-04 21:42 ` Fengguang Wu
2012-04-05 15:10 ` Vivek Goyal
2012-04-06 0:32 ` Fengguang Wu
2012-04-04 19:33 ` Tejun Heo
2012-04-04 20:18 ` Vivek Goyal
2012-04-05 16:31 ` Tejun Heo
2012-04-05 17:09 ` Vivek Goyal
2012-04-06 9:59 ` Fengguang Wu
2012-04-17 22:38 ` Tejun Heo
2012-04-19 14:23 ` Fengguang Wu
2012-04-19 18:31 ` Vivek Goyal
2012-04-20 12:45 ` Fengguang Wu
2012-04-20 19:29 ` Vivek Goyal
2012-04-20 21:33 ` Tejun Heo
2012-04-22 14:26 ` Fengguang Wu
2012-04-23 12:30 ` Vivek Goyal
2012-04-23 16:04 ` Tejun Heo
2012-04-19 20:26 ` Jan Kara [this message]
2012-04-20 13:34 ` Fengguang Wu
2012-04-20 19:08 ` Tejun Heo
2012-04-22 14:46 ` Fengguang Wu
2012-04-23 16:56 ` Tejun Heo
2012-04-24 7:58 ` Fengguang Wu
2012-04-25 15:47 ` Tejun Heo
2012-04-23 9:14 ` Jan Kara
2012-04-23 10:24 ` Fengguang Wu
2012-04-23 12:42 ` Jan Kara
2012-04-23 14:31 ` Fengguang Wu
2012-04-18 6:57 ` Jan Kara
2012-04-18 7:58 ` Fengguang Wu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120419202635.GA4795@quack.suse.cz \
--to=jack@suse.cz \
--cc=andrea@betterlinux.com \
--cc=axboe@kernel.dk \
--cc=cgroups@vger.kernel.org \
--cc=containers@lists.linux-foundation.org \
--cc=ctalbott@google.com \
--cc=fengguang.wu@intel.com \
--cc=jmoyer@redhat.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lizefan@huawei.com \
--cc=lsf@lists.linux-foundation.org \
--cc=rni@google.com \
--cc=sjayaraman@suse.com \
--cc=tj@kernel.org \
--cc=vgoyal@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).