From: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
To: Fengguang Wu <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>,
ctalbott-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>,
rni-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
sjayaraman-IBi9RG/b67k@public.gmane.org,
lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Subject: Re: [RFC] writeback and cgroup
Date: Mon, 23 Apr 2012 11:14:32 +0200
Message-ID: <20120423091432.GC6512@quack.suse.cz>
In-Reply-To: <20120420133441.GA7035@localhost>
On Fri 20-04-12 21:34:41, Wu Fengguang wrote:
> On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote:
> > > It's not uncommon for me to see filesystems sleep on PG_writeback
> > > pages during heavy writeback, within some lock or transaction, which in
> > > turn stall many tasks that try to do IO or merely dirty some page in
> > > memory. Random writes are especially susceptible to such stalls. The
> > > stable page feature also vastly increases the chances of stalls by
> > > locking the writeback pages.
> > >
> > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> > > the case of direct reclaim, it means blocking random tasks that are
> > > allocating memory in the system.
> > >
> > > PG_writeback pages are much worse than PG_dirty pages in that they are
> > > not movable. This makes a big difference for high-order page allocations.
> > > To make room for a 2MB huge page, vmscan has the option to migrate
> > > PG_dirty pages, but for PG_writeback it has no better choices than to
> > > wait for IO completion.
> > >
> > > The difficulty of THP allocation goes up *exponentially* with the
> > > number of PG_writeback pages. Assume PG_writeback pages are randomly
> > > distributed in the physical memory space. Then we have the formula
> > >
> > > P(reclaimable for THP) = 1 - P(hit PG_writeback)^256
> > Well, this implicitly assumes that PG_Writeback pages are scattered
> > across memory uniformly at random. I'm not sure to what extent this is
> > true...
>
> Yeah, when describing the problem I was also thinking about the
> possibilities of optimization (it would be a very good general
> improvement). Or maybe Mel already has some solutions :)
>
> > Also as a nitpick, this isn't really an exponential growth since
> > the exponent is fixed (256 - actually it should be 512, right?). It's just
>
> Right, 512 4k pages to form one x86_64 2MB huge page.
>
> > a polynomial with a big exponent. But sure, growth in number of PG_Writeback
> > pages will cause relatively steep drop in the number of available huge
> > pages.
>
> It's exponential indeed, because "1 - p(x)" here means "p(!x)".
> It's exponential for a 10x increase in x resulting in 100x drop of y.
If 'x' is the probability that a page has PG_Writeback set, then the
probability that a huge page contains not a single PG_Writeback page is (as
you almost correctly wrote): (1-x)^512. This is a polynomial by definition:
it can be expressed as $\sum_{i=0}^n a_i x^i$ for $a_i \in R$ and finite $n$.
The expression decreases fast as x approaches 1, that's for sure, but
that does not make it exponential. Sorry, my mathematical side could not
resist this terminology correction.
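
To put numbers on how steep the drop is, here is a minimal sketch that just
evaluates (1-x)^512 for a few values of x. It assumes, as discussed above,
that PG_Writeback pages are spread uniformly and independently over memory;
the function name and the sample values of x are only illustrative, they do
not come from any measurement.

```python
# Illustrative only: evaluates the (1 - x)^512 expression from this thread.
# Assumption: each 4k page has PG_Writeback set independently with
# probability x (the uniform-at-random model questioned above).

SUBPAGES_PER_HUGE_PAGE = 512  # 4k pages per 2MB x86_64 huge page


def p_no_writeback(x, n=SUBPAGES_PER_HUGE_PAGE):
    """Probability that none of the n base pages is under writeback."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be a probability in [0, 1]")
    return (1.0 - x) ** n


if __name__ == "__main__":
    # Hypothetical sample fractions of pages under writeback.
    for x in (0.0001, 0.001, 0.01, 0.1):
        print(f"x = {x:<6}  P(no PG_Writeback in huge page) = {p_no_writeback(x):.3e}")
```

The value is a fixed-degree polynomial in x, yet it falls off very quickly
once x is no longer tiny compared to 1/512.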
> > ...
> > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > > > > It's always there doing 1:1 proportional throttling. Then you try to
> > > > > kick in to add *double* throttling in block/cfq layer. Now the low
> > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > > > > from its balanced state, leading to large fluctuations and program
> > > > > stalls.
> > > >
> > > > Just do the same 1:1 inside each cgroup.
> > >
> > > Sure. But the ratio mismatch I'm talking about is inter-cgroup.
> > > For example there are only 2 dd tasks doing buffered writes in the
> > > system. Now consider the mismatch that cfq is dispatching their IO
> > > requests at 10:1 weights, while balance_dirty_pages() is throttling
> > > the dd tasks at 1:1 equal split because it's not aware of the cgroup
> > > weights.
> > >
> > > What will happen in the end? The 1:1 ratio imposed by
> > > balance_dirty_pages() will take effect and the dd tasks will progress
> > > at the same pace. The cfq weights will be defeated because the async
> > > queue for the second dd (and cgroup) constantly runs empty.
> > Yup. This just shows that you have to have per-cgroup dirty limits. Once
> > you have those, things start working again.
>
> Right. I think Tejun was more or less aware of this.
>
> I was rather upset by this per-memcg dirty_limit idea indeed. I never
> expected it to work well when used extensively. My plan was to set the
> default memcg dirty_limit high enough, so that it's not hit in normal
> operation. Then Tejun came and proposed to (mis-)use dirty_limit as the way
> to convert the dirty pages' backpressure into real dirty throttling rate.
> No, that's just a crazy idea!
>
> Come on, let's not over-use memcg's dirty_limit. It's there as the
> *last resort* to keep dirty pages under control so as to maintain
> interactive performance inside the cgroup. However if used extensively
> in the system (like dozens of memcgs all hit their dirty limits), the
> limit itself may stall random dirtiers and create interactive
> performance issues!
>
> In the recent days I've come up with the idea of memcg.dirty_setpoint
> for the blkcg backpressure stuff. We can use that instead.
>
> memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate.
> Imagine bdi_setpoint. It's all the same concepts. Why do we need this?
> Because if blkcg A and B have 10:1 weights and are both doing buffered
> writes, their dirty pages should better be maintained around 10:1
> ratio to avoid underrun and hopefully achieve better IO size.
> memcg.dirty_limit cannot guarantee that goal.
I agree that to avoid stalls of throttled processes we shouldn't be
hitting memcg.dirty_limit on a regular basis. When I wrote we need "per
cgroup dirty limits" I actually imagined something like you write above -
do complete throttling computations within each memcg - estimate throughput
available for it, compute appropriate dirty rates for its processes and
from its dirty limit estimate an appropriate setpoint to balance around.
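
Just to illustrate the proportional part of this, a minimal sketch of
splitting a global dirty setpoint among cgroups in proportion to their
writeout rates. This is not kernel code: the function name, the cgroup
labels and the page counts are all hypothetical, and real balancing would
also have to estimate the rates and clamp against each memcg's dirty limit.

```python
# Illustrative only: split a global dirty setpoint among cgroups in
# proportion to their (estimated) writeout rates. All names and numbers
# here are hypothetical; nothing below is an existing kernel interface.


def split_dirty_setpoint(global_setpoint_pages, writeout_rates):
    """Return a per-cgroup setpoint (in pages) proportional to writeout rate.

    writeout_rates maps a cgroup label to its writeout rate in any
    consistent unit (only the ratios matter).
    """
    total = sum(writeout_rates.values())
    if total <= 0:
        raise ValueError("total writeout rate must be positive")
    return {
        name: global_setpoint_pages * rate / total
        for name, rate in writeout_rates.items()
    }


if __name__ == "__main__":
    # Two cgroups whose IO is dispatched at 10:1, as in the dd example.
    for name, pages in split_dirty_setpoint(1100, {"A": 10, "B": 1}).items():
        print(f"cgroup {name}: setpoint {pages:.0f} pages")
```

With 10:1 rates the dirty pages of A and B are then balanced around a 10:1
ratio as well, which is the property a plain dirty limit cannot give us.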
> But be warned! Partitioning the dirty pages always means more
> fluctuations of dirty rates (and even stalls) that are perceivable by
> the user. Which means another limiting factor for the backpressure
> based IO controller to scale well.
Sure, the smaller the memcg gets, the more noticeable these fluctuations
would be. I would not expect a memcg with 200 MB of memory to behave better
(and also not much worse) than a machine with that much memory...
Honza
--
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR