From: Ying Han <yinghan@google.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>,
"balbir@linux.vnet.ibm.com" <balbir@linux.vnet.ibm.com>,
hannes@cmpxchg.org, Michal Hocko <mhocko@suse.cz>
Subject: Re: [PATCH 8/8] memcg asyncrhouns reclaim workqueue
Date: Tue, 24 May 2011 22:51:55 -0700 [thread overview]
Message-ID: <BANLkTimpzLpGqtuNaPUP3hAhOS3eA_iG1A@mail.gmail.com> (raw)
In-Reply-To: <20110523092557.30d322aa.kamezawa.hiroyu@jp.fujitsu.com>
On Sun, May 22, 2011 at 5:25 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 20 May 2011 18:26:40 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > On Sat, 21 May 2011 09:41:50 +0900 Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> wrote:
> >
> > > 2011/5/21 Andrew Morton <akpm@linux-foundation.org>:
> > > > On Fri, 20 May 2011 12:48:37 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > >
> > > >> workqueue for memory cgroup asynchronous memory shrinker.
> > > >>
> > > >> This patch implements the workqueue of the async shrinker routine.
> > > >> Each memcg has a work item and only one work item can be scheduled
> > > >> at the same time.
> > > >>
> > > >> If shrinking memory doesn't go well, a delay will be added to the
> > > >> work.
> > > >>
> > > >
> > > > When this code explodes (as it surely will), users will see large
> > > > amounts of CPU consumption in the work queue thread. We want to make
> > > > this as easy to debug as possible, so we should try to make the
> > > > workqueue's names mappable back onto their memcgs. And anything else
> > > > we can think of to help?
>
When we debug kswapd issues in a memory isolation environment, the first
step is to identify which cgroup the kswapd thread is working on. We need an
easy way to make that direct mapping, either by reading an API or just by
looking at "top". So mapping the "kworker" names back to their memcgs helps
here.
Also, we need an easy way to track the amount of cputime consumed by kswapd
on a per-memcg basis. We could probably export that number in the per-memcg
memory.stat. Kame has a patch for that from the last post.
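For illustration only, here is a minimal sketch of that accounting; the field
and helper names (async_reclaim_ns, mem_cgroup_shrink_accounted) are made up
and not from Kame's patch:

	/*
	 * Hypothetical sketch: accumulate the time a memcg spends in async
	 * reclaim so it can be exported, e.g. as a memory.stat field, and
	 * matched against what "top" shows for the worker thread.
	 * "async_reclaim_ns" would be a new u64 field in struct mem_cgroup.
	 */
	static bool mem_cgroup_shrink_accounted(struct mem_cgroup *mem, long nr_pages)
	{
		u64 start = sched_clock();
		bool congested;

		congested = mem_cgroup_shrink_static_scan(mem, nr_pages);
		mem->async_reclaim_ns += sched_clock() - start;

		return congested;
	}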
> > > >
> > >
> > > I had a patch for showing per-memcg reclaim latency stats. It will
> > > help. I'll add it again to this set. I just dropped it because there
> > > are many patches touching memory.stat in flight.
> >
> > Will that patch help us when users report the memcg equivalent of
> > "kswapd uses 99% of CPU"?
> >
> I think so. Each memcg shows what amount of cpu it used.
>
> But maybe it's not an easy interface. I have several ideas.
>
>
> An idea I have is to rename task->comm by overwriting it from kworker/u:%d
> to memcg/%d when the work is scheduled. I think this can be implemented with
> a very simple interface and flags to workqueue. Then, ps -elf can show what
> is going on.
> If necessary, I'll add a hard limit on cpu usage for a work item, or I'll
> limit the number of threads for the memcg workqueue.
>
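As an illustration of the renaming idea above, here is a minimal sketch.
set_task_comm() and css_id() are existing kernel helpers; the function and
field names are assumptions, not the actual patch:

	/*
	 * Hypothetical sketch: when the per-memcg work runs, rename the
	 * kworker so ps/top show which memcg it is reclaiming for.
	 */
	static void mem_cgroup_async_shrink_worker(struct work_struct *work)
	{
		struct mem_cgroup *mem = container_of(work, struct mem_cgroup,
						      async_work.work);
		char comm[TASK_COMM_LEN];

		snprintf(comm, sizeof(comm), "memcg/%d", css_id(&mem->css));
		set_task_comm(current, comm);	/* overwrites "kworker/u:%d" */

		/* ... do the actual shrinking here ... */
	}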
Does it make sense to use the memcg's css->id as the name, if that is not
the case yet? Otherwise, it is hard to link kworker/%d (or memcg/%d later)
back to the memcg it is working on.
In the last post of the per-memcg-per-kswapd implementation, I have the
thread named "memcg-<css_id>", and also a per-memcg API to export its
css_id. So we can easily identify which kernel thread belongs to which
owner.
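For reference, a minimal sketch of what such an export could look like; the
file name and function are hypothetical, while cftype, read_u64, css_id() and
mem_cgroup_from_cont() are existing cgroup/memcg interfaces:

	/*
	 * Hypothetical sketch: a read-only per-memcg file so userspace can
	 * map a "memcg-<css_id>" kernel thread back to the cgroup that owns it.
	 */
	static u64 mem_cgroup_css_id_read(struct cgroup *cont, struct cftype *cft)
	{
		struct mem_cgroup *mem = mem_cgroup_from_cont(cont);

		return css_id(&mem->css);
	}

	static struct cftype memcg_css_id_file = {
		.name = "css_id",		/* would show up as memory.css_id */
		.read_u64 = mem_cgroup_css_id_read,
	};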
> Considering there are users who use 2000+ memcgs on a system, a thread per
> memcg was not a choice for me.
So that is only about 2000 * 8k = 16M worth of memory overhead on the machine
(and a host running 2000+ memcgs is probably a very large machine anyway).
We've run systems with 1000+ kswapds without noticing trouble from that. What
it buys us is better visibility (plus the option of a per-memcg-kswapd cpu
limit) and debuggability.
Sorry, I know we have discussed this before on another thread, but I can't
stop myself from repeating it here again :( I just want to provide a
datapoint: we have lots of kswapd threads (> 1000) per host and that has not
caused us any of the issues you were concerned about. :)
--Ying
> Another idea was a thread pool or a workqueue. Because a thread pool can be
> a poor reimplementation of a workqueue, I used a workqueue.
>
>
> I'll implement some of the ideas above in the next version.
>
>
> > > >
> > > >> +	limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> > > >> +	shrink_to = limit - MEMCG_ASYNC_MARGIN - PAGE_SIZE;
> > > >> +	usage = res_counter_read_u64(&mem->res, RES_USAGE);
> > > >> +	if (shrink_to <= usage) {
> > > >> +		required = usage - shrink_to;
> > > >> +		required = (required >> PAGE_SHIFT) + 1;
> > > >> +		/*
> > > >> +		 * This scans some number of pages and returns whether
> > > >> +		 * memory reclaim was slow or not. If slow, we add a delay
> > > >> +		 * as congestion_wait() does in vmscan.c
> > > >> +		 */
> > > >> +		congested = mem_cgroup_shrink_static_scan(mem, (long)required);
> > > >> +	}
> > > >> +	if (test_bit(ASYNC_NORESCHED, &mem->async_flags)
> > > >> +		|| mem_cgroup_async_should_stop(mem))
> > > >> +		goto finish_scan;
> > > >> +	/* If memory reclaim couldn't go well, add delay */
> > > >> +	if (congested)
> > > >> +		delay = HZ/10;
> > > >
> > > > Another magic number.
> > > >
> > > > If Moore's law holds, we need to reduce this number by 1.4 each year.
> > > > Is this good?
> > > >
> > >
> > > Not good. I just used the same magic number now used with
> > > wait_iff_congested.
> > > Other than a timer, I can use the pagein/pageout event counters. If we
> > > have dirty_ratio, I may be able to link this to dirty_ratio and wait
> > > until dirty_ratio is low enough.
> > > Or, wake up again when the limit is hit.
> > >
> > > Do you have a suggestion?
> > >
> >
> > mm.. It would be pretty easy to generate an estimate of "pages scanned
> > per second" from the contents of (and changes in) the scan_control.
>
> Hmm.
>
> > Knowing that datum and knowing the number of pages in the memcg, we
> > should be able to come up with a delay period which scales
> > appropriately with CPU speed and with memory size?
> >
> > Such a thing could be used to rationalise magic delays in other places,
> > hopefully.
> >
>
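To make that suggestion concrete, here is a minimal sketch of deriving the
requeue delay from an observed scan rate instead of a fixed HZ/10; the
function name, parameters, and the 1% target are all assumptions for
illustration, not from any posted patch:

	/*
	 * Hypothetical sketch: scale the delay with how fast this memcg is
	 * actually being scanned.  "scanned" and "elapsed_jiffies" would come
	 * from scan_control bookkeeping around the shrink pass.
	 */
	static unsigned long memcg_async_reclaim_delay(unsigned long scanned,
						       unsigned long elapsed_jiffies,
						       unsigned long memcg_pages)
	{
		unsigned long scan_rate;	/* pages per jiffy */

		if (!elapsed_jiffies)
			elapsed_jiffies = 1;
		scan_rate = max(scanned / elapsed_jiffies, 1UL);

		/*
		 * Wait roughly as long as it would take to scan 1% of the
		 * memcg at the observed rate, clamped to [1 jiffy, 1 second].
		 */
		return clamp(memcg_pages / 100 / scan_rate, 1UL, (unsigned long)HZ);
	}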
> Ok, I'll consider that. Thank you for the nice idea.
>
>
> > >
> > > >> +	queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay);
> > > >> +	return;
> > > >> +finish_scan:
> > > >> +	cgroup_release_and_wakeup_rmdir(&mem->css);
> > > >> +	clear_bit(ASYNC_RUNNING, &mem->async_flags);
> > > >> +	return;
> > > >> +}
> > > >> +
> > > >> +static void run_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
> > > >> +{
> > > >> +	if (test_bit(ASYNC_NORESCHED, &mem->async_flags))
> > > >> +		return;
> > > >
> > > > I can't work out what ASYNC_NORESCHED does. Is its name well-chosen?
> > > >
> > > How about BLOCK/STOP_ASYNC_RECLAIM?
> >
> > I can't say - I don't know what it does! Or maybe I did, and immediately
> > forgot ;)
> >
>
> I'll find a better name ;)
>
> Thanks,
> -Kame
>
>