Subject: Re: [PATCH V4 00/10] memcg: per cgroup background reclaim
From: Ying Han <yinghan@google.com>
Date: Fri, 15 Apr 2011 09:40:54 -0700
To: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
 Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
 Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
 Hugh Dickins, Dave Hansen, Zhu Yanhai, linux-mm@kvack.org
In-Reply-To: <20110415094040.GC8828@tiehlicka.suse.cz>
References: <1302821669-29862-1-git-send-email-yinghan@google.com>
 <20110415094040.GC8828@tiehlicka.suse.cz>

On Fri, Apr 15, 2011 at 2:40 AM, Michal Hocko <mhocko@suse.cz> wrote:
> Hi Ying,
> sorry that I am jumping into the game this late, but I was quite busy
> after returning from LSF and LFCS.

Sure. Nice meeting you guys there, and thank you for looking into this patch :)

> On Thu 14-04-11 15:54:19, Ying Han wrote:
> > The current implementation of memcg supports targeted reclaim: when a
> > cgroup is reaching its hard_limit, we do direct reclaim on that cgroup.
> > Per-cgroup background reclaim is needed to help spread the memory
> > pressure out over a longer period of time and smooth out the cgroup's
> > performance.
> >
> > If the cgroup is configured to use per-cgroup background reclaim, a
> > kswapd thread is created which scans only the per-memcg LRU list.

> Hmm, I am wondering if this fits into the get-rid-of-the-global-LRU
> strategy. If we make the background reclaim per-cgroup, how do we
> balance from the global/zone POV? We can end up with all groups over
> the high limit while a memory zone is under this watermark. Or am I
> missing something?
> I thought that the plans for the background reclaim were the same as
> for direct reclaim, so that kswapd would just evict pages from groups
> in round-robin fashion (in the first round just those that are under
> limit, and proportionally when it cannot reach the high watermark
> after it got through all groups).

I think you are talking about the soft_limit reclaim, which I am going
to look at next. The soft_limit reclaim is triggered under global memory
pressure and does round-robin across the memcgs. I will also cover the
zone balancing by having a second list of memcgs that are under their
soft_limit.

Here is the summary of our LSF discussion :)
http://permalink.gmane.org/gmane.linux.kernel.mm/60966

> > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > background reclaim and to stop it. The watermarks are calculated
> > based on the cgroup's limit_in_bytes.

> I didn't have time yet to look at how the calculation works in the
> patch, but we should be careful to match the zone's watermark
> expectations.

I have an API in the following patch which provides high/low_wmark_distance
to tune the wmarks individually. By default they are set to 0, which turns
off the per-memcg kswapd. For now we are OK, since the global kswapd is
still doing the per-zone scanning and reclaiming :)
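To make the calculation concrete: the wmarks come out as limit_in_bytes
minus the corresponding distance. A back-of-the-envelope check against the
500M cgroup used in the test below (this is just shell arithmetic, not code
from the patch):

$ echo $((500*1024*1024 - 40*1024*1024))  # low_wmark  = limit - low_wmark_distance
482344960
$ echo $((500*1024*1024 - 50*1024*1024))  # high_wmark = limit - high_wmark_distance
471859200

These match the memory.reclaim_wmarks output in Step2.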

> > By default, the per-memcg kswapd threads run under the root cgroup.
> > There is a per-memcg API which exports the pid of each kswapd thread,
> > and userspace can configure the cpu cgroup separately.
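For illustration, the intended usage is something like the following; the
pid file name here is only a placeholder for whatever the API patch ends
up calling it, and the cpu cgroup mount point will depend on your setup:

$ cat /dev/cgroup/memory/A/memory.kswapd_pid    # hypothetical file name
1234
$ mkdir /dev/cgroup/cpu/kswapd_A
$ echo 1234 >/dev/cgroup/cpu/kswapd_A/tasks     # schedule/throttle it separately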
> >
> > I ran a dd test on a large file and then cat the file. Then I compared
> > the reclaim-related stats in memory.stat.
> >
> > Step1: Create a cgroup with a 500M memory limit.
> > $ mkdir /dev/cgroup/memory/A
> > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > $ echo $$ >/dev/cgroup/memory/A/tasks
> >
> > Step2: Test and set the wmarks.
> > $ cat /dev/cgroup/memory/A/memory.low_wmark_distance
> > 0
> > $ cat /dev/cgroup/memory/A/memory.high_wmark_distance
> > 0


They are used to tune the high/low wmarks based on the hard_limit. We
might need to export that configuration to the admin, especially on
machines that over-commit via hard_limit.

> >
> > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > low_wmark 524288000
> > high_wmark 524288000
> >
> > $ echo 50m >/dev/cgroup/memory/A/memory.high_wmark_distance
> > $ echo 40m >/dev/cgroup/memory/A/memory.low_wmark_distance
> >
> > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > low_wmark  482344960
> > high_wmark 471859200

> low_wmark is higher than high_wmark?

Hah, it is confusing. I have them documented. Basically, low_wmark
triggers the reclaim and high_wmark stops it, and we have

high_wmark < usage < low_wmark.
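Spelled out with the Step2 numbers (assuming the trigger/stop semantics
described above):

  usage > 482344960 (low_wmark)   -> per-memcg kswapd wakes up and reclaims
  usage < 471859200 (high_wmark)  -> kswapd stops
  in between                      -> kswapd keeps its current state, so the
                                     ~10M gap acts as hysteresis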

> [...]
> > Note:
> > This is the first effort at extending the targeted reclaim into memcg.
> > Here are the existing known issues and our plan:
> >
> > 1. There is one kswapd thread per cgroup. The thread is created when
> > the cgroup changes its limit_in_bytes and is deleted when the cgroup
> > is removed. In an environment where thousands of cgroups are configured
> > on a single host, we will have thousands of kswapd threads. The memory
> > consumption would be 8k * 1000 = 8M. We don't see a big issue for now
> > if the host can host that many cgroups.

> I think that the zone background reclaim is a much bigger issue than
> the 8k per kernel thread and too many threads...

Yes.

> I am not sure how orthogonal the per-cgroup-per-thread and the zone
> approaches are, though. Maybe it makes some sense to do both per-cgroup
> and zone background reclaim. Anyway, I think that we should start with
> the zone reclaim first.

I missed the point here. Can you clarify what you mean by the zone
reclaim here?

> [...]

> > 4. No hierarchical reclaim support in this patchset. I would like to
> > get to it after the basic stuff has been accepted.

> Just an idea.
> If we did that from the zone's POV, then we could call
> mem_cgroup_hierarchical_reclaim, right?

Maybe. I need to think that through; for this version I don't plan to
include hierarchical reclaim.

--Ying
> [...]
>
> Thanks
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
