Subject: Re: [PATCH V6 00/10] memcg: per cgroup background reclaim
From: Ying Han
Date: Wed, 20 Apr 2011 20:05:05 -0700
To: Johannes Weiner
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter, Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm@kvack.org

On Wed, Apr 20, 2011 at 7:51 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:

> Hello Ying,
>
> I'm sorry to chime in so late; I was still traveling until Monday.

Hey, hope you had a great trip :)

> On Mon, Apr 18, 2011 at 08:57:36PM -0700, Ying Han wrote:
> > The current implementation of memcg supports targeted reclaim when the
> > cgroup is reaching its hard_limit, and we do direct reclaim per cgroup.
> > Per cgroup background reclaim is needed; it helps to spread memory
> > pressure out over a longer period of time and smooths out the
> > cgroup's performance.
>
> Latency reduction makes perfect sense; the reasons kswapd exists apply
> to memory control groups as well. But I disagree with the design
> choices you made.
>
> > If the cgroup is configured to use per cgroup background reclaim, a
> > kswapd thread is created which only scans the per-memcg LRU list.
>
> We already have direct reclaim, direct reclaim on behalf of a memcg,
> and global kswapd-reclaim. Please don't add yet another reclaim path
> that does its own thing and interacts unpredictably with the rest of
> them.

Yes, we do have per-memcg direct reclaim and kswapd-reclaim, but the
latter is global, and we don't want to start reclaiming from each memcg
until we reach global memory pressure.

> As discussed at LSF, we want to get rid of the global LRU. So the
> goal is to have each reclaim entry end up at the same core part of
> reclaim that round-robin scans a subset of zones from a subset of
> memory control groups.

True, but that is for a system under global memory pressure, where we
would like to do targeted reclaim instead of reclaiming from the global
LRU. This patch is doing something different: targeted reclaim,
proactively, per-memcg, based on each cgroup's hard_limit.

> > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > background reclaim and stop it. The watermarks are calculated based
> > on the cgroup's limit_in_bytes.
>
> Which brings me to the next issue: making the watermarks configurable.
>
> You argued that having them adjustable from userspace is required for
> overcommitting the hardlimits and for per-memcg kswapd reclaim not
> kicking in in case of global memory pressure. But that is only a
> problem because global kswapd reclaim is (apart from soft limit
> reclaim) unaware of memory control groups.
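(As an aside, for concreteness: a watermark derivation along the lines quoted above could look like the sketch below. This is only an illustration, not the patch's actual code; the 95%/90% ratios, the names, and the start/stop semantics, namely waking per-memcg kswapd once usage rises above high_wmark and stopping once it falls below low_wmark, are all assumptions.)

```c
#include <assert.h>

typedef unsigned long long u64;

/* Illustrative only: derive both watermarks from limit_in_bytes so
 * that background reclaim starts before the hard limit is hit and
 * stops a bit lower, giving some hysteresis between start and stop. */
struct memcg_wmarks {
	u64 high_wmark;	/* usage above this wakes per-memcg kswapd */
	u64 low_wmark;	/* reclaim stops once usage drops below this */
};

static struct memcg_wmarks setup_wmarks(u64 limit_in_bytes)
{
	struct memcg_wmarks w;

	/* Assumed ratios: high = 95% of the limit, low = 90%. */
	w.high_wmark = limit_in_bytes - limit_in_bytes / 20;
	w.low_wmark  = limit_in_bytes - limit_in_bytes / 10;
	return w;
}
```
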
> I think the much better solution is to make global kswapd memcg aware
> (with the above mentioned round-robin reclaim scheduler), compared to
> adding new (and final!) kernel ABI to avoid an internal shortcoming.

We do need to make global kswapd memcg aware, and that is the
soft_limit hierarchical reclaim. It is different from per-memcg
background reclaim, where we want to reclaim memory per-memcg before
tasks fall into per-memcg direct reclaim.

> The whole exercise of asynchronous background reclaim is to reduce
> reclaim latency. We already have a mechanism for global memory
> pressure in place. Per-memcg watermarks should only exist to avoid
> direct reclaim due to hitting the hardlimit, nothing else.

Yes, but we have per-memcg direct reclaim, which is based on the
hard_limit. The latency we need to reduce is that of direct reclaim,
which is a different problem from global memory pressure.

> So in summary, I think converting the reclaim core to this round-robin
> scheduler solves all these problems at once: a single code path for
> reclaim, breaking up of the global lru lock, fair soft limit reclaim,
> and a mechanism for latency reduction that just DTRT without any
> user-space configuration necessary.

Not exactly. There will be cases where only a few cgroups are
configured and the total hard_limit is always less than the machine
capacity, so we will never trigger global memory pressure. However, we
still need to smooth out per-memcg performance by doing background page
reclaim proactively before the cgroups hit their hard_limit (direct
reclaim).

--Ying

>	Hannes
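(To make the round-robin idea concrete: below is a toy sketch of the scheduler shape Johannes describes, where a reclaim pass takes a small batch from each memcg's LRU in turn instead of scanning one global LRU, so no single group bears the whole cost. Every name, the batch size, and the bookkeeping here are invented for illustration; this is not kernel code or the proposed implementation.)

```c
#include <assert.h>

#define NR_MEMCG 3
#define BATCH 32UL	/* assumed per-group batch size */

/* Illustrative stand-in for a memory cgroup's reclaimable LRU size. */
struct memcg {
	unsigned long nr_lru_pages;
};

static struct memcg cgroups[NR_MEMCG];
static int next_memcg;	/* round-robin cursor, persists across calls */

/* Reclaim up to `target` pages, taking at most BATCH pages from each
 * cgroup in turn; give up after a full idle pass over empty LRUs. */
static unsigned long reclaim_round_robin(unsigned long target)
{
	unsigned long reclaimed = 0;
	int idle_passes = 0;

	while (reclaimed < target && idle_passes < NR_MEMCG) {
		struct memcg *mc = &cgroups[next_memcg];
		unsigned long take = mc->nr_lru_pages < BATCH ?
				     mc->nr_lru_pages : BATCH;

		if (take > target - reclaimed)
			take = target - reclaimed;
		if (take == 0)
			idle_passes++;	/* nothing left on this LRU */
		else
			idle_passes = 0;

		mc->nr_lru_pages -= take;
		reclaimed += take;
		next_memcg = (next_memcg + 1) % NR_MEMCG;
	}
	return reclaimed;
}
```

With three groups of 100 pages each and a target of 96, one rotation takes 32 pages from each group, leaving the pressure evenly spread.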