From: Balbir Singh <balbir@linux.vnet.ibm.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
YAMAMOTO Takashi <yamamoto@valinux.co.jp>,
Paul Menage <menage@google.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC 0/5] Memory controller soft limit introduction (v3)
Date: Mon, 30 Jun 2008 09:11:19 +0530
Message-ID: <486855DF.2070100@linux.vnet.ibm.com>
In-Reply-To: <20080630102054.ee214765.kamezawa.hiroyu@jp.fujitsu.com>

KAMEZAWA Hiroyuki wrote:
> On Sun, 29 Jun 2008 10:32:03 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>> I have a couple of comments.
>>>
>>> 1. Why do you add soft_limit to res_counter?
>>> Is there any other controller which uses soft-limit?
>>> I'll move watermark handling to memcg from res_counter because it's
>>> required only by memcg.
>>>
>> I expect soft_limits to be controller independent. The same thing can be applied
>> to an io-controller for example, right?
>>
>
> I can't imagine how soft-limit works on an i/o controller. Could you explain?
>
An io-controller could have the same concept: a hard limit on the bandwidth,
and a soft limit that a group is allowed to exceed provided there is no i/o
bandwidth congestion.
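
As a rough illustration of why this could stay controller independent, here
is a minimal user-space sketch. It is not code from this patch set and not
the res_counter API; struct soft_counter, soft_counter_charge() and the
"contended" flag are hypothetical names, and refusing the charge is only one
possible reaction to contention.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a counter that carries both limits. */
struct soft_counter {
	unsigned long usage;
	unsigned long soft_limit;	/* may be exceeded while uncontended */
	unsigned long hard_limit;	/* may never be exceeded */
};

/*
 * Try to charge @val units. @contended says whether the resource
 * (memory, i/o bandwidth, ...) is currently under pressure.
 * Returns true and updates usage on success, false otherwise.
 */
bool soft_counter_charge(struct soft_counter *c, unsigned long val,
			 bool contended)
{
	assert(c->soft_limit <= c->hard_limit);

	/* Written as a subtraction so that usage + val cannot overflow. */
	if (c->usage > c->hard_limit || val > c->hard_limit - c->usage)
		return false;

	/* Under contention the group is held to its soft limit. */
	if (contended && c->usage + val > c->soft_limit)
		return false;

	c->usage += val;
	return true;
}
```

Nothing in the sketch depends on what the units are, which is the sense in
which I expect the concept to apply to more than one controller.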
>
>>> 2. *please* handle NUMA
>>> There is a fundamental difference between global VMM and memcg.
>>> global VMM - reclaims memory at memory shortage.
>>> memcg - reclaims memory at memory limit.
>>> So memcg wasn't required to handle place-of-memory when hitting the limit;
>>> *just reducing the usage* was enough.
>>> In this set, you try to handle memory shortage.
>>> So, please handle NUMA, i.e. "what node do you want to reclaim memory from?"
>>> If not,
>>> - memory placement of Apps can be terrible.
>>> - cannot work well with cpuset. (I think)
>>>
>> try_to_free_mem_cgroup_pages() handles NUMA right? We start with the
>> node_zonelists of the current node on which we are executing. I can pass on the
>> zonelist from __alloc_pages_internal() to try_to_free_mem_cgroup_pages(). Is
>> there anything else you had in mind?
>>
> Assume the following case of a host with 2 nodes and the following mount style.
>
> mount -t cgroup -o memory,cpuset none /opt/cgroup/
>
>
> /Group1: cpu 0-1, mem=0 limit=1G, soft-limit=700M
> /Group2: cpu 2-3, mem=1 limit=1G soft-limit=700M
> ....
> /Groupxxxx
>
> Assume an environment after some workload,
>
> /Group1: cpu 0-1, mem=0 limit=1G, soft-limit=700M usage=990M
> /Group2: cpu 2-3, mem=1 limit=1G soft-limit=700M usage=400M
>
> *And* memory of node "1" is in shortage and the kernel has to reclaim
> memory from node "1".
>
> Your routine tries to reclaim memory from a group which exceeds its soft-limit
> ....Group1. But it's no help because Group1 doesn't contain any memory in Node1.
> And to make it worse, your routine doesn't try to call try_to_free_pages() in global
> LRU when your soft-limit reclaims some memory. So, if a task in Group 1 continues
> to allocate memory at some speed, memory shortage in Group2 will not be recovered
> easily.
>
> This includes 2 aspects of trouble.
> - Group1's memory is reclaimed but it's wrong.
> - Group2's try_to_free_pages() may take a very long time.
>
> (Current page shrinking under cpuset seems to scan all nodes.
> This seems not to be quick, but it works because it scans all.
> This will be another problem, anyway ;).
>
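To make the scenario above concrete, here is a minimal user-space sketch of
node-aware victim selection. It is not what the posted patches do; struct
group, group_usage(), pick_victim() and MAX_NODES are hypothetical names, and
per-node usage is simply assumed to be available.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 2	/* hypothetical: the two-node host from the example */

struct group {
	const char *name;
	unsigned long soft_limit;		/* MB */
	unsigned long usage[MAX_NODES];		/* MB charged on each node */
};

/* Total usage of a group across all nodes. */
unsigned long group_usage(const struct group *g)
{
	unsigned long sum = 0;
	int nid;

	for (nid = 0; nid < MAX_NODES; nid++)
		sum += g->usage[nid];
	return sum;
}

/*
 * Pick the group with the largest excess over its soft limit among the
 * groups that actually hold memory on node @nid. Returns NULL when there
 * is no such group; the caller should then fall back to global reclaim.
 */
const struct group *pick_victim(const struct group *groups, size_t n, int nid)
{
	const struct group *victim = NULL;
	unsigned long best = 0;
	size_t i;

	assert(nid >= 0 && nid < MAX_NODES);

	for (i = 0; i < n; i++) {
		unsigned long total = group_usage(&groups[i]);
		unsigned long excess;

		if (total <= groups[i].soft_limit)
			continue;	/* within its soft limit */
		if (groups[i].usage[nid] == 0)
			continue;	/* nothing to reclaim on this node */

		excess = total - groups[i].soft_limit;
		if (excess > best) {
			best = excess;
			victim = &groups[i];
		}
	}
	return victim;
}
```

With Group1 at 990M on node 0 and Group2 at 400M on node 1, a shortage on
node 1 selects no victim in this sketch, so the fallback to global LRU
reclaim is what would have to make progress.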
>
> BTW, currently try_to_free_mem_cgroup_pages() always assumes
> GFP_HIGHUSER_MOVABLE.
> ==
> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>                                            gfp_t gfp_mask)
> {
>         struct scan_control sc = {
>                 .may_writepage = !laptop_mode,
>                 .may_swap = 1,
>                 .swap_cluster_max = SWAP_CLUSTER_MAX,
>                 .swappiness = vm_swappiness,
>                 .order = 0,
>                 .mem_cgroup = mem_cont,
>                 .isolate_pages = mem_cgroup_isolate_pages,
>         };
>         struct zonelist *zonelist;
>
>         sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>                         (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
>         zonelist = NODE_DATA(numa_node_id())->node_zonelists;
>         return do_try_to_free_pages(zonelist, &sc);
> }
> ==
> Please select an appropriate zonelist here.
>
We do have zonelist information in __alloc_pages_internal(), so it should be
easy to pass the zonelist in, or to fall back to a good default (the current
one) if no zonelist is provided to the routine.
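
The fallback I have in mind looks roughly like the user-space sketch below.
The types are stand-ins, not the kernel's: struct zonelist here is a dummy,
current_node plays the role of numa_node_id(), and reclaim_zonelist() is a
hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 2			/* hypothetical two-node host */

struct zonelist {			/* dummy stand-in for the kernel type */
	int node;
};

static struct zonelist node_zonelists[NR_NODES] = { { 0 }, { 1 } };
static int current_node;		/* stand-in for numa_node_id() */

/*
 * Use the zonelist handed down by the allocator when there is one;
 * otherwise default to the zonelist of the node we are running on,
 * which is what the routine does today.
 */
const struct zonelist *reclaim_zonelist(const struct zonelist *zl)
{
	assert(current_node >= 0 && current_node < NR_NODES);

	if (zl)
		return zl;
	return &node_zonelists[current_node];
}
```

Existing callers that have no zonelist at hand would pass NULL and keep the
current behaviour.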
>
>>> 3. I think it is unclear when "mem_cgroup_reclaim_on_contention" exits.
>>> Please add an explanation of the algorithm. Does it return when some pages are reclaimed?
>>>
>> Sure, I will do that.
>>
>>> 4. When a swap-full cgroup, which tends to contain tons of memory, is on
>>> the top of the heap, a large amount of cpu-time will be wasted.
>>> Can we add an "ignore me" flag?
>>>
>> Could you elaborate on swap-full cgroup please? Are you referring to changes
>> introduced by the memcg-handle-swap-cache patch? I don't mind adding an
>> "ignore me" flag, but I guess we need to figure out when a cgroup is swap full.
>>
> No. I mean the no-available-swap, or all-swap-is-used, situation.
>
> This situation will happen very easily once the swap-controller comes.
We'll definitely deal with it when the swap-controller comes in.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL