linux-mm.kvack.org archive mirror
* Question about cgroup hierarchy and reducing memory limit
@ 2010-11-22 16:59 Evgeniy Ivanov
  2010-11-24  0:47 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 9+ messages in thread
From: Evgeniy Ivanov @ 2010-11-22 16:59 UTC (permalink / raw)
  To: linux-mm

Hello,

I have the following cgroup hierarchy:

  Root
  /   |
A   B

A and B have memory limits that together add up to 100% of the limit set in Root.
I want to add C to root:

  Root
  /   |  \
A   B  C

What is the correct way to shrink the limits of A and B? When they are using all of
their allowed memory and I try to write to their limit files, I get an error. It
seems that I can shrink their limits in repeated 1 MB steps and that
works, but it looks ugly, like a very dirty workaround.


-- 
Evgeniy Ivanov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: Question about cgroup hierarchy and reducing memory limit
  2010-11-22 16:59 Question about cgroup hierarchy and reducing memory limit Evgeniy Ivanov
@ 2010-11-24  0:47 ` KAMEZAWA Hiroyuki
  2010-11-24 12:17   ` Evgeniy Ivanov
  0 siblings, 1 reply; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-24  0:47 UTC (permalink / raw)
  To: Evgeniy Ivanov; +Cc: linux-mm

On Mon, 22 Nov 2010 19:59:41 +0300
Evgeniy Ivanov <lolkaantimat@gmail.com> wrote:

> Hello,
> 
Hi,

> I have following cgroup hierarchy:
> 
>   Root
>   /   |
> A   B
> 
> A and B have memory limits set so that it's 100% of limit set in Root.
> I want to add C to root:
> 
>   Root
>   /   |  \
> A   B  C
> 
> What is correct way to shrink limits for A and B? When they use all
> allowed memory and I try to write to their limit files I get error. 

What kind of error? Do you have swap? What is the kernel version?

> It seems, that I can shrink their limits multiple times by 1Mb and it
> works, but looks ugly and like very dirty workaround.
> 

It is designed to allow shrinking at once, but that means releasing memory
and forcing writeback. To release memory, the kernel may have to write pages
out to swap. If the tasks in "A" and "B" are too busy and touch tons of memory
while the limit is shrinking, the shrink may fail.

It may be a regression. Kernel version is important.

Could you show the memory.stat file when you shrink "A" and "B"?
And what happens if you run
# sync
# sync
# sync
# reduce memory A
# reduce memory B

one by one?
Thanks,
-Kame


* Re: Question about cgroup hierarchy and reducing memory limit
  2010-11-24  0:47 ` KAMEZAWA Hiroyuki
@ 2010-11-24 12:17   ` Evgeniy Ivanov
  2010-11-25  1:04     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 9+ messages in thread
From: Evgeniy Ivanov @ 2010-11-24 12:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm

On Wed, Nov 24, 2010 at 3:47 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 22 Nov 2010 19:59:41 +0300
> Evgeniy Ivanov <lolkaantimat@gmail.com> wrote:
>
>> Hello,
>>
> Hi,
>
>> I have following cgroup hierarchy:
>>
>>   Root
>>   /   |
>> A   B
>>
>> A and B have memory limits set so that it's 100% of limit set in Root.
>> I want to add C to root:
>>
>>   Root
>>   /   |  \
>> A   B  C
>>
>> What is correct way to shrink limits for A and B? When they use all
>> allowed memory and I try to write to their limit files I get error.
>
> What kind of error? Do you have swap? What is the kernel version?

Kernel is 2.6.31-5 from SLES-SP1 (my build, but without extra patches).
I have 2 GB of swap with just 40 MB used. The machine has 3 GB of RAM and no load
(neither memory nor CPU).

The error is "-bash: echo: write error: Device or resource busy" when I
write to memory.limit_in_bytes.

> It's designed to allow "shrink at once" but that means release memory
> and do forced-writeback. To release memory, it may have to write back
> to swap. If tasks in "A" and "B" are too busy and touch tons of memory
> while shrinking, it may fail.

Well, in my test I have a process which uses 30M of memory and, in a loop,
dirties all of its pages (just a single byte per page), then sleeps 5 seconds
before the next iteration.
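
A workload like the one described can be sketched in C as follows. This is only an illustration, not the actual test program: the function names dirty_all_pages() and run_workload() are made up here, and the 4096-byte fallback page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Write one byte into every page of buf; returns the number of pages touched.
 * The volatile qualifier keeps the compiler from dropping the stores. */
static size_t dirty_all_pages(volatile unsigned char *buf, size_t len,
                              size_t page_size)
{
    size_t touched = 0;

    assert(buf != NULL && page_size > 0);
    for (size_t off = 0; off < len; off += page_size) {
        buf[off]++;          /* a single byte is enough to dirty the page */
        touched++;
    }
    return touched;
}

/* Allocate len bytes, then repeat: dirty every page, sleep sleep_sec seconds.
 * Returns the total number of page touches, or 0 if allocation failed. */
static size_t run_workload(size_t len, unsigned iterations, unsigned sleep_sec)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page_size = ps > 0 ? (size_t)ps : 4096;   /* assumed fallback */
    unsigned char *buf = calloc(1, len);
    size_t total = 0;

    if (buf == NULL)
        return 0;
    for (unsigned i = 0; i < iterations; i++) {
        total += dirty_all_pages(buf, len, page_size);
        if (sleep_sec)
            sleep(sleep_sec);
    }
    free(buf);
    return total;
}
```

Calling run_workload(30u * 1024 * 1024, n, 5) from a task placed in the cgroup would approximate the test above: 30M of anonymous memory that is fully re-dirtied every 5 seconds.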

> It may be a regression. Kernel version is important.
>
> Could you show memory.stat file when you shrink "A" and "B" ?
> And what happens
> # sync
> # sync
> # sync
> # reduce memory A
> # reduce memory B

Sync doesn't help. Here is the log, with memory.stat for the group I tried to shrink:

ivanoe:/cgroups/root# cat C/memory.stat
cache 0
rss 90222592
mapped_file 0
pgpgin 1212770
pgpgout 1190743
inactive_anon 45338624
active_anon 44883968
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 94371840
hierarchical_memsw_limit 9223372036854775807
total_cache 0
total_rss 90222592
total_mapped_file 0
total_pgpgin 1212770
total_pgpgout 1190743
total_inactive_anon 45338624
total_active_anon 44883968
total_inactive_file 0
total_active_file 0
total_unevictable 0
ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
-bash: echo: write error: Device or resource busy
ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
-bash: echo: write error: Device or resource busy
ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
-bash: echo: write error: Device or resource busy
ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
ivanoe:/cgroups/root# cat memory.limit_in_bytes
125829120
ivanoe:/cgroups/root# cat B/memory.limit_in_bytes
62914560
ivanoe:/cgroups/root# cat A/memory.limit_in_bytes
20971520
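
For reading the byte counts in the log above, here is a small hedged sketch that looks up a key in memory.stat-style text ("name value" per line) and converts bytes to MiB. The helper names stat_lookup() and bytes_to_mib() are invented for this illustration; they are not part of any cgroup tooling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up `key` in text formatted as one "name value" pair per line.
 * Returns 1 and stores the value in *out when the key is found, else 0.
 * The key must match a whole name, so "rss" does not match "total_rss". */
static int stat_lookup(const char *text, const char *key,
                       unsigned long long *out)
{
    size_t klen;

    assert(text != NULL && key != NULL && out != NULL);
    klen = strlen(key);
    for (const char *line = text; line != NULL && *line != '\0'; ) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ')
            return sscanf(line + klen, "%llu", out) == 1;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;          /* step past the newline to the next line */
    }
    return 0;
}

/* Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes). */
static double bytes_to_mib(unsigned long long bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

With the values from the log, rss 90222592 is about 86 MiB, hierarchical_memory_limit 94371840 is 90 MiB, and the limits 125829120, 62914560 and 20971520 are 120, 60 and 20 MiB.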



-- 
Evgeniy Ivanov


* Re: Question about cgroup hierarchy and reducing memory limit
  2010-11-24 12:17   ` Evgeniy Ivanov
@ 2010-11-25  1:04     ` KAMEZAWA Hiroyuki
  2010-11-25 10:51       ` Evgeniy Ivanov
  0 siblings, 1 reply; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-25  1:04 UTC (permalink / raw)
  To: Evgeniy Ivanov; +Cc: linux-mm

Thanks.

On Wed, 24 Nov 2010 15:17:38 +0300
Evgeniy Ivanov <lolkaantimat@gmail.com> wrote:
> > What kind of error? Do you have swap? What is the kernel version?
> 
> Kernel is 2.6.31-5 from SLES-SP1 (my build, but without extra patches).
> I have 2 Gb swap and just 40 Mb used. Machine has 3 Gb RAM and no load
> (neither mem or CPU).
> 
Hmm, maybe I should look at 2.6.32.

> Error is "-bash: echo: write error: Device or resource busy", when I
> write to memory.limit_in_bytes.
> 
Ok.

> > It's designed to allow "shrink at once" but that means release memory
> > and do forced-writeback. To release memory, it may have to write back
> > to swap. If tasks in "A" and "B" are too busy and touch tons of memory
> > while shrinking, it may fail.
> 
> Well, in test I have a process which uses 30M of memory and in loop
> dirties all pages (just single byte) then sleeps 5 seconds before next
> iteration.
> 
> > It may be a regression. Kernel version is important.
> >
> > Could you show memory.stat file when you shrink "A" and "B" ?
> > And what happens
> > # sync
> > # sync
> > # sync
> > # reduce memory A
> > # reduce memory B
> 
> Sync doesn't help. Here is log just for memory.stat for group I tried to shrink:
> 
> ivanoe:/cgroups/root# cat C/memory.stat
> cache 0
> rss 90222592

Hmm, the memcg is filled with 86MB of anon pages... So all "pageout" in this
group will go to swap.

> mapped_file 0
> pgpgin 1212770
> pgpgout 1190743
> inactive_anon 45338624
> active_anon 44883968

(Off topic) IIUC, this active/inactive ratio has been modified in recent kernels,
            so swapout may behave differently there.

> inactive_file 0
> active_file 0
> unevictable 0
> hierarchical_memory_limit 94371840
> hierarchical_memsw_limit 9223372036854775807
> total_cache 0
> total_rss 90222592
> total_mapped_file 0
> total_pgpgin 1212770
> total_pgpgout 1190743
> total_inactive_anon 45338624
> total_active_anon 44883968
> total_inactive_file 0
> total_active_file 0
> total_unevictable 0
> ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
> -bash: echo: write error: Device or resource busy
> ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
> -bash: echo: write error: Device or resource busy
> ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
> -bash: echo: write error: Device or resource busy
> ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes


So, this means reducing the limit from 90M to 30M, and
a failure to swap out more than 50MB.

> ivanoe:/cgroups/root# cat memory.limit_in_bytes
> 125829120
> ivanoe:/cgroups/root# cat B/memory.limit_in_bytes
> 62914560
> ivanoe:/cgroups/root# cat A/memory.limit_in_bytes
> 20971520
> 

Ah... I have to explain this.

  (root) limited to 120MB
  (A)    limited to 20MB and is a child of (root)
  (B)    limited to 60MB and is a child of (root)
  (C)    limited to 90MB (now) and is a child of (root)

And now, you want to set the limit of (C) to 30MB.

First, the memory cgroup has 2 modes. Do you know the memory.use_hierarchy file?

If memory.use_hierarchy == 0, all cgroups under the cgroup are flat.
In the above, if root/memory.use_hierarchy == 0, A, B, C and (root) are
all independent of each other.

If memory.use_hierarchy == 1, all cgroups under the cgroup form a tree.
In the above, if root/memory.use_hierarchy == 1, A, B and C work as children
of (root), and the usage of A+B+C is limited by (root).

If you use root/memory.use_hierarchy==0, changing the limit of C doesn't affect
(root), (root/A) or (root/B). All the work will be done in C and you can set
an arbitrary limit.

Even if you use root/memory.use_hierarchy==1, changing the limit of C will not
affect (root), (root/A) or (root/B). All pageout will be done in C,
but you can't set a limit larger than that of (root).

(Off topic) If you use root/memory.use_hierarchy==1, changing the limit of (root)
will affect (A), (B) and (C). Memory is then reclaimed from (A), (B)
and (C), because (root) is the parent of all three.
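
As a rough illustration of the two modes, the sketch below computes the limit that effectively applies to a group: its own limit when the tree is flat, or the smallest limit on the path to the top when the hierarchy is enabled. This is a simplified userspace model, not the kernel code; struct group and effective_limit() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of one memory cgroup. */
struct group {
    const char *name;
    unsigned long long limit;    /* memory.limit_in_bytes */
    int use_hierarchy;           /* this group's memory.use_hierarchy */
    const struct group *parent;  /* NULL for the top group */
};

/* Return the tightest limit that applies to g. An ancestor only counts
 * while use_hierarchy is set on it; the walk stops at the first ancestor
 * where it is not, because below that point the groups are flat. */
static unsigned long long effective_limit(const struct group *g)
{
    unsigned long long min;

    assert(g != NULL);
    min = g->limit;
    while (g->parent != NULL && g->parent->use_hierarchy) {
        g = g->parent;
        if (g->limit < min)
            min = g->limit;
    }
    return min;
}
```

With (root) at 120MB and use_hierarchy == 1, a child limited to 90MB gets 90MB, which matches the hierarchical_memory_limit of 94371840 bytes in the memory.stat output quoted above, while a child configured with 200MB would still be capped at 120MB.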



So, in this case, only "C" is the problem.
And for swapout, the problem may be how slow swap is.

The logic of pageout (swapout) when shrinking is:

0. retry_count = 5
1. usage = current_usage
2. limit = new limit
3. if (usage < limit) => goto end (success)
4. try to reclaim memory
5. new_usage = current_usage
6. if (new_usage >= usage) retry_count--
7. if (retry_count < 0) goto end (-EBUSY)
8. goto 1

So whether it succeeds depends on the workload (swapin) and the speed of swapout.
It seems pagein in "C" is faster than the swapout done in each shrinking iteration.

So the reason you succeed in reducing the limit by 1MB at a time is probably that
pagein is blocked by hitting the memory limit, so shrinking the usage can succeed.

To make the success rate higher, it seems that either
 1) the memory cgroup should retry harder.
    The difficulty with this is that we have no guarantee.
or
 2) the memory cgroup should block pagein.
    The difficulty with this is that tasks may stop for too long (if the new limit is bad).

I may not be able to give you good advice about SLES.
I'll think about it and write a patch. Thank you for reporting.
I hope my patch can be backported.

Thanks,
-Kame


* Re: Question about cgroup hierarchy and reducing memory limit
  2010-11-25  1:04     ` KAMEZAWA Hiroyuki
@ 2010-11-25 10:51       ` Evgeniy Ivanov
  2010-11-29  6:58         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 9+ messages in thread
From: Evgeniy Ivanov @ 2010-11-25 10:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm

Thank you very much for the very detailed explanation.

On Thu, Nov 25, 2010 at 4:04 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Thanks.
>
> On Wed, 24 Nov 2010 15:17:38 +0300
> Evgeniy Ivanov <lolkaantimat@gmail.com> wrote:
>> > What kind of error? Do you have swap? What is the kernel version?
>>
>> Kernel is 2.6.31-5 from SLES-SP1 (my build, but without extra patches).
>> I have 2 Gb swap and just 40 Mb used. Machine has 3 Gb RAM and no load
>> (neither mem or CPU).
>>
> Hmm, maybe I should see 2.6.32.

Oh, yes. You're right, I picked the wrong version, but my 2.6.31-5
is still from Novell. I can use 2.6.32 as well (I see it in the packages).

>> Error is "-bash: echo: write error: Device or resource busy", when I
>> write to memory.limit_in_bytes.
>>
> Ok.
>
>> > It's designed to allow "shrink at once" but that means release memory
>> > and do forced-writeback. To release memory, it may have to write back
>> > to swap. If tasks in "A" and "B" are too busy and touch tons of memory
>> > while shrinking, it may fail.
>>
>> Well, in test I have a process which uses 30M of memory and in loop
>> dirties all pages (just single byte) then sleeps 5 seconds before next
>> iteration.
>>
>> > It may be a regression. Kernel version is important.
>> >
>> > Could you show memory.stat file when you shrink "A" and "B" ?
>> > And what happens
>> > # sync
>> > # sync
>> > # sync
>> > # reduce memory A
>> > # reduce memory B
>>
>> Sync doesn't help. Here is log just for memory.stat for group I tried to shrink:
>>
>> ivanoe:/cgroups/root# cat C/memory.stat
>> cache 0
>> rss 90222592
>
> Hmm, memcg is filled with 86MB of anon pages....So, all "pageout" in this
> will go swap.
>
>> mapped_file 0
>> pgpgin 1212770
>> pgpgout 1190743
>> inactive_anon 45338624
>> active_anon 44883968
>
> (Off topic) IIUC, this active/inactive ratio has been modified in recent kernel.
>            So, new swapout may do different behavior.
>
>> inactive_file 0
>> active_file 0
>> unevictable 0
>> hierarchical_memory_limit 94371840
>> hierarchical_memsw_limit 9223372036854775807
>> total_cache 0
>> total_rss 90222592
>> total_mapped_file 0
>> total_pgpgin 1212770
>> total_pgpgout 1190743
>> total_inactive_anon 45338624
>> total_active_anon 44883968
>> total_inactive_file 0
>> total_active_file 0
>> total_unevictable 0
>> ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
>> -bash: echo: write error: Device or resource busy
>> ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
>> -bash: echo: write error: Device or resource busy
>> ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
>> -bash: echo: write error: Device or resource busy
>> ivanoe:/cgroups/root# echo 30M > C/memory.limit_in_bytes
>
>
> So, this means reducing limit from 90M->30M and
> failure of 50MB swapout.
>
>> ivanoe:/cgroups/root# cat memory.limit_in_bytes
>> 125829120
>> ivanoe:/cgroups/root# cat B/memory.limit_in_bytes
>> 62914560
>> ivanoe:/cgroups/root# cat A/memory.limit_in_bytes
>> 20971520
>>
>
> Ah....I have to explain this.
>
>  (root) limited to 120MB
>  (A)    limited to 20MB and this is children of (root)
>  (B)    limited to 60MB and this is children of (root)
>  (C)    limited to 90MB(now) and this is children of (root)
>
> And now, you want to set limit of (C) to 30MB.
>
> At first, memory cgroup has 2 mode. Do you know memory.use_hierarchy file ?
>
> If memory.use_hierarchy == 0, all cgroups under the cgroup are flat.
> In above, if root/memory.use_hierarchy == 0, A and B and C and (root) are
> all independent from each other.
>
> If memory.use_hierarchy == 1, all cgroups under the cgroup are in tree.
> In above, if root/memory.use_hierarchy == 1, A and B and C works as children
> of (root) and usage of A+B+C is limited by (root).
>
> If you use root/memory.use_hierarchy==0, changing limit of C doesn't affect to
> (root) and (root/A) and (root/B). All works will be done in C and you can set
> arbitrary limit.
>
> Even if you use root/memory.use_hierarchy==1, changing limit of C will not
> affect to (root) and (root/A) and (root/B). All pageout will be done in C
> but you can't set limit larger than (root).

Thank you for the explanation. I use memory.use_hierarchy=1, and I don't want
all pageout to be done in C. That's why I was originally trying to change the
limits of A and B before adding C (which hits the same problem as changing the
limit of C).

>
> (Off topic)If you use root/memory.use_hierarchy==1, changing limit of (root)
> will affect (A) and (B) and (C). Then memory are reclaimed from (A) and (B)
> and (C) because (root) is parent of (A) and (B) and (C).
>
>
>
> So, in this case, only "C" is the problem.

Kind of; A and B are not fine either. I guess it's related to decreasing
the memory limit of any group.

> And, at swapout, it may be problem how swap is slow.
>
> The logic of pageout(swapout) at shrinking is:
>
> 0. retry_count=5
> 1. usage = current_usage
> 2. limit = new limit.
> 3. if (usage < limit) => goto end(success)
> 4. try to reclaim memory.
> 5. new_usage = current_usage
> 6. if (new_usage >= usage) retry_count--
> 7. if (retry_count < 0) goto end(-EBUSY)
>
> So, It depends on workload(swapin) and speed of swapout whether it will success.
> It seems pagein in "C" is faster than swapout of shrinking iteration.
>
> So, why you succeed to reduce limit by 1MB is maybe because pagein is blocked
> by hitting memory limit. So, shrink usage can success.

I see, that makes sense.

> To make success rate higher, it seems
>  1) memory cgroup should do harder retry
>    Difficulty with this is that we have no guarantee.
> or
>  2) memory cgroup should block pagein.
>    Difficulty with this is tasks may stop too long. (if new limit is bad.)

> I may not be able to give you good advise about SLES.
> I'll think about some and write a patch. Thank you for reporting.
> I hope my patch may be able to be backported.

That would be great, thanks!
For now we have decided either to decrease the limits step by step in a script
with a timeout, or to control the limit just through the root group.
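
The step-by-step workaround could look roughly like the following sketch. It is only an assumption of how such a helper might be written: write_limit() and shrink_stepwise() are hypothetical names, and the path passed in would be a cgroup's memory.limit_in_bytes file.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Write one limit value (in bytes) to a control file.
 * Returns 0 on success, otherwise the errno of the failed operation.
 * With stdio buffering, a kernel error such as EBUSY usually shows up
 * when fclose() flushes the data. */
static int write_limit(const char *path, unsigned long long bytes)
{
    FILE *f = fopen(path, "w");
    int err = 0;

    if (f == NULL)
        return errno;
    if (fprintf(f, "%llu\n", bytes) < 0)
        err = errno ? errno : EIO;
    if (fclose(f) != 0 && err == 0)
        err = errno ? errno : EIO;
    return err;
}

/* Lower the limit from `from` to `to` in steps of at most `step` bytes.
 * An EBUSY step is retried after delay_sec seconds, up to max_busy times
 * in a row; any other error, or too many EBUSY results, aborts. */
static int shrink_stepwise(const char *path, unsigned long long from,
                           unsigned long long to, unsigned long long step,
                           int max_busy, unsigned delay_sec)
{
    unsigned long long cur = from;
    int busy = 0;

    assert(path != NULL);
    if (step == 0)
        return EINVAL;
    while (cur > to) {
        unsigned long long next = (cur - to > step) ? cur - step : to;
        int err = write_limit(path, next);

        if (err == 0) {
            cur = next;      /* step accepted, continue from the new limit */
            busy = 0;
            continue;
        }
        if (err != EBUSY || ++busy > max_busy)
            return err;
        sleep(delay_sec);    /* give reclaim and swapout time to catch up */
    }
    return 0;
}
```

For the case in this thread, shrinking C from 90M to 30M in 1M steps would be shrink_stepwise("C/memory.limit_in_bytes", 90ULL << 20, 30ULL << 20, 1ULL << 20, 5, 1), with the retry count and delay chosen to taste.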


-- 
Evgeniy Ivanov


* Re: Question about cgroup hierarchy and reducing memory limit
  2010-11-25 10:51       ` Evgeniy Ivanov
@ 2010-11-29  6:58         ` KAMEZAWA Hiroyuki
  2010-11-29 14:02           ` Balbir Singh
  0 siblings, 1 reply; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  6:58 UTC (permalink / raw)
  To: Evgeniy Ivanov
  Cc: linux-mm, nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	gthelen

On Thu, 25 Nov 2010 13:51:06 +0300
Evgeniy Ivanov <lolkaantimat@gmail.com> wrote:

> That would be great, thanks!
> For now we decided either to use decreasing limits in script with
> timeout or controlling the limit just by root group.
> 

I wrote the patch below, but I also found that a "successful" shrinking of the limit
easily leads to an OOM kill, because we don't have wait-for-writeback logic.

For now, -EBUSY seems to act as a safeguard against OOM kill.
I'd like to wait for the merge of the dirty_ratio logic and test this again.
I hope it helps.

Thanks,
-Kame
==
When changing the limit of a memory cgroup, we see many -EBUSY errors when
 1. The cgroup is small.
 2. Some tasks are accessing pages very frequently.

That is not very convenient. This patch puts the memcg into a "shrinking" mode
while the limit is being reduced. In this mode the patch will

 a) block new allocations.
 b) ignore the page reference bit during reclaim.

The admin should know what he is doing...

Need:
 - dirty_ratio to avoid OOM.
 - Documentation update.

Note:
 - Sudden shrinking of the memory limit tends to cause OOM.
   We need the dirty_ratio patch before merging this.

Reported-by: Evgeniy Ivanov <lolkaantimat@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    6 +++++
 mm/memcontrol.c            |   48 +++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |    2 +
 3 files changed, 56 insertions(+)

Index: mmotm-1117/mm/memcontrol.c
===================================================================
--- mmotm-1117.orig/mm/memcontrol.c
+++ mmotm-1117/mm/memcontrol.c
@@ -239,6 +239,7 @@ struct mem_cgroup {
 	unsigned int	swappiness;
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
+	atomic_t	shrinking;
 
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
@@ -1814,6 +1815,25 @@ static int __cpuinit memcg_cpu_hotplug_c
 	return NOTIFY_OK;
 }
 
+static DECLARE_WAIT_QUEUE_HEAD(memcg_shrink_waitq);
+
+bool mem_cgroup_shrinking(struct mem_cgroup *mem)
+{
+	return atomic_read(&mem->shrinking) > 0;
+}
+
+void mem_cgroup_shrink_wait(struct mem_cgroup *mem)
+{
+	wait_queue_t wait;
+
+	init_wait(&wait);
+	prepare_to_wait(&memcg_shrink_waitq, &wait, TASK_INTERRUPTIBLE);
+	smp_rmb();
+	if (mem_cgroup_shrinking(mem))
+		schedule();
+	finish_wait(&memcg_shrink_waitq, &wait);
+}
+
 
 /* See __mem_cgroup_try_charge() for details */
 enum {
@@ -1832,6 +1852,17 @@ static int __mem_cgroup_do_charge(struct
 	unsigned long flags = 0;
 	int ret;
 
+	/*
+ 	 * If shrinking() == true, admin is now reducing limit of memcg and
+ 	 * reclaiming memory eagerly. This _new_ charge will increase usage and
+ 	 * prevents the system from setting new limit. We add delay here and
+ 	 * make reducing size easier.
+ 	 */
+	if (unlikely(mem_cgroup_shrinking(mem)) && (gfp_mask & __GFP_WAIT)) {
+		mem_cgroup_shrink_wait(mem);
+		return CHARGE_RETRY;
+	}
+
 	ret = res_counter_charge(&mem->res, csize, &fail_res);
 
 	if (likely(!ret)) {
@@ -1984,6 +2015,7 @@ again:
 			csize = PAGE_SIZE;
 			css_put(&mem->css);
 			mem = NULL;
+			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
 			goto again;
 		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
 			css_put(&mem->css);
@@ -2938,12 +2970,14 @@ static DEFINE_MUTEX(set_limit_mutex);
 static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 				unsigned long long val)
 {
+	struct mem_cgroup *iter;
 	int retry_count;
 	u64 memswlimit, memlimit;
 	int ret = 0;
 	int children = mem_cgroup_count_children(memcg);
 	u64 curusage, oldusage;
 	int enlarge;
+	int need_unset_shrinking = 0;
 
 	/*
 	 * For keeping hierarchical_reclaim simple, how long we should retry
@@ -2954,6 +2988,14 @@ static int mem_cgroup_resize_limit(struc
 
 	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 
+	/*
+	 * At reducing limit, new charges should be delayed.
+	 */
+	if (val < res_counter_read_u64(&memcg->res, RES_LIMIT)) {
+		need_unset_shrinking = 1;
+		for_each_mem_cgroup_tree(iter, memcg)
+			atomic_inc(&iter->shrinking);
+	}
 	enlarge = 0;
 	while (retry_count) {
 		if (signal_pending(current)) {
@@ -3001,6 +3043,12 @@ static int mem_cgroup_resize_limit(struc
 	if (!ret && enlarge)
 		memcg_oom_recover(memcg);
 
+	if (need_unset_shrinking) {
+		for_each_mem_cgroup_tree(iter, memcg)
+			atomic_dec(&iter->shrinking);
+		wake_up_all(&memcg_shrink_waitq);
+	}
+
 	return ret;
 }
 
Index: mmotm-1117/include/linux/memcontrol.h
===================================================================
--- mmotm-1117.orig/include/linux/memcontrol.h
+++ mmotm-1117/include/linux/memcontrol.h
@@ -146,6 +146,8 @@ unsigned long mem_cgroup_soft_limit_recl
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 
+bool mem_cgroup_shrinking(struct mem_cgroup *mem);
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -336,6 +338,10 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 	return 0;
 }
 
+static inline bool mem_cgroup_shrinking(struct mem_cgroup *mem)
+{
+	return false;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
Index: mmotm-1117/mm/vmscan.c
===================================================================
--- mmotm-1117.orig/mm/vmscan.c
+++ mmotm-1117/mm/vmscan.c
@@ -617,6 +617,8 @@ static enum page_references page_check_r
 	/* Lumpy reclaim - ignore references */
 	if (sc->lumpy_reclaim_mode != LUMPY_MODE_NONE)
 		return PAGEREF_RECLAIM;
+	if (!scanning_global_lru(sc) && mem_cgroup_shrinking(sc->mem_cgroup))
+		return PAGEREF_RECLAIM;
 
 	/*
 	 * Mlock lost the isolation race with us.  Let try_to_unmap()


* Re: Question about cgroup hierarchy and reducing memory limit
  2010-11-29  6:58         ` KAMEZAWA Hiroyuki
@ 2010-11-29 14:02           ` Balbir Singh
  2010-11-30  0:03             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 9+ messages in thread
From: Balbir Singh @ 2010-11-29 14:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Evgeniy Ivanov, linux-mm, nishimura@mxp.nes.nec.co.jp, gthelen

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-11-29 15:58:58]:

> On Thu, 25 Nov 2010 13:51:06 +0300
> Evgeniy Ivanov <lolkaantimat@gmail.com> wrote:
> 
> > That would be great, thanks!
> > For now we decided either to use decreasing limits in script with
> > timeout or controlling the limit just by root group.
> > 
> 
> I wrote a patch as below but I also found that "success" of shrinking limit
> means easy OOM Kill because we don't have wait-for-writeback logic.
> 
> Now, -EBUSY seems to be a safe guard logic against OOM KILL.
> I'd like to wait for the merge of dirty_ratio logic and test this again.
> I hope it helps.
> 
> Thanks,
> -Kame
> ==
> At changing limit of memory cgroup, we see many -EBUSY when
>  1. Cgroup is small.
>  2. Some tasks are accessing pages very frequently.
> 
> It's not very convenient. This patch makes memcg to be in "shrinking" mode
> when the limit is shrinking. This patch does,
> 
>  a) block new allocation.
>  b) ignore page reference bit at shrinking.
> 
> The admin should know what he does...
> 
> Need:
>  - dirty_ratio for avoid OOM.
>  - Documentation update.
> 
> Note:
>  - Sudden shrinking of memory limit tends to cause OOM.
>    We need dirty_ratio patch before merging this.
> 
> Reported-by: Evgeniy Ivanov <lolkaantimat@gmail.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |    6 +++++
>  mm/memcontrol.c            |   48 +++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c                |    2 +
>  3 files changed, 56 insertions(+)
> 
> Index: mmotm-1117/mm/memcontrol.c
> ===================================================================
> --- mmotm-1117.orig/mm/memcontrol.c
> +++ mmotm-1117/mm/memcontrol.c
> @@ -239,6 +239,7 @@ struct mem_cgroup {
>  	unsigned int	swappiness;
>  	/* OOM-Killer disable */
>  	int		oom_kill_disable;
> +	atomic_t	shrinking;
> 
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
> @@ -1814,6 +1815,25 @@ static int __cpuinit memcg_cpu_hotplug_c
>  	return NOTIFY_OK;
>  }
> 
> +static DECLARE_WAIT_QUEUE_HEAD(memcg_shrink_waitq);
> +
> +bool mem_cgroup_shrinking(struct mem_cgroup *mem)

I prefer is_mem_cgroup_shrinking

> +{
> +	return atomic_read(&mem->shrinking) > 0;
> +}
> +
> +void mem_cgroup_shrink_wait(struct mem_cgroup *mem)
> +{
> +	wait_queue_t wait;
> +
> +	init_wait(&wait);
> +	prepare_to_wait(&memcg_shrink_waitq, &wait, TASK_INTERRUPTIBLE);
> +	smp_rmb();

Why the rmb?

> +	if (mem_cgroup_shrinking(mem))
> +		schedule();

We need to check for signals if we sleep with TASK_INTERRUPTIBLE, but
that complicates the entire path as well. Maybe the question to ask
is: why is this TASK_INTERRUPTIBLE, and what is the expected delay? Could
this be a fairness issue as well?

> +	finish_wait(&memcg_shrink_waitq, &wait);
> +}
> +
> 
>  /* See __mem_cgroup_try_charge() for details */
>  enum {
> @@ -1832,6 +1852,17 @@ static int __mem_cgroup_do_charge(struct
>  	unsigned long flags = 0;
>  	int ret;
> 
> +	/*
> + 	 * If shrinking() == true, admin is now reducing limit of memcg and
> + 	 * reclaiming memory eagerly. This _new_ charge will increase usage and
> + 	 * prevents the system from setting new limit. We add delay here and
> + 	 * make reducing size easier.
> + 	 */
> +	if (unlikely(mem_cgroup_shrinking(mem)) && (gfp_mask & __GFP_WAIT)) {
> +		mem_cgroup_shrink_wait(mem);
> +		return CHARGE_RETRY;
> +	}
> +

Oh! oh! I'd hate to do this in the fault path

>  	ret = res_counter_charge(&mem->res, csize, &fail_res);
> 
>  	if (likely(!ret)) {
> @@ -1984,6 +2015,7 @@ again:
>  			csize = PAGE_SIZE;
>  			css_put(&mem->css);
>  			mem = NULL;
> +			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  			goto again;
>  		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
>  			css_put(&mem->css);
> @@ -2938,12 +2970,14 @@ static DEFINE_MUTEX(set_limit_mutex);
>  static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  				unsigned long long val)
>  {
> +	struct mem_cgroup *iter;
>  	int retry_count;
>  	u64 memswlimit, memlimit;
>  	int ret = 0;
>  	int children = mem_cgroup_count_children(memcg);
>  	u64 curusage, oldusage;
>  	int enlarge;
> +	int need_unset_shrinking = 0;
> 
>  	/*
>  	 * For keeping hierarchical_reclaim simple, how long we should retry
> @@ -2954,6 +2988,14 @@ static int mem_cgroup_resize_limit(struc
> 
>  	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> 
> +	/*
> +	 * At reducing limit, new charges should be delayed.
> +	 */
> +	if (val < res_counter_read_u64(&memcg->res, RES_LIMIT)) {
> +		need_unset_shrinking = 1;
> +		for_each_mem_cgroup_tree(iter, memcg)
> +			atomic_inc(&iter->shrinking);
> +	}
>  	enlarge = 0;
>  	while (retry_count) {
>  		if (signal_pending(current)) {
> @@ -3001,6 +3043,12 @@ static int mem_cgroup_resize_limit(struc
>  	if (!ret && enlarge)
>  		memcg_oom_recover(memcg);
> 
> +	if (need_unset_shrinking) {
> +		for_each_mem_cgroup_tree(iter, memcg)
> +			atomic_dec(&iter->shrinking);
> +		wake_up_all(&memcg_shrink_waitq);
> +	}
> +
>  	return ret;
>  }
> 
> Index: mmotm-1117/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-1117.orig/include/linux/memcontrol.h
> +++ mmotm-1117/include/linux/memcontrol.h
> @@ -146,6 +146,8 @@ unsigned long mem_cgroup_soft_limit_recl
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> 
> +bool mem_cgroup_shrinking(struct mem_cgroup *mem);
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
> 
> @@ -336,6 +338,10 @@ u64 mem_cgroup_get_limit(struct mem_cgro
>  	return 0;
>  }
> 
> +static inline bool mem_cgroup_shrinking(struct mem_cgroup *mem)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_CGROUP_MEM_CONT */
> 
>  #endif /* _LINUX_MEMCONTROL_H */
> Index: mmotm-1117/mm/vmscan.c
> ===================================================================
> --- mmotm-1117.orig/mm/vmscan.c
> +++ mmotm-1117/mm/vmscan.c
> @@ -617,6 +617,8 @@ static enum page_references page_check_r
>  	/* Lumpy reclaim - ignore references */
>  	if (sc->lumpy_reclaim_mode != LUMPY_MODE_NONE)
>  		return PAGEREF_RECLAIM;
> +	if (!scanning_global_lru(sc) && mem_cgroup_shrinking(sc->mem_cgroup))
> +		return PAGEREF_RECLAIM;
> 
>  	/*
>  	 * Mlock lost the isolation race with us.  Let try_to_unmap()
> 

-- 
	Three Cheers,
	Balbir

* Re: Question about cgroup hierarchy and reducing memory limit
  2010-11-29 14:02           ` Balbir Singh
@ 2010-11-30  0:03             ` KAMEZAWA Hiroyuki
  2010-11-30  1:23               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30  0:03 UTC (permalink / raw)
  To: balbir; +Cc: Evgeniy Ivanov, linux-mm, nishimura@mxp.nes.nec.co.jp, gthelen

On Mon, 29 Nov 2010 19:32:33 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-11-29 15:58:58]:
> 
> > On Thu, 25 Nov 2010 13:51:06 +0300
> > Evgeniy Ivanov <lolkaantimat@gmail.com> wrote:
> > 
> > > That would be great, thanks!
> > > For now we decided either to use decreasing limits in script with
> > > timeout or controlling the limit just by root group.
> > > 
> > 
> > I wrote the patch below, but I also found that successfully shrinking the
> > limit easily leads to an OOM kill because we don't have wait-for-writeback logic.
> > 
> > For now, -EBUSY seems to act as a safeguard against OOM kill.
> > I'd like to wait for the dirty_ratio logic to be merged and then test this again.
> > I hope it helps.
> > 
> > Thanks,
> > -Kame
> > ==
> > When changing the limit of a memory cgroup, we see many -EBUSY errors when
> >  1. The cgroup is small.
> >  2. Some tasks are accessing pages very frequently.
> > 
> > This is not very convenient. This patch puts the memcg into "shrinking" mode
> > while the limit is being reduced. In this mode the patch
> > 
> >  a) blocks new allocations.
> >  b) ignores the page reference bit during reclaim.
> > 
> > The admin should know what he is doing...
> > 
> > Need:
> >  - dirty_ratio to avoid OOM.
> >  - Documentation update.
> > 
> > Note:
> >  - Sudden shrinking of the memory limit tends to cause OOM.
> >    We need the dirty_ratio patch before merging this.
> > 
> > Reported-by: Evgeniy Ivanov <lolkaantimat@gmail.com>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/memcontrol.h |    6 +++++
> >  mm/memcontrol.c            |   48 +++++++++++++++++++++++++++++++++++++++++++++
> >  mm/vmscan.c                |    2 +
> >  3 files changed, 56 insertions(+)
> > 
> > Index: mmotm-1117/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-1117.orig/mm/memcontrol.c
> > +++ mmotm-1117/mm/memcontrol.c
> > @@ -239,6 +239,7 @@ struct mem_cgroup {
> >  	unsigned int	swappiness;
> >  	/* OOM-Killer disable */
> >  	int		oom_kill_disable;
> > +	atomic_t	shrinking;
> > 
> >  	/* set when res.limit == memsw.limit */
> >  	bool		memsw_is_minimum;
> > @@ -1814,6 +1815,25 @@ static int __cpuinit memcg_cpu_hotplug_c
> >  	return NOTIFY_OK;
> >  }
> > 
> > +static DECLARE_WAIT_QUEUE_HEAD(memcg_shrink_waitq);
> > +
> > +bool mem_cgroup_shrinking(struct mem_cgroup *mem)
> 
> I prefer is_mem_cgroup_shrinking
> 
Hmm, ok.

> > +{
> > +	return atomic_read(&mem->shrinking) > 0;
> > +}
> > +
> > +void mem_cgroup_shrink_wait(struct mem_cgroup *mem)
> > +{
> > +	wait_queue_t wait;
> > +
> > +	init_wait(&wait);
> > +	prepare_to_wait(&memcg_shrink_waitq, &wait, TASK_INTERRUPTIBLE);
> > +	smp_rmb();
> 
> Why the rmb?
> 
my fault.


> > +	if (mem_cgroup_shrinking(mem))
> > +		schedule();
> 
> We need to check for signals if we sleep with TASK_INTERRUPTIBLE, but
> that complicates the entire path as well. Maybe the question to ask
> is: why is this TASK_INTERRUPTIBLE, and what is the expected delay? Could
> this be a fairness issue as well?
> 
The signal check is done automatically in do_charge().


> > +	finish_wait(&memcg_shrink_waitq, &wait);
> > +}
> > +
> > 
> >  /* See __mem_cgroup_try_charge() for details */
> >  enum {
> > @@ -1832,6 +1852,17 @@ static int __mem_cgroup_do_charge(struct
> >  	unsigned long flags = 0;
> >  	int ret;
> > 
> > +	/*
> > +	 * If shrinking() == true, the admin is reducing the limit of the memcg and
> > +	 * reclaiming memory eagerly. A _new_ charge would increase usage and
> > +	 * prevent the system from setting the new limit. We add a delay here to
> > +	 * make reducing the size easier.
> > +	if (unlikely(mem_cgroup_shrinking(mem)) && (gfp_mask & __GFP_WAIT)) {
> > +		mem_cgroup_shrink_wait(mem);
> > +		return CHARGE_RETRY;
> > +	}
> > +
> 
> Oh! oh! I'd hate to do this in the fault path
> 
Why? We have a per-cpu stock now, so the influence of this is minimal.
We never hit this.
If it is a problem, I'll use a per-cpu value, but that seems to be overkill.


Thanks,
-Kame
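
For readers following the patch, here is a minimal userspace sketch of the
gating idea discussed in this message. It is not the kernel code: the
toy_* names and struct toy_group are made up for illustration, the result
codes are a simplified version of the CHARGE_* values, and the charge path
returns CHARGE_RETRY immediately instead of sleeping on a waitqueue. It
assumes a C11 compiler with <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified result codes, loosely modelled on the CHARGE_* enum in the patch. */
enum charge_result { CHARGE_OK, CHARGE_RETRY, CHARGE_NOMEM };

/* Hypothetical stand-in for struct mem_cgroup; only what the sketch needs. */
struct toy_group {
	atomic_int shrinking;	/* > 0 while a limit reduction covers this group */
	long usage;
	long limit;
};

static bool toy_group_shrinking(struct toy_group *g)
{
	return atomic_load(&g->shrinking) > 0;
}

/* Mark every group in the subtree, like the for_each_mem_cgroup_tree() loop. */
static void toy_shrink_begin(struct toy_group **tree, size_t n)
{
	assert(tree != NULL);
	for (size_t i = 0; i < n; i++)
		atomic_fetch_add(&tree[i]->shrinking, 1);
}

/* Unmark the subtree. The patch also calls wake_up_all() at this point. */
static void toy_shrink_end(struct toy_group **tree, size_t n)
{
	assert(tree != NULL);
	for (size_t i = 0; i < n; i++)
		atomic_fetch_sub(&tree[i]->shrinking, 1);
}

/* Charge path: refuse new charges while any shrink covers this group. */
static enum charge_result toy_try_charge(struct toy_group *g, long pages)
{
	if (toy_group_shrinking(g))
		return CHARGE_RETRY;	/* the patch sleeps first, then retries */
	if (g->usage + pages > g->limit)
		return CHARGE_NOMEM;
	g->usage += pages;
	return CHARGE_OK;
}
```

One thing the sketch makes visible: because "shrinking" is a counter rather
than a flag, a group that is covered by two overlapping limit reductions
(its own and an ancestor's) stays blocked until both have finished.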

* Re: Question about cgroup hierarchy and reducing memory limit
  2010-11-30  0:03             ` KAMEZAWA Hiroyuki
@ 2010-11-30  1:23               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-30  1:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, Evgeniy Ivanov, linux-mm, nishimura@mxp.nes.nec.co.jp,
	gthelen

On Tue, 30 Nov 2010 09:03:33 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > Oh! oh! I'd hate to do this in the fault path
> > 
> Why? We have a per-cpu stock now, so the influence of this is minimal.
> We never hit this.
> If it is a problem, I'll use a per-cpu value, but that seems to be overkill.

I'll remove all atomic ops. 

BTW, if you don't like the waitqueue, what is the alternative?
Is it better to keep the memory cgroup limit broken, returning -EBUSY?

Thanks,
-Kame
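
As long as a limit write can fail with -EBUSY, the workaround mentioned
earlier in this thread (lowering the limit in small steps, retrying each
step) is what remains. A rough sketch of that loop follows. Everything in
it is an assumption made for illustration: shrink_limit_stepwise(),
set_limit_fn and the fake_group backend are hypothetical, and the fake
backend is only a toy model that "reclaims" a fixed amount per attempt.
In real use the callback would write the new value to the cgroup's limit
file and a script would sleep between retries.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MIB ((uint64_t)1 << 20)

/* Callback that tries to apply a new limit; returns 0 or a negative errno. */
typedef int (*set_limit_fn)(void *ctx, uint64_t new_limit);

/*
 * Lower a limit from cur to target in chunks of at most step bytes.
 * Each chunk is attempted once and then retried up to max_retries times
 * while the backend reports -EBUSY. Returns 0 on success or the error
 * that stopped the loop; *reached holds the last limit that was accepted.
 */
static int shrink_limit_stepwise(set_limit_fn set_limit, void *ctx,
				 uint64_t cur, uint64_t target, uint64_t step,
				 int max_retries, uint64_t *reached)
{
	assert(step > 0);
	*reached = cur;
	while (cur > target) {
		uint64_t next = (cur - target > step) ? cur - step : target;
		int attempt = 0;
		int ret;

		do {
			ret = set_limit(ctx, next);
			/* a real caller would sleep here before retrying */
		} while (ret == -EBUSY && attempt++ < max_retries);
		if (ret)
			return ret;
		cur = next;
		*reached = cur;
	}
	return 0;
}

/* Toy backend: each attempt reclaims at most reclaim_per_try bytes. */
struct fake_group {
	uint64_t usage;
	uint64_t limit;
	uint64_t reclaim_per_try;
};

static int fake_set_limit(void *ctx, uint64_t new_limit)
{
	struct fake_group *g = ctx;

	if (g->usage > new_limit) {
		uint64_t excess = g->usage - new_limit;

		g->usage -= excess < g->reclaim_per_try ? excess : g->reclaim_per_try;
		if (g->usage > new_limit)
			return -EBUSY;	/* still above the requested limit */
	}
	g->limit = new_limit;
	return 0;
}
```

With this toy backend, going from 64 MiB to 32 MiB in 1 MiB steps succeeds,
while asking for the whole 32 MiB in one step runs out of retries and
returns -EBUSY, which matches the behaviour reported at the top of the
thread.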

end of thread, other threads:[~2010-11-30  1:29 UTC | newest]

Thread overview: 9+ messages
2010-11-22 16:59 Question about cgroup hierarchy and reducing memory limit Evgeniy Ivanov
2010-11-24  0:47 ` KAMEZAWA Hiroyuki
2010-11-24 12:17   ` Evgeniy Ivanov
2010-11-25  1:04     ` KAMEZAWA Hiroyuki
2010-11-25 10:51       ` Evgeniy Ivanov
2010-11-29  6:58         ` KAMEZAWA Hiroyuki
2010-11-29 14:02           ` Balbir Singh
2010-11-30  0:03             ` KAMEZAWA Hiroyuki
2010-11-30  1:23               ` KAMEZAWA Hiroyuki
