* [LSF/MM TOPIC] memcg topics.

From: KAMEZAWA Hiroyuki
Date: 2012-02-01  0:55 UTC
To: lsf-pc
Cc: linux-mm, hannes@cmpxchg.org, Michal Hocko, bsingharora@gmail.com,
    Hugh Dickins, Ying Han, Mel Gorman

Hi, I guess we have some topics on memory cgroups.

1-4 : someone has an implementation
5   : no implementation

1. page_cgroup diet
   The memory cgroup uses 'struct page_cgroup', which used to cost 40 bytes
   per 4096-byte page. Johannes removed ->page and ->lru from page_cgroup,
   so now sizeof(page_cgroup) == 16. I'm working on removing ->flags to
   make sizeof(page_cgroup) == 8.

   Then, finally, can page_cgroup be moved into struct page on 64-bit
   systems? What about 32-bit systems?

2. memory reclaim
   Johannes, Michal, Ying, and others are now working on the memory reclaim
   problem with the new LRU. Under it, the LRU is per-memcg-per-zone.
   The following topics are being discussed now:

   - simplification/re-implementation of the soft limit
   - isolation of workloads (by soft limit)
   - when we should stop memory reclaim, especially under direct reclaim
     (now we scan the whole zonelist)

3. per-memcg per-zone lru lock
   I hear Hugh Dickins has some patches and is testing them.
   It will be good to discuss the pros and cons and any implementation
   issues.

4. dirty ratio
   Last year, patches were posted but not merged. I'd like to hear about
   work in this area.

5. accounting other than user pages
   Last year, tcp buffer limiting was added to memcg.
   If someone has other plans, I'd like to hear them.
   I myself don't think 'generic kernel memory limitation' is a good
   thing... admins can't predict performance.

   Can we make accounting of dentries/inodes into memcg and call
   shrink_slab()? But I guess a per-zone shrink_slab() should go first...

   More?

x. per-memcg kswapd
   This is good for reducing the direct-reclaim latency of a memcg('s
   limit). But (my) patch is not updated now, so this will be off-topic
   this year.
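For reference, the size arithmetic in topic 1 works out as follows. This is
a standalone sketch on a 64-bit system; the historical field layout is
assumed from the discussion rather than taken from a specific kernel tree:

#include <stdio.h>

struct mem_cgroup;	/* opaque; only pointers are stored */
struct page;
struct list_head { struct list_head *next, *prev; };	/* 16 bytes */

/* ~40 bytes of overhead per 4096-byte page, as in older kernels */
struct page_cgroup_old {
	unsigned long flags;
	struct mem_cgroup *mem_cgroup;
	struct page *page;
	struct list_head lru;
};

/* after ->page and ->lru were removed */
struct page_cgroup_now {
	unsigned long flags;
	struct mem_cgroup *mem_cgroup;
};

/* the goal: fold ->flags away so only one pointer remains, which
 * could then live directly in struct page on 64-bit */
struct page_cgroup_goal {
	struct mem_cgroup *mem_cgroup;
};

int main(void)
{
	printf("old:  %zu bytes\n", sizeof(struct page_cgroup_old));	/* 40 */
	printf("now:  %zu bytes\n", sizeof(struct page_cgroup_now));	/* 16 */
	printf("goal: %zu bytes\n", sizeof(struct page_cgroup_goal));	/*  8 */
	return 0;
}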
* Re: [LSF/MM TOPIC] memcg topics.

From: Glauber Costa
Date: 2012-02-01  8:58 UTC
To: KAMEZAWA Hiroyuki
Cc: lsf-pc, linux-mm, hannes@cmpxchg.org, Michal Hocko,
    bsingharora@gmail.com, Hugh Dickins, Ying Han, Mel Gorman

On 02/01/2012 04:55 AM, KAMEZAWA Hiroyuki wrote:
> Hi, I guess we have some topics on memory cgroups.
[...]
> 5. accounting other than user pages
>    Last year, tcp buffer limiting was added to memcg.

I was about to correct you about "last year", when suddenly my mind went
"oh god, this is 2012!"

>    If someone has other plans, I'd like to hear them.
>    I myself don't think 'generic kernel memory limitation' is a good
>    thing... admins can't predict performance.
>
>    Can we make accounting of dentries/inodes into memcg and call
>    shrink_slab()? But I guess a per-zone shrink_slab() should go first...

Well, I have work in progress to continue that. There are a couple of slabs
I'd like to track. I am convinced that a generic framework is a good thing,
but indeed, I am still not sure whether a generic interface is.

The advantage of keeping it unified is that it prevents the number of knobs
from exploding. For us, this is not much of a problem, because there are
only a couple we are interested in. The dcache and inode caches are an
example of that: when we sent out some proposals (that didn't use memcg),
some people wanted to see the inode cache, not the dcache, being tracked.
We disagreed. Yet the truth remains that only *one* of them needs to be
tracked, because they live in close relation to each other. So if we manage
to find a couple of slabs that are key to that, we can limit only those.

Well, that was food for thought only. I do think this is a nice topic.

Also, there is no serious implementation for that, as you mentioned, but a
series of patches was sent out for appreciation last year. So there is at
least a basis for starting.
* Re: [LSF/MM TOPIC][ATTEND] memcg topics.

From: Glauber Costa
Date: 2012-02-02 11:33 UTC
To: KAMEZAWA Hiroyuki
Cc: lsf-pc, linux-mm, hannes@cmpxchg.org, Michal Hocko,
    bsingharora@gmail.com, Hugh Dickins, Ying Han, Mel Gorman

On 02/01/2012 12:58 PM, Glauber Costa wrote:
[...]
> Also, there is no serious implementation for that, as you mentioned, but
> a series of patches was sent out for appreciation last year. So there is
> at least a basis for starting.

Forgot to add [ATTEND] to the subject. I'd like to attend to discuss that.
* Re: [LSF/MM TOPIC] memcg topics.

From: Greg Thelen
Date: 2012-02-01 20:24 UTC
To: KAMEZAWA Hiroyuki
Cc: lsf-pc, linux-mm, hannes@cmpxchg.org, Michal Hocko,
    bsingharora@gmail.com, Hugh Dickins, Ying Han, Mel Gorman, Wu Fengguang

On Tue, Jan 31, 2012 at 4:55 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 4. dirty ratio
>    Last year, patches were posted but not merged. I'd like to hear about
>    work in this area.

I would like to attend to discuss this topic. I have not had much time to
work on this recently, but should be able to focus more on it soon. The
IO-less writeback changes require some redesign and may allow for a simpler
implementation of mem_cgroup_balance_dirty_pages(). Maintaining
per-container dirty page counts, ratios, and limits is fairly easy, but
integration with writeback is the challenge. My big questions are for the
writeback people:

1. How to compute a per-container pause based on bdi bandwidth and cgroup
   dirty page usage?
2. How to ensure that writeback will engage even if the system and bdi are
   below their respective background dirty ratios, yet a memcg is above its
   background dirty limit?
* Re: [LSF/MM TOPIC] memcg topics.

From: Wu Fengguang
Date: 2012-02-02  6:33 UTC
To: Greg Thelen
Cc: KAMEZAWA Hiroyuki, lsf-pc, linux-mm, hannes@cmpxchg.org, Michal Hocko,
    bsingharora@gmail.com, Hugh Dickins, Ying Han, Mel Gorman

Hi Greg,

On Wed, Feb 01, 2012 at 12:24:25PM -0800, Greg Thelen wrote:
> 1. How to compute a per-container pause based on bdi bandwidth and cgroup
>    dirty page usage?
> 2. How to ensure that writeback will engage even if the system and bdi
>    are below their respective background dirty ratios, yet a memcg is
>    above its background dirty limit?

The solution to (1,2) would be something like this:

--- linux-next.orig/mm/page-writeback.c	2012-02-02 14:13:45.000000000 +0800
+++ linux-next/mm/page-writeback.c	2012-02-02 14:24:11.000000000 +0800
@@ -654,6 +654,17 @@ static unsigned long bdi_position_ratio(
 	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
 	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
 
+	if (memcg) {
+		long long f;
+		x = div_s64((memcg_setpoint - memcg_dirty) << RATELIMIT_CALC_SHIFT,
+			    memcg_limit - memcg_setpoint + 1);
+		f = x;
+		f = f * x >> RATELIMIT_CALC_SHIFT;
+		f = f * x >> RATELIMIT_CALC_SHIFT;
+		f += 1 << RATELIMIT_CALC_SHIFT;
+		pos_ratio = pos_ratio * f >> RATELIMIT_CALC_SHIFT;
+	}
+
 	/*
 	 * We have computed basic pos_ratio above based on global situation. If
 	 * the bdi is over/under its share of dirty pages, we want to scale
@@ -1202,6 +1213,8 @@ static void balance_dirty_pages(struct a
 		freerun = dirty_freerun_ceiling(dirty_thresh,
 						background_thresh);
 		if (nr_dirty <= freerun) {
+			if (memcg && memcg_dirty > memcg_freerun)
+				goto start_writeback;
 			current->dirty_paused_when = now;
 			current->nr_dirtied = 0;
 			current->nr_dirtied_pause =
@@ -1209,6 +1222,7 @@ static void balance_dirty_pages(struct a
 			break;
 		}
 
+start_writeback:
 		if (unlikely(!writeback_in_progress(bdi)))
 			bdi_start_background_writeback(bdi);

That makes a minimal change to enforce the per-memcg dirty ratio. It could
result in a less stable control system, but should still be able to balance
things out.

Thanks,
Fengguang
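To see what the cubic memcg factor in that patch does, here is a small
standalone userspace demo of the same fixed-point arithmetic. The numbers
are purely illustrative; the kernel patch uses div_s64() and operates on
real page counts, and memcg_setpoint/memcg_limit are the hypothetical
per-memcg counterparts of the global control parameters:

#include <stdio.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10

/* f(dirty) = 1 + x^3, where x = (setpoint - dirty) / (limit - setpoint).
 * f == 1 at the setpoint, ~0 at the limit, and > 1 below the setpoint,
 * so pos_ratio is scaled down as a memcg approaches its dirty limit. */
static int64_t memcg_factor(int64_t dirty, int64_t setpoint, int64_t limit)
{
	int64_t x, f;

	x = (setpoint - dirty) * (1 << RATELIMIT_CALC_SHIFT) /
	    (limit - setpoint + 1);
	f = x;
	f = f * x >> RATELIMIT_CALC_SHIFT;	/* x^2 */
	f = f * x >> RATELIMIT_CALC_SHIFT;	/* x^3 */
	f += 1 << RATELIMIT_CALC_SHIFT;		/* 1 + x^3 */
	return f;
}

int main(void)
{
	int64_t limit = 12800;		/* 50MB in 4K pages */
	int64_t setpoint = 9600;	/* assume setpoint at 3/4 of limit */
	int64_t dirty;

	for (dirty = 0; dirty <= limit; dirty += 1600)
		printf("dirty=%5lld pages  factor=%.3f\n", (long long)dirty,
		       (double)memcg_factor(dirty, setpoint, limit) /
		       (1 << RATELIMIT_CALC_SHIFT));
	return 0;
}

Running this shows the factor falling smoothly from ~28 at dirty=0 through
1.0 at the setpoint to ~0 at the limit, which is how the throttle brakes
harder as the memcg fills up.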
* Re: [LSF/MM TOPIC] memcg topics.

From: Greg Thelen
Date: 2012-02-02  7:34 UTC
To: Wu Fengguang
Cc: KAMEZAWA Hiroyuki, lsf-pc, linux-mm, hannes@cmpxchg.org, Michal Hocko,
    bsingharora@gmail.com, Hugh Dickins, Ying Han, Mel Gorman

On Wed, Feb 1, 2012 at 10:33 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Wed, Feb 01, 2012 at 12:24:25PM -0800, Greg Thelen wrote:
> > 1. How to compute a per-container pause based on bdi bandwidth and
> >    cgroup dirty page usage?
> > 2. How to ensure that writeback will engage even if the system and bdi
> >    are below their respective background dirty ratios, yet a memcg is
> >    above its background dirty limit?
>
> The solution to (1,2) would be something like this:
>
> [patch snipped]
>
> That makes a minimal change to enforce the per-memcg dirty ratio. It
> could result in a less stable control system, but should still be able
> to balance things out.

Thank you for the quick patch. It looks promising. I can imagine how this
would wake up background writeback, but I am unsure how background
writeback will then do anything: it seems like over_bground_thresh() would
not necessarily see system or bdi dirty usage over the respective limits.
Previously posted memcg writeback patches handled this with an
fs-writeback.c call to mem_cgroups_over_bground_dirty_thresh() to check for
memcg dirty limit compliance. Do you think we still need such a call out to
memcg from writeback?
* Re: [LSF/MM TOPIC] memcg topics.

From: Wu Fengguang
Date: 2012-02-02  7:54 UTC
To: Greg Thelen
Cc: KAMEZAWA Hiroyuki, lsf-pc, linux-mm, hannes@cmpxchg.org, Michal Hocko,
    bsingharora@gmail.com, Hugh Dickins, Ying Han, Mel Gorman

On Wed, Feb 01, 2012 at 11:34:36PM -0800, Greg Thelen wrote:
[...]
> Thank you for the quick patch. It looks promising. I can imagine how this
> would wake up background writeback, but I am unsure how background
> writeback will then do anything: it seems like over_bground_thresh()
> would not necessarily see system or bdi dirty usage over the respective
> limits. Previously posted memcg writeback patches handled this with an
> fs-writeback.c call to mem_cgroups_over_bground_dirty_thresh() to check
> for memcg dirty limit compliance. Do you think we still need such a call
> out to memcg from writeback?

Yeah, I forgot over_bground_thresh()... Obviously it needs to be memcg
aware, too.

Thanks,
Fengguang
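Concretely, a memcg-aware check might look like the sketch below. The first
two tests mirror the fs/fs-writeback.c over_bground_thresh() of that era;
the final call is the unmerged helper from Greg's earlier patch series,
assumed here rather than an existing API:

/* Sketch only: keep background writeback running while any memcg is
 * above its background dirty threshold, even though the global and
 * bdi numbers look fine. */
static inline bool over_bground_thresh(struct backing_dev_info *bdi)
{
	unsigned long background_thresh, dirty_thresh;

	global_dirty_limits(&background_thresh, &dirty_thresh);

	if (global_page_state(NR_FILE_DIRTY) +
	    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
		return true;

	if (bdi_stat(bdi, BDI_RECLAIMABLE) >
	    bdi_dirty_limit(bdi, background_thresh))
		return true;

	/* new: is any cgroup above its own background limit? */
	return mem_cgroups_over_bground_dirty_thresh();
}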
* Re: [LSF/MM TOPIC] memcg topics.

From: Wu Fengguang
Date: 2012-02-02  7:52 UTC
To: Greg Thelen
Cc: KAMEZAWA Hiroyuki, lsf-pc, linux-mm, hannes@cmpxchg.org, Michal Hocko,
    bsingharora@gmail.com, Hugh Dickins, Ying Han, Mel Gorman

On Thu, Feb 02, 2012 at 02:33:45PM +0800, Wu Fengguang wrote:
[...]
> The solution to (1,2) would be something like this:
>
> [patch snipped]
>
> That makes a minimal change to enforce the per-memcg dirty ratio. It
> could result in a less stable control system, but should still be able
> to balance things out.

Unfortunately, memcg partitioning could fundamentally make the dirty
throttling more bumpy.

Imagine 10 memcgs, each with

- memcg_dirty_limit=50MB
- 1 dd dirty task

The flusher thread will work on the 10 inodes in turn, each time grabbing
the next inode and taking ~0.5s to write ~50MB of its dirty pages to disk.
So each inode will be flushed every ~5s.

Without a memcg dirty ratio, the dd tasks will be throttled quite smoothly.
However, with memcg, each memcg will be limited to 50MB of dirty pages, and
the dirty number will drop quickly from 50MB to 0 every 5 seconds.

As a result, the small partitions of dirty pages will transmit the
flusher's bumpy writeout (which is necessary for performance) to the dd
tasks' bumpy progress. The dd tasks will be blocked for seconds from time
to time.

So I cannot help thinking: can the problem be canceled at the root? The
basic scheme could be: when reclaiming from a memcg zone, if any
PG_writeback/PG_dirty pages are encountered, mark them PG_reclaim, move
them to the global zone, and de-account them from the memcg.

In this way, we can avoid dirty/writeback pages hurting the (possibly
small) memcg zones. The aggressive dirtier tasks will be throttled by the
global 20% limit, and the memcg page reclaims can go on smoothly.

Thanks,
Fengguang
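The scaling of this bumpiness can be put into numbers with a trivial model.
The 100MB/s disk bandwidth is an assumption chosen to match the thread's
"~0.5s per ~50MB chunk" figure:

#include <stdio.h>

int main(void)
{
	double bdi_bw = 100.0;	/* MB/s, assumed disk write bandwidth */
	double chunk = 50.0;	/* MB flushed per inode per flusher visit */
	int n;

	/* N memcgs, one dirty inode each, flusher round-robins inodes */
	for (n = 10; n <= 1000; n *= 10) {
		double visit = chunk / bdi_bw;	/* ~0.5s per inode */
		double period = n * visit;	/* seconds between flushes */
		printf("%4d memcgs: each 50MB memcg drains 50MB->0 "
		       "every %5.1fs\n", n, period);
	}
	return 0;
}

With 10 memcgs this gives the 5-second cycle above; with 100 or 1000 it
gives the 50- and 500-second stalls mentioned later in the thread.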
* Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.

From: Jan Kara
Date: 2012-02-02 10:39 UTC
To: Wu Fengguang
Cc: Greg Thelen, bsingharora@gmail.com, Hugh Dickins, Michal Hocko,
    linux-mm, Mel Gorman, Ying Han, hannes@cmpxchg.org, lsf-pc,
    KAMEZAWA Hiroyuki

On Thu 02-02-12 15:52:34, Wu Fengguang wrote:
[...]
> As a result, the small partitions of dirty pages will transmit the
> flusher's bumpy writeout (which is necessary for performance) to the dd
> tasks' bumpy progress. The dd tasks will be blocked for seconds from time
> to time.
>
> So I cannot help thinking: can the problem be canceled at the root? The
> basic scheme could be: when reclaiming from a memcg zone, if any
> PG_writeback/PG_dirty pages are encountered, mark them PG_reclaim, move
> them to the global zone, and de-account them from the memcg.
>
> In this way, we can avoid dirty/writeback pages hurting the (possibly
> small) memcg zones. The aggressive dirtier tasks will be throttled by the
> global 20% limit, and the memcg page reclaims can go on smoothly.

If I remember Google's use case right, their ultimate goal is to partition
the machine so that processes in memcg A get, say, 1/4 of the available
disk bandwidth and processes in memcg B get 1/2 of the disk bandwidth.

Now, you can do the bandwidth limiting in CFQ, but it doesn't really work
for buffered writes because these are done by the flusher thread, ignoring
any memcg boundaries. So they introduce knowledge of memcgs into the
flusher thread so that writeback done by the flusher thread reflects the
configured proportions.

But then the result is that processes in memcg A will simply accumulate
more dirty pages because writeback is slower for them. So that's why you
want to stop dirtying processes in that memcg when they reach their
dirty limit. All in all, I believe more bumpy writeback / lower throughput
(you can choose between these two) is unavoidable for this use case. But
OTOH I'm not sure how big a problem this will be in practice, because
machines should be big enough that even after partitioning you get a
reasonably sized machine...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.

From: Wu Fengguang
Date: 2012-02-02 11:04 UTC
To: Jan Kara
Cc: Greg Thelen, bsingharora@gmail.com, Hugh Dickins, Michal Hocko,
    linux-mm, Mel Gorman, Ying Han, hannes@cmpxchg.org, lsf-pc,
    KAMEZAWA Hiroyuki

On Thu, Feb 02, 2012 at 11:39:53AM +0100, Jan Kara wrote:
[...]
> If I remember Google's use case right, their ultimate goal is to
> partition the machine so that processes in memcg A get, say, 1/4 of the
> available disk bandwidth and processes in memcg B get 1/2 of the disk
> bandwidth.
>
> Now, you can do the bandwidth limiting in CFQ, but it doesn't really work
> for buffered writes because these are done by the flusher thread,
> ignoring any memcg boundaries. So they introduce knowledge of memcgs into
> the flusher thread so that writeback done by the flusher thread reflects
> the configured proportions.

Actually, the dirty rate can be controlled independently of the dirty
pages:

blk-cgroup: async write IO controller
https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d

> But then the result is that processes in memcg A will simply accumulate
> more dirty pages because writeback is slower for them. So that's why you
> want to stop dirtying processes in that memcg when they reach their

The bandwidth control alone will be pretty smooth, not suffering from the
partition problem. And it doesn't need to alter the flusher's behavior
(like making it focus on some inodes) and hence won't impact performance.

If memcg A's dirty rate is throttled, its dirty pages will naturally
shrink. The flusher will automatically work less on A's dirty pages.

> dirty limit. All in all, I believe more bumpy writeback / lower
> throughput (you can choose between these two) is unavoidable for this use
> case. But OTOH I'm not sure how big a problem this will be in practice,
> because machines should be big enough that even after partitioning you
> get a reasonably sized machine...

The end user may expect big machines to handle 100 or even 1000 memcgs, so
if each memcg corresponds to 1 dd, 1 dirty inode, and a 50MB dirty limit,
each inode will wait 50 or 500 seconds to be flushed once. The stall time
will then go up to dozens/hundreds of seconds... The partition scheme
simply won't scale...

Thanks,
Fengguang
* Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.

From: Jan Kara
Date: 2012-02-02 15:42 UTC
To: Wu Fengguang
Cc: Jan Kara, Greg Thelen, bsingharora@gmail.com, Hugh Dickins,
    Michal Hocko, linux-mm, Mel Gorman, Ying Han, hannes@cmpxchg.org,
    lsf-pc, KAMEZAWA Hiroyuki

On Thu 02-02-12 19:04:34, Wu Fengguang wrote:
[...]
> Actually, the dirty rate can be controlled independently of the dirty
> pages:
>
> blk-cgroup: async write IO controller
> https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d
>
> The bandwidth control alone will be pretty smooth, not suffering from the
> partition problem. And it doesn't need to alter the flusher's behavior
> (like making it focus on some inodes) and hence won't impact performance.
>
> If memcg A's dirty rate is throttled, its dirty pages will naturally
> shrink. The flusher will automatically work less on A's dirty pages.

I'm not sure about the details of the requirements the Google guys have, so
this may or may not be good enough for them. I'd suspect they still
wouldn't want one cgroup to fill up the available page cache with dirty
pages, so just limiting bandwidth won't be enough for them. Also, limiting
dirty bandwidth has the problem that it's not coupled with how much reading
the particular cgroup does. Anyway, until we are sure about their exact
requirements, this is mostly philosophical talking ;).

> The end user may expect big machines to handle 100 or even 1000 memcgs,
> so if each memcg corresponds to 1 dd, 1 dirty inode, and a 50MB dirty
> limit, each inode will wait 50 or 500 seconds to be flushed once. The
> stall time will then go up to dozens/hundreds of seconds... The partition
> scheme simply won't scale...

There is always a reasonable load / reasonable partitioning for the
machine, and if you partition the machine too much, results will be
suboptimal. It's like running 100 / 1000 KVM guests on the server...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.

From: Wu Fengguang
Date: 2012-02-03  1:26 UTC
To: Jan Kara
Cc: Greg Thelen, bsingharora@gmail.com, Hugh Dickins, Michal Hocko,
    linux-mm, Mel Gorman, Ying Han, hannes@cmpxchg.org, lsf-pc,
    KAMEZAWA Hiroyuki

On Thu, Feb 02, 2012 at 04:42:09PM +0100, Jan Kara wrote:
[...]
> > If memcg A's dirty rate is throttled, its dirty pages will naturally
> > shrink. The flusher will automatically work less on A's dirty pages.
>
> I'm not sure about the details of the requirements the Google guys have,
> so this may or may not be good enough for them. I'd suspect they still
> wouldn't want one cgroup to fill up the available page cache with dirty
> pages, so just limiting bandwidth won't be enough for them. Also,
> limiting dirty bandwidth has the problem that it's not coupled with how
> much reading the particular cgroup does. Anyway, until we are sure about
> their exact requirements, this is mostly philosophical talking ;).

Yeah, I'm not sure exactly what Google needs and how big a problem the
partitioning will be for them. Basically,

- when there are N memcgs each dirtying 1 file, each file will be flushed
  every (N * 0.5) seconds, where 0.5s is the typical time the flusher
  spends on one inode

- if (memcg_dirty_limit > 10 * bdi_bandwidth), the dd tasks should be able
  to progress reasonably smoothly

Thanks,
Fengguang
* Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.

From: Greg Thelen
Date: 2012-02-03  6:21 UTC
To: Wu Fengguang
Cc: Jan Kara, bsingharora@gmail.com, Hugh Dickins, Michal Hocko, linux-mm,
    Mel Gorman, Ying Han, hannes@cmpxchg.org, lsf-pc, KAMEZAWA Hiroyuki

On Thu, Feb 2, 2012 at 5:26 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> Yeah, I'm not sure exactly what Google needs and how big a problem the
> partitioning will be for them. Basically,
>
> - when there are N memcgs each dirtying 1 file, each file will be flushed
>   every (N * 0.5) seconds, where 0.5s is the typical time the flusher
>   spends on one inode
>
> - if (memcg_dirty_limit > 10 * bdi_bandwidth), the dd tasks should be
>   able to progress reasonably smoothly

I am looking for a solution that partitions memory and, ideally, disk
bandwidth. This is a large undertaking, and I am willing to start small and
grow into a more sophisticated solution (if needed). One important goal is
to enforce per-container memory limits; this includes dirty and clean page
cache. Moving memcg dirty pages to root is probably not going to work
because it would not allow for control of job memory usage. My hunch is
that we will thus need per-memcg dirty counters, limits, and some writeback
changes. Perhaps the initial writeback changes would be small: enough to
ensure that writeback continues writing until it services any over-limit
cgroups. This is complicated by the fact that a memcg can have dirty memory
spread over different bdis. If blk bandwidth throttling is sufficient here,
then let me know, because it sounds easier ;)

Here is an example of a memcg OOM seen on a 3.3 kernel:

# mkdir /dev/cgroup/memory/x
# echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
# echo $$ > /dev/cgroup/memory/x/tasks
# dd if=/dev/zero of=/data/f1 bs=1k count=1M &
# dd if=/dev/zero of=/data/f2 bs=1k count=1M &
# wait
[1]-  Killed        dd if=/dev/zero of=/data/f1 bs=1k count=1M
[2]+  Killed        dd if=/dev/zero of=/data/f2 bs=1k count=1M

This is caused by direct reclaim not being able to reliably reclaim (write)
dirty page cache pages.
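For context, this is roughly the code path that refuses to write those
dirty pages: a simplified sketch of the dirty-page handling in
shrink_page_list() (mm/vmscan.c of that era), not a verbatim quote:

	if (PageDirty(page)) {
		/*
		 * Only kswapd may write filesystem pages back from
		 * reclaim (to avoid stack overflows); memcg direct
		 * reclaim just tags the page and skips it, so a memcg
		 * full of dirty pages has nothing it can free until
		 * the flusher catches up -- hence the OOM kills above.
		 */
		if (page_is_file_cache(page) && !current_is_kswapd()) {
			SetPageReclaim(page);
			goto keep_locked;	/* try the next page */
		}
		/* kswapd may proceed to pageout() here */
	}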
* Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.
  2012-02-03  6:21 ` Greg Thelen
@ 2012-02-03  9:40 ` Wu Fengguang
  0 siblings, 0 replies; 16+ messages in thread
From: Wu Fengguang @ 2012-02-03  9:40 UTC (permalink / raw)
To: Greg Thelen
Cc: Jan Kara, bsingharora@gmail.com, Hugh Dickins, Michal Hocko,
    linux-mm, Mel Gorman, Ying Han, hannes@cmpxchg.org, lsf-pc,
    KAMEZAWA Hiroyuki

Greg,

On Thu, Feb 02, 2012 at 10:21:53PM -0800, Greg Thelen wrote:
> I am looking for a solution that partitions memory and ideally disk
> bandwidth. This is a large undertaking, and I am willing to start small
> and grow into a more sophisticated solution (if needed). One important
> goal is to enforce per-container memory limits - this includes dirty
> and clean page cache. Moving memcg dirty pages to root is probably not
> going to work because it would not allow for control of job memory
> usage.

If we reserve 20% of global memory for dirty/writeback pages from the
memcg allocations, it will do the trick: each job will use at most its
memcg limit, plus some share of the 20% dirty limit.

Since the moved pages are marked PG_reclaim and hence will be freed
quickly after becoming clean, it's guaranteed that the dirty pages
moved out of the memcgs won't exceed the 20% global dirty limit at any
time.

So it would be some kind of per-job memcg container plus a globally
shared 20% dirty-pages container. The job pages won't leak any further
and become uncontrollable.

But if this does not fit nicely into Google's usage model, I'm fine
with adding per-memcg dirty limits, bearing in mind that per-memcg
dirty limits won't be able to work fluently if they are not large
enough. We can do some experiments on that once we get the minimal
patch ready.

> My hunch is that we will thus need per-memcg dirty counters, limits,
> and some writeback changes. Perhaps the initial writeback changes
> would be small: enough to ensure that writeback continues writing
> until it services any over-limit cgroups.

Yeah, that's a good plan.

> This is complicated by the fact that a memcg can have dirty memory
> spread across different bdis.

That sure sounds complicated. The other problem is that the pos_ratio
values will no longer be roughly equal for all the tasks writing to the
same bdi, making the bdi dirty_ratelimit less stable. Again, we can
experiment with how well the control system behaves.

> If blk bandwidth throttling is sufficient here, then let me know,
> because it sounds easier ;)

I'd love to say so; however, bandwidth throttling is obviously not the
right solution to the example below ;)

> Here is an example of a memcg OOM seen on a 3.3 kernel:
>   # mkdir /dev/cgroup/memory/x
>   # echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
>   # echo $$ > /dev/cgroup/memory/x/tasks
>   # dd if=/dev/zero of=/data/f1 bs=1k count=1M &
>   # dd if=/dev/zero of=/data/f2 bs=1k count=1M &
>   # wait
>   [1]-  Killed       dd if=/dev/zero of=/data/f1 bs=1k count=1M
>   [2]+  Killed       dd if=/dev/zero of=/data/f2 bs=1k count=1M
>
> This is caused by direct reclaim not being able to reliably reclaim
> (write) dirty page cache pages.

If we move dirty pages out of the memcg to the 20% global dirty pages
pool on page reclaim, the above OOM can be avoided. It does change the
meaning of memory.limit_in_bytes in that the memcg tasks can now
actually consume more pages (up to the shared global 20% dirty limit).

Thanks,
Fengguang
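Worked numbers for the shared-pool idea above. The 16 GB machine size
is an invented assumption, and the 100 MB memcg limit is carried over
from the OOM example, not from Fengguang's mail.

#include <stdio.h>

int main(void)
{
	long total_mb = 16 * 1024;			/* assumed machine size */
	long dirty_pool_mb = total_mb * 20 / 100;	/* shared 20% dirty pool */
	long memcg_limit_mb = 100;			/* limit from the OOM example */

	printf("shared global dirty pool: %ld MB\n", dirty_pool_mb);
	printf("one job's worst-case footprint: %ld MB (memcg limit) "
	       "+ up to %ld MB of the shared pool\n",
	       memcg_limit_mb, dirty_pool_mb);
	return 0;
}

The point of the arithmetic is the trade-off Fengguang concedes: the
job stays OOM-safe, but its effective ceiling is no longer just
memory.limit_in_bytes.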
* Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.
  2012-02-02  6:33 ` Wu Fengguang
  2012-02-02  7:34 ` Greg Thelen
  2012-02-02  7:52 ` Wu Fengguang
@ 2012-02-02 10:15 ` Jan Kara
  2012-02-02 11:31 ` Wu Fengguang
  2 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2012-02-02 10:15 UTC (permalink / raw)
To: Wu Fengguang
Cc: Greg Thelen, bsingharora@gmail.com, Hugh Dickins, Michal Hocko,
    linux-mm, Mel Gorman, Ying Han, hannes@cmpxchg.org, lsf-pc,
    KAMEZAWA Hiroyuki

On Thu 02-02-12 14:33:45, Wu Fengguang wrote:
> Hi Greg,
>
> On Wed, Feb 01, 2012 at 12:24:25PM -0800, Greg Thelen wrote:
> > On Tue, Jan 31, 2012 at 4:55 PM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 4. dirty ratio
> > >    In the last year, patches were posted but not merged. I'd like
> > >    to hear about work in this area.
> >
> > I would like to attend to discuss this topic. I have not had much time
> > to work on this recently, but should be able to focus more on this
> > soon. The IO-less writeback changes require some redesign and may
> > allow for a simpler implementation of
> > mem_cgroup_balance_dirty_pages(). Maintaining per-container dirty page
> > counts, ratios, and limits is fairly easy, but integration with
> > writeback is the challenge. My big questions are for writeback people:
> > 1. how to compute the per-container pause based on bdi bandwidth and
> >    cgroup dirty page usage.
> > 2. how to ensure that writeback will engage even if the system and bdi
> >    are below their respective background dirty ratios, yet a memcg is
> >    above its background dirty limit.
>
> The solution to (1, 2) would be something like this:
>
> --- linux-next.orig/mm/page-writeback.c	2012-02-02 14:13:45.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2012-02-02 14:24:11.000000000 +0800
> @@ -654,6 +654,17 @@ static unsigned long bdi_position_ratio(
>  	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
>  	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
>
> +	if (memcg) {
> +		long long f;
> +		x = div_s64((memcg_setpoint - memcg_dirty) << RATELIMIT_CALC_SHIFT,
> +			    memcg_limit - memcg_setpoint + 1);
> +		f = x;
> +		f = f * x >> RATELIMIT_CALC_SHIFT;
> +		f = f * x >> RATELIMIT_CALC_SHIFT;
> +		f += 1 << RATELIMIT_CALC_SHIFT;
> +		pos_ratio = pos_ratio * f >> RATELIMIT_CALC_SHIFT;
> +	}
> +

  Hmm, so you multiply the pos_ratio computed for the global situation
with the pos_ratio computed for the memcg situation, right? Why? My
natural choice would be to just use the memcg situation for computing
pos_ratio, since a memcg is supposed to have less memory & stricter
limits than the root cgroup (global)...

>  	/*
>  	 * We have computed basic pos_ratio above based on global situation. If
>  	 * the bdi is over/under its share of dirty pages, we want to scale
> @@ -1202,6 +1213,8 @@ static void balance_dirty_pages(struct a
>  		freerun = dirty_freerun_ceiling(dirty_thresh,
>  						background_thresh);
>  		if (nr_dirty <= freerun) {
> +			if (memcg && memcg_dirty > memcg_freerun)
> +				goto start_writeback;
>  			current->dirty_paused_when = now;
>  			current->nr_dirtied = 0;
>  			current->nr_dirtied_pause =
> @@ -1209,6 +1222,7 @@ static void balance_dirty_pages(struct a
>  			break;
>  		}
>
> +start_writeback:
>  		if (unlikely(!writeback_in_progress(bdi)))
>  			bdi_start_background_writeback(bdi);

  I guess this should better be coupled with the memcg-aware writeback
that was part of Greg's original patches, if I remember right. That way
we'd know we are making progress on the pages of the right cgroup. But
we can certainly try this minimal change and see whether cgroups get
starved too much...
								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
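For reference, the memcg factor in Fengguang's patch above is the cubic
curve f = 1 + x^3 with x = (memcg_setpoint - memcg_dirty) /
(memcg_limit - memcg_setpoint + 1). The sketch below evaluates it in
floating point (the kernel patch uses fixed-point shifts, and the
setpoint/limit values here are made up) to show how a memcg well under
its setpoint gets a factor above 1 (allowed to dirty faster) while one
near its limit is throttled hard.

#include <stdio.h>

int main(void)
{
	/* assumed memcg dirty setpoint and limit, in pages */
	double limit = 100.0, setpoint = 75.0;

	for (double dirty = 0.0; dirty <= limit; dirty += 25.0) {
		double x = (setpoint - dirty) / (limit - setpoint + 1.0);
		double f = 1.0 + x * x * x;
		printf("memcg_dirty=%3.0f  x=%+5.2f  factor=%6.2f\n",
		       dirty, x, f);
	}
	return 0;
}

At dirty=0 the factor is about 25, at the setpoint it is exactly 1, and
just below the limit it drops to about 0.11. A real implementation
would presumably also clamp the factor to stay non-negative once
memcg_dirty overshoots the limit, which the posted hunk does not show.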
* Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.
  2012-02-02 10:15 ` Jan Kara
@ 2012-02-02 11:31 ` Wu Fengguang
  0 siblings, 0 replies; 16+ messages in thread
From: Wu Fengguang @ 2012-02-02 11:31 UTC (permalink / raw)
To: Jan Kara
Cc: Greg Thelen, bsingharora@gmail.com, Hugh Dickins, Michal Hocko,
    linux-mm, Mel Gorman, Ying Han, hannes@cmpxchg.org, lsf-pc,
    KAMEZAWA Hiroyuki

On Thu, Feb 02, 2012 at 11:15:25AM +0100, Jan Kara wrote:
> On Thu 02-02-12 14:33:45, Wu Fengguang wrote:
> > Hi Greg,
> >
> > On Wed, Feb 01, 2012 at 12:24:25PM -0800, Greg Thelen wrote:
> > > On Tue, Jan 31, 2012 at 4:55 PM, KAMEZAWA Hiroyuki
> > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > 4. dirty ratio
> > > >    In the last year, patches were posted but not merged. I'd like
> > > >    to hear about work in this area.
> > >
> > > I would like to attend to discuss this topic. I have not had much
> > > time to work on this recently, but should be able to focus more on
> > > this soon. The IO-less writeback changes require some redesign and
> > > may allow for a simpler implementation of
> > > mem_cgroup_balance_dirty_pages(). Maintaining per-container dirty
> > > page counts, ratios, and limits is fairly easy, but integration
> > > with writeback is the challenge. My big questions are for writeback
> > > people:
> > > 1. how to compute the per-container pause based on bdi bandwidth
> > >    and cgroup dirty page usage.
> > > 2. how to ensure that writeback will engage even if the system and
> > >    bdi are below their respective background dirty ratios, yet a
> > >    memcg is above its background dirty limit.
> >
> > The solution to (1, 2) would be something like this:
> >
> > --- linux-next.orig/mm/page-writeback.c	2012-02-02 14:13:45.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2012-02-02 14:24:11.000000000 +0800
> > @@ -654,6 +654,17 @@ static unsigned long bdi_position_ratio(
> >  	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> >  	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> >
> > +	if (memcg) {
> > +		long long f;
> > +		x = div_s64((memcg_setpoint - memcg_dirty) << RATELIMIT_CALC_SHIFT,
> > +			    memcg_limit - memcg_setpoint + 1);
> > +		f = x;
> > +		f = f * x >> RATELIMIT_CALC_SHIFT;
> > +		f = f * x >> RATELIMIT_CALC_SHIFT;
> > +		f += 1 << RATELIMIT_CALC_SHIFT;
> > +		pos_ratio = pos_ratio * f >> RATELIMIT_CALC_SHIFT;
> > +	}
> > +
>
>   Hmm, so you multiply the pos_ratio computed for the global situation
> with the pos_ratio computed for the memcg situation, right? Why? My
> natural choice would be to just use the memcg situation for computing
> pos_ratio, since a memcg is supposed to have less memory & stricter
> limits than the root cgroup (global)...

Yeah, I also started by considering a standalone memcg pos_ratio.
However, the above form frees us from worrying about a misconfigured
memcg dirty limit exceeding the global dirty limit, or the even more
uncontrollable case of the memcg dirty limit exceeding some bdi
threshold.

> >  	/*
> >  	 * We have computed basic pos_ratio above based on global situation. If
> >  	 * the bdi is over/under its share of dirty pages, we want to scale
> > @@ -1202,6 +1213,8 @@ static void balance_dirty_pages(struct a
> >  		freerun = dirty_freerun_ceiling(dirty_thresh,
> >  						background_thresh);
> >  		if (nr_dirty <= freerun) {
> > +			if (memcg && memcg_dirty > memcg_freerun)
> > +				goto start_writeback;
> >  			current->dirty_paused_when = now;
> >  			current->nr_dirtied = 0;
> >  			current->nr_dirtied_pause =
> > @@ -1209,6 +1222,7 @@ static void balance_dirty_pages(struct a
> >  			break;
> >  		}
> >
> > +start_writeback:
> >  		if (unlikely(!writeback_in_progress(bdi)))
> >  			bdi_start_background_writeback(bdi);
>
>   I guess this should better be coupled with the memcg-aware writeback
> that was part of Greg's original patches, if I remember right. That way
> we'd know we are making progress on the pages of the right cgroup. But
> we can certainly try this minimal change and see whether cgroups get
> starved too much...

Agreed. The complete solution would need more code from Greg to teach
the flusher to focus on the memcg's inodes/pages.

Thanks,
Fengguang
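As a rough illustration of why Fengguang prefers multiplying the two
factors rather than using the memcg one alone: whichever of the global
and per-memcg controls is closer to its limit pulls the product down,
so a misconfigured memcg limit cannot defeat global throttling. The
factor pairs below are invented sample values, not kernel output.

#include <stdio.h>

int main(void)
{
	/* illustrative (global, memcg) pos_ratio factor pairs */
	struct { double global, memcg; const char *when; } cases[] = {
		{ 1.5, 1.5, "both well under their setpoints" },
		{ 1.5, 0.2, "memcg near its limit, system idle" },
		{ 0.2, 1.5, "system near global limit, memcg idle" },
		{ 0.2, 0.2, "both near their limits" },
	};

	for (int i = 0; i < 4; i++)
		printf("global=%.1f memcg=%.1f -> pos_ratio=%.2f (%s)\n",
		       cases[i].global, cases[i].memcg,
		       cases[i].global * cases[i].memcg, cases[i].when);
	return 0;
}

A standalone memcg pos_ratio would give 1.5 in the third case and let
tasks dirty aggressively even with the system near its global limit;
the product gives 0.3 and keeps them throttled.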
end of thread, other threads: [~2012-02-03  9:50 UTC | newest]

Thread overview: 16+ messages
2012-02-01  0:55 [LSF/MM TOPIC] memcg topics KAMEZAWA Hiroyuki
2012-02-01  8:58 ` Glauber Costa
2012-02-02 11:33 ` [LSF/MM TOPIC][ATTEND] " Glauber Costa
2012-02-01 20:24 ` [LSF/MM TOPIC] " Greg Thelen
2012-02-02  6:33 ` Wu Fengguang
2012-02-02  7:34 ` Greg Thelen
2012-02-02  7:54 ` Wu Fengguang
2012-02-02  7:52 ` Wu Fengguang
2012-02-02 10:39 ` [Lsf-pc] " Jan Kara
2012-02-02 11:04 ` Wu Fengguang
2012-02-02 15:42 ` Jan Kara
2012-02-03  1:26 ` Wu Fengguang
2012-02-03  6:21 ` Greg Thelen
2012-02-03  9:40 ` Wu Fengguang
2012-02-02 10:15 ` Jan Kara
2012-02-02 11:31 ` Wu Fengguang