linux-mm.kvack.org archive mirror
* [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
@ 2012-07-30 22:32 Ying Han
  2012-07-31 15:59 ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Ying Han @ 2012-07-30 22:32 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Hillf Danton, Hugh Dickins, KOSAKI Motohiro,
	Andrew Morton
  Cc: linux-mm

In a memcg kernel, a cgroup under its soft limit is not targeted by global
reclaim. It is possible that all memcgs are under their soft limit for
a particular zone. If that is the case, the current implementation will
burn extra cpu cycles without making forward progress.

The idea is from the LSF discussion: detect this condition after the first
round of scanning and restart the reclaim without looking at the soft limit
at all. This allows us to make forward progress in shrink_zone().

Signed-off-by: Ying Han <yinghan@google.com>
---
 mm/vmscan.c |   17 +++++++++++++++--
 1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 59e633c..747d903 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1861,6 +1861,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		.priority = sc->priority,
 	};
 	struct mem_cgroup *memcg;
+	bool over_softlimit, ignore_softlimit = false;
+
+restart:
+	over_softlimit = false;
 
 	memcg = mem_cgroup_iter(root, NULL, &reclaim);
 	do {
@@ -1879,10 +1883,14 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		 * we have to reclaim under softlimit instead of burning more
 		 * cpu cycles.
 		 */
-		if (!global_reclaim(sc) || sc->priority < DEF_PRIORITY - 2 ||
-				mem_cgroup_over_soft_limit(memcg))
+		if (ignore_softlimit || !global_reclaim(sc) ||
+				sc->priority < DEF_PRIORITY - 2 ||
+				mem_cgroup_over_soft_limit(memcg)) {
 			shrink_lruvec(lruvec, sc);
 
+			over_softlimit = true;
+		}
+
 		/*
 		 * Limit reclaim has historically picked one memcg and
 		 * scanned it with decreasing priority levels until
@@ -1899,6 +1907,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		}
 		memcg = mem_cgroup_iter(root, memcg, &reclaim);
 	} while (memcg);
+
+	if (!over_softlimit) {
+		ignore_softlimit = true;
+		goto restart;
+	}
 }
 
 /* Returns true if compaction should go ahead for a high-order request */
-- 
1.7.7.3


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-07-30 22:32 [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim Ying Han
@ 2012-07-31 15:59 ` Michal Hocko
  2012-07-31 16:07   ` Rik van Riel
  2012-07-31 17:54   ` Ying Han
  0 siblings, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2012-07-31 15:59 UTC (permalink / raw)
  To: Ying Han
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Mon 30-07-12 15:32:18, Ying Han wrote:
> In memcg kernel, cgroup under its softlimit is not targeted under global
> reclaim. It could be possible that all memcgs are under their softlimit for
> a particular zone. 

This is a bit misleading because there is no softlimit per zone...

> If that is the case, the current implementation will burn extra cpu
> cycles without making forward progress.

This scales with the number of groups, which is bearable I guess. We do
not drop priority, so the wasted round will not put more pressure on
the reclaim.

> The idea is from LSF discussion where we detect it after the first round of
> scanning and restart the reclaim by not looking at softlimit at all. This
> allows us to make forward progress on shrink_zone().
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
>  mm/vmscan.c |   17 +++++++++++++++--
>  1 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 59e633c..747d903 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1861,6 +1861,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  		.priority = sc->priority,
>  	};
>  	struct mem_cgroup *memcg;
> +	bool over_softlimit, ignore_softlimit = false;
> +
> +restart:
> +	over_softlimit = false;
>  
>  	memcg = mem_cgroup_iter(root, NULL, &reclaim);
>  	do {
> @@ -1879,10 +1883,14 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  		 * we have to reclaim under softlimit instead of burning more
>  		 * cpu cycles.
>  		 */
> -		if (!global_reclaim(sc) || sc->priority < DEF_PRIORITY - 2 ||
> -				mem_cgroup_over_soft_limit(memcg))
> +		if (ignore_softlimit || !global_reclaim(sc) ||
> +				sc->priority < DEF_PRIORITY - 2 ||
> +				mem_cgroup_over_soft_limit(memcg)) {
>  			shrink_lruvec(lruvec, sc);
>  
> +			over_softlimit = true;
> +		}
> +
>  		/*
>  		 * Limit reclaim has historically picked one memcg and
>  		 * scanned it with decreasing priority levels until
> @@ -1899,6 +1907,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  		}
>  		memcg = mem_cgroup_iter(root, memcg, &reclaim);
>  	} while (memcg);
> +
> +	if (!over_softlimit) {

Is this ever false? At least root cgroup is always above the limit.
Shouldn't we rather compare reclaimed pages?

> +		ignore_softlimit = true;
> +		goto restart;
> +	}
>  }
>  
>  /* Returns true if compaction should go ahead for a high-order request */
> -- 
> 1.7.7.3
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-07-31 15:59 ` Michal Hocko
@ 2012-07-31 16:07   ` Rik van Riel
  2012-07-31 17:52     ` Ying Han
  2012-07-31 17:54   ` Ying Han
  1 sibling, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2012-07-31 16:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ying Han, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On 07/31/2012 11:59 AM, Michal Hocko wrote:

>> @@ -1899,6 +1907,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>   		}
>>   		memcg = mem_cgroup_iter(root, memcg,&reclaim);
>>   	} while (memcg);
>> +
>> +	if (!over_softlimit) {
>
> Is this ever false? At least root cgroup is always above the limit.
> Shouldn't we rather compare reclaimed pages?

Uh oh.

That could also result in us always reclaiming from the root cgroup
first...

Is that really what we want?

Having said that, in April I discussed an algorithm of LRU list
weighting with Ying and others that should work.  Ying's patches
look like a good basis to implement that on top of...

-- 
All rights reversed


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-07-31 16:07   ` Rik van Riel
@ 2012-07-31 17:52     ` Ying Han
  0 siblings, 0 replies; 18+ messages in thread
From: Ying Han @ 2012-07-31 17:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Tue, Jul 31, 2012 at 9:07 AM, Rik van Riel <riel@redhat.com> wrote:
> On 07/31/2012 11:59 AM, Michal Hocko wrote:
>
>>> @@ -1899,6 +1907,11 @@ static void shrink_zone(struct zone *zone, struct
>>> scan_control *sc)
>>>                 }
>>>                 memcg = mem_cgroup_iter(root, memcg,&reclaim);
>>>         } while (memcg);
>>> +
>>> +       if (!over_softlimit) {
>>
>>
>> Is this ever false? At least root cgroup is always above the limit.
>> Shouldn't we rather compare reclaimed pages?
>
>
> Uh oh.
>
> That could also result in us always reclaiming from the root cgroup
> first...

That is not true as far as I can tell. The mem_cgroup_reclaim_cookie
remembers the last scanned memcg for the given priority in iter->position,
and the next round just starts at iter->position + 1. That cookie is
shared between different reclaim threads, so the starting point varies
depending on how many threads have entered reclaim. That said, it is true
if there is only one reclaiming thread: then we always start from root
and break when reaching the end of the list.
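
For reference, the shared walk state being described looks roughly like
this in kernels of that era (simplified; the exact definitions live in
mm/memcontrol.c):

struct mem_cgroup_reclaim_cookie {
	struct zone *zone;
	int priority;
	unsigned int generation;
};

/*
 * Each memcg keeps one of these per zone and per reclaim priority, so
 * concurrent reclaimers of the same zone/priority continue the hierarchy
 * walk from iter->position instead of starting over at root.
 */
struct mem_cgroup_reclaim_iter {
	int position;			/* css_id of the last scanned hierarchy member */
	unsigned int generation;	/* increased on every round-trip */
};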

>
> Is that really what we want?

I don't see my patch changing that part. The only difference is that I
might end up scanning the same memcg list with the same priority twice.

>
> Having said that, in April I discussed an algorithm of LRU list
> weighting with Ying and others that should work.  Ying's patches
> look like a good basis to implement that on top of...

Yes.

--Ying
>
> --
> All rights reversed


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-07-31 15:59 ` Michal Hocko
  2012-07-31 16:07   ` Rik van Riel
@ 2012-07-31 17:54   ` Ying Han
  2012-07-31 20:02     ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Ying Han @ 2012-07-31 17:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Tue, Jul 31, 2012 at 8:59 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Mon 30-07-12 15:32:18, Ying Han wrote:
>> In memcg kernel, cgroup under its softlimit is not targeted under global
>> reclaim. It could be possible that all memcgs are under their softlimit for
>> a particular zone.
>
> This is a bit misleading because there is no softlimit per zone...
>
>> If that is the case, the current implementation will burn extra cpu
>> cycles without making forward progress.
>
> This scales with the number of groups which is bearable I guess. We do
> not drop priority so the wasted round will not make a bigger pressure on
> the reclaim.
>
>> The idea is from LSF discussion where we detect it after the first round of
>> scanning and restart the reclaim by not looking at softlimit at all. This
>> allows us to make forward progress on shrink_zone().
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>>  mm/vmscan.c |   17 +++++++++++++++--
>>  1 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 59e633c..747d903 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1861,6 +1861,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>               .priority = sc->priority,
>>       };
>>       struct mem_cgroup *memcg;
>> +     bool over_softlimit, ignore_softlimit = false;
>> +
>> +restart:
>> +     over_softlimit = false;
>>
>>       memcg = mem_cgroup_iter(root, NULL, &reclaim);
>>       do {
>> @@ -1879,10 +1883,14 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>                * we have to reclaim under softlimit instead of burning more
>>                * cpu cycles.
>>                */
>> -             if (!global_reclaim(sc) || sc->priority < DEF_PRIORITY - 2 ||
>> -                             mem_cgroup_over_soft_limit(memcg))
>> +             if (ignore_softlimit || !global_reclaim(sc) ||
>> +                             sc->priority < DEF_PRIORITY - 2 ||
>> +                             mem_cgroup_over_soft_limit(memcg)) {
>>                       shrink_lruvec(lruvec, sc);
>>
>> +                     over_softlimit = true;
>> +             }
>> +
>>               /*
>>                * Limit reclaim has historically picked one memcg and
>>                * scanned it with decreasing priority levels until
>> @@ -1899,6 +1907,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>               }
>>               memcg = mem_cgroup_iter(root, memcg, &reclaim);
>>       } while (memcg);
>> +
>> +     if (!over_softlimit) {
>
> Is this ever false? At least root cgroup is always above the limit.
> Shouldn't we rather compare reclaimed pages?

Do we always start from root? My understanding of the reclaim cookie is
that it remembers the last scanned memcg under root and then starts from
the one after it. The loop breaks every time we reach the end of the list,
and it is possible we never reach root at all.

--Ying

>
>> +             ignore_softlimit = true;
>> +             goto restart;
>> +     }
>>  }
>>
>>  /* Returns true if compaction should go ahead for a high-order request */
>> --
>> 1.7.7.3
>>
>
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-07-31 17:54   ` Ying Han
@ 2012-07-31 20:02     ` Michal Hocko
  2012-07-31 20:59       ` Ying Han
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2012-07-31 20:02 UTC (permalink / raw)
  To: Ying Han
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Tue 31-07-12 10:54:38, Ying Han wrote:
> On Tue, Jul 31, 2012 at 8:59 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Mon 30-07-12 15:32:18, Ying Han wrote:
> >> In memcg kernel, cgroup under its softlimit is not targeted under global
> >> reclaim. It could be possible that all memcgs are under their softlimit for
> >> a particular zone.
> >
> > This is a bit misleading because there is no softlimit per zone...
> >
> >> If that is the case, the current implementation will burn extra cpu
> >> cycles without making forward progress.
> >
> > This scales with the number of groups which is bearable I guess. We do
> > not drop priority so the wasted round will not make a bigger pressure on
> > the reclaim.
> >
> >> The idea is from LSF discussion where we detect it after the first round of
> >> scanning and restart the reclaim by not looking at softlimit at all. This
> >> allows us to make forward progress on shrink_zone().
> >>
> >> Signed-off-by: Ying Han <yinghan@google.com>
> >> ---
> >>  mm/vmscan.c |   17 +++++++++++++++--
> >>  1 files changed, 15 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 59e633c..747d903 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1861,6 +1861,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> >>               .priority = sc->priority,
> >>       };
> >>       struct mem_cgroup *memcg;
> >> +     bool over_softlimit, ignore_softlimit = false;
> >> +
> >> +restart:
> >> +     over_softlimit = false;
> >>
> >>       memcg = mem_cgroup_iter(root, NULL, &reclaim);
> >>       do {
> >> @@ -1879,10 +1883,14 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> >>                * we have to reclaim under softlimit instead of burning more
> >>                * cpu cycles.
> >>                */
> >> -             if (!global_reclaim(sc) || sc->priority < DEF_PRIORITY - 2 ||
> >> -                             mem_cgroup_over_soft_limit(memcg))
> >> +             if (ignore_softlimit || !global_reclaim(sc) ||
> >> +                             sc->priority < DEF_PRIORITY - 2 ||
> >> +                             mem_cgroup_over_soft_limit(memcg)) {
> >>                       shrink_lruvec(lruvec, sc);
> >>
> >> +                     over_softlimit = true;
> >> +             }
> >> +
> >>               /*
> >>                * Limit reclaim has historically picked one memcg and
> >>                * scanned it with decreasing priority levels until
> >> @@ -1899,6 +1907,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> >>               }
> >>               memcg = mem_cgroup_iter(root, memcg, &reclaim);
> >>       } while (memcg);
> >> +
> >> +     if (!over_softlimit) {
> >
> > Is this ever false? At least root cgroup is always above the limit.
> > Shouldn't we rather compare reclaimed pages?
> 
> Do we always start from root? My understanding of reclaim_cookie is
> that remembers the last scanned memcg under root and then start from
> the one after it. 

Yes it visits all nodes of the hierarchy.

> The loop breaks everytime we reach the end of it, and it could be
> possible we didn't reach root at all.

Global reclaim means the root is involved and we do not break out of the
loop, so the root will be visited as well. And if nobody is over the
soft limit then at least root is (according to mem_cgroup_over_soft_limit).


-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-07-31 20:02     ` Michal Hocko
@ 2012-07-31 20:59       ` Ying Han
  2012-08-01  8:45         ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Ying Han @ 2012-07-31 20:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Tue, Jul 31, 2012 at 1:02 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Tue 31-07-12 10:54:38, Ying Han wrote:
>> On Tue, Jul 31, 2012 at 8:59 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> > On Mon 30-07-12 15:32:18, Ying Han wrote:
>> >> In memcg kernel, cgroup under its softlimit is not targeted under global
>> >> reclaim. It could be possible that all memcgs are under their softlimit for
>> >> a particular zone.
>> >
>> > This is a bit misleading because there is no softlimit per zone...
>> >
>> >> If that is the case, the current implementation will burn extra cpu
>> >> cycles without making forward progress.
>> >
>> > This scales with the number of groups which is bearable I guess. We do
>> > not drop priority so the wasted round will not make a bigger pressure on
>> > the reclaim.
>> >
>> >> The idea is from LSF discussion where we detect it after the first round of
>> >> scanning and restart the reclaim by not looking at softlimit at all. This
>> >> allows us to make forward progress on shrink_zone().
>> >>
>> >> Signed-off-by: Ying Han <yinghan@google.com>
>> >> ---
>> >>  mm/vmscan.c |   17 +++++++++++++++--
>> >>  1 files changed, 15 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> index 59e633c..747d903 100644
>> >> --- a/mm/vmscan.c
>> >> +++ b/mm/vmscan.c
>> >> @@ -1861,6 +1861,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>> >>               .priority = sc->priority,
>> >>       };
>> >>       struct mem_cgroup *memcg;
>> >> +     bool over_softlimit, ignore_softlimit = false;
>> >> +
>> >> +restart:
>> >> +     over_softlimit = false;
>> >>
>> >>       memcg = mem_cgroup_iter(root, NULL, &reclaim);
>> >>       do {
>> >> @@ -1879,10 +1883,14 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>> >>                * we have to reclaim under softlimit instead of burning more
>> >>                * cpu cycles.
>> >>                */
>> >> -             if (!global_reclaim(sc) || sc->priority < DEF_PRIORITY - 2 ||
>> >> -                             mem_cgroup_over_soft_limit(memcg))
>> >> +             if (ignore_softlimit || !global_reclaim(sc) ||
>> >> +                             sc->priority < DEF_PRIORITY - 2 ||
>> >> +                             mem_cgroup_over_soft_limit(memcg)) {
>> >>                       shrink_lruvec(lruvec, sc);
>> >>
>> >> +                     over_softlimit = true;
>> >> +             }
>> >> +
>> >>               /*
>> >>                * Limit reclaim has historically picked one memcg and
>> >>                * scanned it with decreasing priority levels until
>> >> @@ -1899,6 +1907,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>> >>               }
>> >>               memcg = mem_cgroup_iter(root, memcg, &reclaim);
>> >>       } while (memcg);
>> >> +
>> >> +     if (!over_softlimit) {
>> >
>> > Is this ever false? At least root cgroup is always above the limit.
>> > Shouldn't we rather compare reclaimed pages?
>>
>> Do we always start from root? My understanding of reclaim_cookie is
>> that remembers the last scanned memcg under root and then start from
>> the one after it.
>
> Yes it visits all nodes of the hierarchy.
>
>> The loop breaks everytime we reach the end of it, and it could be
>> possible we didn't reach root at all.
>
> Global reclaim means the root is involved and we do not break out
> the loop so the root will be visited as well. And if nobody is over the
> soft limit then at least root is (according to mem_cgroup_over_soft_limit).

That is slightly different from my understanding. Forgive me if I have
totally misunderstood how mem_cgroup_iter() works.

In mem_cgroup_over_soft_limit(), we always return true for the root
cgroup, which means we always reclaim from root once we get to it.
However, there is no guarantee that the reclaim thread gets to root on
every invocation of shrink_zone().
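
For context, that helper comes from patch 1/2 of this series (not quoted
in this thread); a minimal sketch of the behaviour being discussed, with
the root special case, could look like this (the actual patch may differ):

bool mem_cgroup_over_soft_limit(struct mem_cgroup *memcg)
{
	/* root has no soft limit of its own and always counts as over it */
	if (mem_cgroup_is_root(memcg))
		return true;

	/* otherwise the group is over once usage exceeds the configured limit */
	return res_counter_soft_limit_excess(&memcg->res) > 0;
}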

Take the following example, where the cgroups are sorted by css_id and
none of them (except root) has usage above its soft limit:

                            root  a  b  c  d  e  f ... max
thread_1 (priority = 12)      ^
                              iter->position = 1   (over_softlimit = true)

                                  ^
                                  iter->position = 2

thread_2 (priority = 12)             ^
                                     iter->position = 3

                                          ....
                                                        ^
                                                        iter->position = 0   (over_softlimit = false)

In this case, thread_1 gets to scan root but thread_2 does not, since
they share the walk for the same zone (same node) and the same reclaim
priority.

--Ying

>
>
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-07-31 20:59       ` Ying Han
@ 2012-08-01  8:45         ` Michal Hocko
  2012-08-01 19:04           ` Ying Han
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2012-08-01  8:45 UTC (permalink / raw)
  To: Ying Han
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Tue 31-07-12 13:59:35, Ying Han wrote:
[...]
> Take the following example, where the cgroups are sorted by css_id and
> none of them (except root) has usage above its soft limit:
> 
>                             root  a  b  c  d  e  f ... max
> thread_1 (priority = 12)      ^
>                               iter->position = 1   (over_softlimit = true)
> 
>                                   ^
>                                   iter->position = 2
> 
> thread_2 (priority = 12)             ^
>                                      iter->position = 3
> 
>                                           ....
>                                                         ^
>                                                         iter->position = 0   (over_softlimit = false)
> 
> In this case, thread_1 gets to scan root but thread_2 does not, since
> they share the walk for the same zone (same node) and the same reclaim
> priority.

That is true, the iterator is per zone per priority when the cookie is
used, but that wasn't my point.
Take a much simpler case: just background reclaim without any direct
reclaim. Then there is nobody to race with, so we would always visit
the whole tree including the root, and if no group is above the soft
limit we would hammer the root cgroup until the priority drops to the
point where we ignore the limit and reclaim from all. Makes sense?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-08-01  8:45         ` Michal Hocko
@ 2012-08-01 19:04           ` Ying Han
  2012-08-01 20:10             ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Ying Han @ 2012-08-01 19:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Wed, Aug 1, 2012 at 1:45 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Tue 31-07-12 13:59:35, Ying Han wrote:
> [...]
>> Take the following example, where the cgroups are sorted by css_id and
>> none of them (except root) has usage above its soft limit:
>>
>>                             root  a  b  c  d  e  f ... max
>> thread_1 (priority = 12)      ^
>>                               iter->position = 1   (over_softlimit = true)
>>
>>                                   ^
>>                                   iter->position = 2
>>
>> thread_2 (priority = 12)             ^
>>                                      iter->position = 3
>>
>>                                           ....
>>                                                         ^
>>                                                         iter->position = 0   (over_softlimit = false)
>>
>> In this case, thread_1 gets to scan root but thread_2 does not, since
>> they share the walk for the same zone (same node) and the same reclaim
>> priority.
>
> That is true iterator is per zone per priority if the cookie is used but
> that wasn't my point.
> Take a much simpler case. Just the background reclaim without any direct
> reclaim. Then there is nobody to race with and so we would always visit
> the whole tree including the root and so if no group is above the soft
> limit we would hammer the root cgroup until priority gets down when we
> ignore the limit and reclaim from all. Makes sense?

That is true. Hmm, then there are two things I can do:

1. for the kswapd case, make sure not to count the root cgroup
2. or check nr_scanned. I like nr_scanned since it tells us whether or
not the reclaim ever made any attempt (rough sketch below).
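
One possible shape of option 2, keeping the restart structure of the
patch above but keying the retry off sc->nr_scanned (illustrative only,
not the actual patch):

	struct mem_cgroup *memcg;
	unsigned long nr_scanned;
	bool ignore_softlimit = false;

restart:
	nr_scanned = sc->nr_scanned;

	memcg = mem_cgroup_iter(root, NULL, &reclaim);
	do {
		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

		if (ignore_softlimit || !global_reclaim(sc) ||
				sc->priority < DEF_PRIORITY - 2 ||
				mem_cgroup_over_soft_limit(memcg))
			shrink_lruvec(lruvec, sc);

		memcg = mem_cgroup_iter(root, memcg, &reclaim);
	} while (memcg);

	/*
	 * sc->nr_scanned did not move at all: every memcg was skipped by
	 * the soft limit check, so retry once while ignoring the soft limit.
	 */
	if (sc->nr_scanned == nr_scanned && !ignore_softlimit) {
		ignore_softlimit = true;
		goto restart;
	}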

--Ying

> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-08-01 19:04           ` Ying Han
@ 2012-08-01 20:10             ` Rik van Riel
  2012-08-02  0:09               ` Ying Han
  2012-08-06 14:03               ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Rik van Riel @ 2012-08-01 20:10 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On 08/01/2012 03:04 PM, Ying Han wrote:

> That is true. Hmm, then two things i can do:
>
> 1. for kswapd case, make sure not counting the root cgroup
> 2. or check nr_scanned. I like the nr_scanned which is telling us
> whether or not the reclaim ever make any attempt ?

I am looking at a more advanced case of (3) right
now.  Once I have the basics working, I will send
you a prototype (that applies on top of your patches)
to play with.

Basically, for every LRU in the system, we can keep
track of 4 things:
- reclaim_stat->recent_scanned
- reclaim_stat->recent_rotated
- reclaim_stat->recent_pressure
- LRU size

The first two represent the fraction of pages on the
list that are actively used.  The larger the fraction
of recently used pages, the more valuable the cache
is. The inverse of that can be used to show us how
hard to reclaim this cache, compared to other caches
(everything else being equal).

The recent pressure can be used to keep track of how
many pages we have scanned on each LRU list recently.
Pressure is scaled with LRU size.

This would be the basic formula to decide which LRU
to reclaim from:

           recent_scanned   LRU size
score =   -------------- * ----------------
           recent_rotated   recent_pressure


In other words, the less the objects on an LRU are
used, the more we should reclaim from that LRU. The
larger an LRU is, the more we should reclaim from
that LRU.

The more we have already scanned an LRU, the lower
its score becomes. At some point, another LRU will
have the top score, and that will be the target to
scan.

We can adjust the score for different LRUs in different
ways, e.g.:
- swappiness adjustment for file vs anon LRUs, within
   an LRU set
- if an LRU set contains a file LRU with more inactive
   than active pages, reclaim from this LRU set first
- if an LRU set is over its soft limit, reclaim from
   this LRU set first

This also gives us a nice way to balance memory pressure
between zones, etc...
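
For illustration, the proposed score could be computed along these lines;
note that recent_pressure is a proposed new field and the helper below is
hypothetical, not existing kernel code:

/* hypothetical: score one set of LRUs, higher means reclaim from it first */
static u64 lru_set_score(struct zone_reclaim_stat *stat, unsigned long lru_size)
{
	u64 scanned  = stat->recent_scanned[0] + stat->recent_scanned[1] + 1;
	u64 rotated  = stat->recent_rotated[0] + stat->recent_rotated[1] + 1;
	u64 pressure = stat->recent_pressure + 1;	/* proposed new field */

	/* (scanned / rotated) * (lru_size / pressure) without losing precision */
	return div64_u64(scanned * lru_size, rotated * pressure);
}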


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-08-01 20:10             ` Rik van Riel
@ 2012-08-02  0:09               ` Ying Han
  2012-08-02  0:43                 ` Rik van Riel
  2012-08-06 14:03               ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Ying Han @ 2012-08-02  0:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Wed, Aug 1, 2012 at 1:10 PM, Rik van Riel <riel@redhat.com> wrote:
> On 08/01/2012 03:04 PM, Ying Han wrote:
>
>> That is true. Hmm, then two things i can do:
>>
>> 1. for kswapd case, make sure not counting the root cgroup
>> 2. or check nr_scanned. I like the nr_scanned which is telling us
>> whether or not the reclaim ever make any attempt ?
>
>
> I am looking at a more advanced case of (3) right
> now.  Once I have the basics working, I will send
> you a prototype (that applies on top of your patches)
> to play with.

Rik,

Thank you for looking into that. Before I dig into the algorithm you
described here, do you think we need to hold this patchset for that?
It would be easier to build on top of things once the groundwork is
sorted out.

--Ying

> Basically, for every LRU in the system, we can keep
> track of 4 things:
> - reclaim_stat->recent_scanned
> - reclaim_stat->recent_rotated
> - reclaim_stat->recent_pressure
> - LRU size
>
> The first two represent the fraction of pages on the
> list that are actively used.  The larger the fraction
> of recently used pages, the more valuable the cache
> is. The inverse of that can be used to show us how
> hard to reclaim this cache, compared to other caches
> (everything else being equal).
>
> The recent pressure can be used to keep track of how
> many pages we have scanned on each LRU list recently.
> Pressure is scaled with LRU size.
>
> This would be the basic formula to decide which LRU
> to reclaim from:
>
>           recent_scanned   LRU size
> score =   -------------- * ----------------
>           recent_rotated   recent_pressure
>
>
> In other words, the less the objects on an LRU are
> used, the more we should reclaim from that LRU. The
> larger an LRU is, the more we should reclaim from
> that LRU.
>
> The more we have already scanned an LRU, the lower
> its score becomes. At some point, another LRU will
> have the top score, and that will be the target to
> scan.
>
> We can adjust the score for different LRUs in different
> ways, eg.:
> - swappiness adjustment for file vs anon LRUs, within
>   an LRU set
> - if an LRU set contains a file LRU with more inactive
>   than active pages, reclaim from this LRU set first
> - if an LRU set is over it's soft limit, reclaim from
>   this LRU set first
>
> This also gives us a nice way to balance memory pressure
> between zones, etc...


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-08-02  0:09               ` Ying Han
@ 2012-08-02  0:43                 ` Rik van Riel
  0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2012-08-02  0:43 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On 08/01/2012 08:09 PM, Ying Han wrote:
> On Wed, Aug 1, 2012 at 1:10 PM, Rik van Riel<riel@redhat.com>  wrote:
>> On 08/01/2012 03:04 PM, Ying Han wrote:
>>
>>> That is true. Hmm, then two things i can do:
>>>
>>> 1. for kswapd case, make sure not counting the root cgroup
>>> 2. or check nr_scanned. I like the nr_scanned which is telling us
>>> whether or not the reclaim ever make any attempt ?
>>
>>
>> I am looking at a more advanced case of (3) right
>> now.  Once I have the basics working, I will send
>> you a prototype (that applies on top of your patches)
>> to play with.
>
> Rik,
>
> Thank you for looking into that. Before I dig into the algorithm you
> described here, do you think we need to hold this patchset for that?
> It would be easier to build on top of the things after the ground work
> is sorted out.

I'm fine either way.

-- 
All rights reversed


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-08-01 20:10             ` Rik van Riel
  2012-08-02  0:09               ` Ying Han
@ 2012-08-06 14:03               ` Michal Hocko
  2012-08-06 14:27                 ` Rik van Riel
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2012-08-06 14:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ying Han, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Wed 01-08-12 16:10:32, Rik van Riel wrote:
> On 08/01/2012 03:04 PM, Ying Han wrote:
> 
> >That is true. Hmm, then two things i can do:
> >
> >1. for kswapd case, make sure not counting the root cgroup
> >2. or check nr_scanned. I like the nr_scanned which is telling us
> >whether or not the reclaim ever make any attempt ?
> 
> I am looking at a more advanced case of (3) right
> now.  Once I have the basics working, I will send
> you a prototype (that applies on top of your patches)
> to play with.
> 
> Basically, for every LRU in the system, we can keep
> track of 4 things:
> - reclaim_stat->recent_scanned
> - reclaim_stat->recent_rotated
> - reclaim_stat->recent_pressure
> - LRU size
> 
> The first two represent the fraction of pages on the
> list that are actively used.  The larger the fraction
> of recently used pages, the more valuable the cache
> is. The inverse of that can be used to show us how
> hard to reclaim this cache, compared to other caches
> (everything else being equal).
> 
> The recent pressure can be used to keep track of how
> many pages we have scanned on each LRU list recently.
> Pressure is scaled with LRU size.
> 
> This would be the basic formula to decide which LRU
> to reclaim from:
> 
>           recent_scanned   LRU size
> score =   -------------- * ----------------
>           recent_rotated   recent_pressure
> 
> 
> In other words, the less the objects on an LRU are
> used, the more we should reclaim from that LRU. The
> larger an LRU is, the more we should reclaim from
> that LRU.

The formula makes sense but I am afraid it will be hard to tune into
something that doesn't regress. For example, I have seen workloads with
many small groups used to wrap up backup jobs; those are scanned a lot
and you would also see many rotations because of the writeback, yet they
are definitely better to scan than a large group which needs to keep its
data resident.
Anyway, I am not saying the score approach is a bad idea, but I am afraid
it will be hard to validate and get right.

> The more we have already scanned an LRU, the lower
> its score becomes. At some point, another LRU will
> have the top score, and that will be the target to
> scan.

So you think we shouldn't do the full round over memcgs in shrink_zone
and rather do it the oom way: pick a victim and hammer it?

> We can adjust the score for different LRUs in different
> ways, eg.:
> - swappiness adjustment for file vs anon LRUs, within
>   an LRU set
> - if an LRU set contains a file LRU with more inactive
>   than active pages, reclaim from this LRU set first
> - if an LRU set is over it's soft limit, reclaim from
>   this LRU set first

maybe we could replace LRU size by (LRU size - soft_limit) in the above
formula?
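
That would make the formula read roughly like this (clamping at zero for
groups still under their limit):

           recent_scanned     max(LRU size - soft limit, 0)
score  =   --------------  *  ------------------------------
           recent_rotated     recent_pressure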

> 
> This also gives us a nice way to balance memory pressure
> between zones, etc...

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-08-06 14:03               ` Michal Hocko
@ 2012-08-06 14:27                 ` Rik van Riel
  2012-08-06 15:11                   ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2012-08-06 14:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ying Han, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On 08/06/2012 10:03 AM, Michal Hocko wrote:
> On Wed 01-08-12 16:10:32, Rik van Riel wrote:
>> On 08/01/2012 03:04 PM, Ying Han wrote:
>>
>>> That is true. Hmm, then two things i can do:
>>>
>>> 1. for kswapd case, make sure not counting the root cgroup
>>> 2. or check nr_scanned. I like the nr_scanned which is telling us
>>> whether or not the reclaim ever make any attempt ?
>>
>> I am looking at a more advanced case of (3) right
>> now.  Once I have the basics working, I will send
>> you a prototype (that applies on top of your patches)
>> to play with.
>>
>> Basically, for every LRU in the system, we can keep
>> track of 4 things:
>> - reclaim_stat->recent_scanned
>> - reclaim_stat->recent_rotated
>> - reclaim_stat->recent_pressure
>> - LRU size
>>
>> The first two represent the fraction of pages on the
>> list that are actively used.  The larger the fraction
>> of recently used pages, the more valuable the cache
>> is. The inverse of that can be used to show us how
>> hard to reclaim this cache, compared to other caches
>> (everything else being equal).
>>
>> The recent pressure can be used to keep track of how
>> many pages we have scanned on each LRU list recently.
>> Pressure is scaled with LRU size.
>>
>> This would be the basic formula to decide which LRU
>> to reclaim from:
>>
>>            recent_scanned   LRU size
>> score =   -------------- * ----------------
>>            recent_rotated   recent_pressure
>>
>>
>> In other words, the less the objects on an LRU are
>> used, the more we should reclaim from that LRU. The
>> larger an LRU is, the more we should reclaim from
>> that LRU.
>
> The formula makes sense but I am afraid that it will be hard to tune it
> into something that wouldn't regress. For example I have seen workloads
> which had many small groups which are used to wrap up backup jobs and
> those are scanned a lot, you would see also many rotations because of
> the writeback but those are definitely good to scan rather than a large
> group which needs to keep its data resident.

Writeback rotations are not counted in
lruvec->reclaim_stat->recent_rotated - only the rotations
that were done because we really want to keep the page are
counted.

> Anyway, I am not saying the score approach is a bad idea but I am afraid
> it will be hard to validate and make it right.

One thing about the recent_scanned / recent_rotated metric is
that we have been using it since 2.6.28, to balance between
scanning the file and anonymous LRUs.

I believe it would help us balance between multiple sets of
LRUs, too.
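
For reference, the existing anon/file balancing uses that ratio roughly as
follows (simplified from get_scan_count() in mm/vmscan.c of that era):

	anon_prio = vmscan_swappiness(sc);
	file_prio = 200 - anon_prio;

	/* the more a list's pages get rotated (reused), the less we scan it */
	ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
	ap /= reclaim_stat->recent_rotated[0] + 1;

	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
	fp /= reclaim_stat->recent_rotated[1] + 1;

	/* the ap:fp ratio then sets the scan split between anon and file LRUs */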

>> The more we have already scanned an LRU, the lower
>> its score becomes. At some point, another LRU will
>> have the top score, and that will be the target to
>> scan.
>
> So you think we shouldn't do the full round over memcgs in shrink_zone a
> and rather do it oom way to pick up a victim and hammer it?

Not hammer it too far.  Only until its score ends up well
below (25% lower?) that of the second highest scoring
list.

That way all the lists get hammered a little bit, in turn.

>> We can adjust the score for different LRUs in different
>> ways, eg.:
>> - swappiness adjustment for file vs anon LRUs, within
>>    an LRU set
>> - if an LRU set contains a file LRU with more inactive
>>    than active pages, reclaim from this LRU set first
>> - if an LRU set is over it's soft limit, reclaim from
>>    this LRU set first
>
> maybe we could replace LRU size by (LRU size - soft_limit) in the above
> formula?

Good idea, that could work.

-- 
All rights reversed


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-08-06 14:27                 ` Rik van Riel
@ 2012-08-06 15:11                   ` Michal Hocko
  2012-08-06 18:51                     ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2012-08-06 15:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ying Han, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Mon 06-08-12 10:27:25, Rik van Riel wrote:
> On 08/06/2012 10:03 AM, Michal Hocko wrote:
> >On Wed 01-08-12 16:10:32, Rik van Riel wrote:
> >>On 08/01/2012 03:04 PM, Ying Han wrote:
> >>
> >>>That is true. Hmm, then two things i can do:
> >>>
> >>>1. for kswapd case, make sure not counting the root cgroup
> >>>2. or check nr_scanned. I like the nr_scanned which is telling us
> >>>whether or not the reclaim ever make any attempt ?
> >>
> >>I am looking at a more advanced case of (3) right
> >>now.  Once I have the basics working, I will send
> >>you a prototype (that applies on top of your patches)
> >>to play with.
> >>
> >>Basically, for every LRU in the system, we can keep
> >>track of 4 things:
> >>- reclaim_stat->recent_scanned
> >>- reclaim_stat->recent_rotated
> >>- reclaim_stat->recent_pressure
> >>- LRU size
> >>
> >>The first two represent the fraction of pages on the
> >>list that are actively used.  The larger the fraction
> >>of recently used pages, the more valuable the cache
> >>is. The inverse of that can be used to show us how
> >>hard to reclaim this cache, compared to other caches
> >>(everything else being equal).
> >>
> >>The recent pressure can be used to keep track of how
> >>many pages we have scanned on each LRU list recently.
> >>Pressure is scaled with LRU size.
> >>
> >>This would be the basic formula to decide which LRU
> >>to reclaim from:
> >>
> >>           recent_scanned   LRU size
> >>score =   -------------- * ----------------
> >>           recent_rotated   recent_pressure
> >>
> >>
> >>In other words, the less the objects on an LRU are
> >>used, the more we should reclaim from that LRU. The
> >>larger an LRU is, the more we should reclaim from
> >>that LRU.
> >
> >The formula makes sense but I am afraid that it will be hard to tune it
> >into something that wouldn't regress. For example I have seen workloads
> >which had many small groups which are used to wrap up backup jobs and
> >those are scanned a lot, you would see also many rotations because of
> >the writeback but those are definitely good to scan rather than a large
> >group which needs to keep its data resident.
> 
> Writeback rotations are not counted in
> lruvec->reclaim_stat->recent_rotated - only the rotations
> that were done because we really want to keep the page are
> counted.

OK. I missed that.

> >Anyway, I am not saying the score approach is a bad idea but I am afraid
> >it will be hard to validate and make it right.
> 
> One thing about the recent_scanned / recent_rotated metric is
> that we have been using it since 2.6.28, to balance between
> scanning the file and anonymous LRUs.
> 
> I believe it would help us balance between multiple sets of
> LRUs, too.
> 
> >>The more we have already scanned an LRU, the lower
> >>its score becomes. At some point, another LRU will
> >>have the top score, and that will be the target to
> >>scan.
> >
> >So you think we shouldn't do the full round over memcgs in shrink_zone a
> >and rather do it oom way to pick up a victim and hammer it?
> 
> Not hammer it too far.  Only until its score ends up well
> below (25% lower?) than that of the second highest scoring
> list.
> 
> That way all the lists get hammered a little bit, in turn.

How do we provide the soft limit guarantee then?

[...]
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-08-06 15:11                   ` Michal Hocko
@ 2012-08-06 18:51                     ` Rik van Riel
  2012-08-06 21:18                       ` Ying Han
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2012-08-06 18:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ying Han, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On 08/06/2012 11:11 AM, Michal Hocko wrote:
> On Mon 06-08-12 10:27:25, Rik van Riel wrote:

>>> So you think we shouldn't do the full round over memcgs in shrink_zone a
>>> and rather do it oom way to pick up a victim and hammer it?
>>
>> Not hammer it too far.  Only until its score ends up well
>> below (25% lower?) than that of the second highest scoring
>> list.
>>
>> That way all the lists get hammered a little bit, in turn.
>
> How do we provide the soft limit guarantee then?
>
> [...]

The easiest way would be to find the top 2 or 3 scoring memcgs
when we reclaim memory. After reclaiming some pages, recalculate
the scores of just these top lists, and see if the list we started
out with now has a lower score than the second one.

Once we have reclaimed some from each of the 2 or 3 lists, we can
go back and find the highest priority lists again.

Direct reclaim only reclaims a little bit at a time, anyway.

For kswapd, we could also remember the number of pages the group
has in excess of its soft limit, and recalculate after that...
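
A rough sketch of that selection loop; every helper here is hypothetical
and priority/termination handling is omitted:

	for (;;) {
		struct lruvec *best, *second;

		/* one pass over the memcgs to find the two highest scores */
		find_top_two_lruvecs(zone, sc, &best, &second);	/* hypothetical */
		if (!best || sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;

		/* hammer the leader until it falls ~25% below the runner-up */
		do {
			shrink_lruvec(best, sc);
			if (sc->nr_reclaimed >= sc->nr_to_reclaim)
				return;
		} while (second &&
			 lruvec_score(best) * 4 > lruvec_score(second) * 3);
	}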

-- 
All rights reversed


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-08-06 18:51                     ` Rik van Riel
@ 2012-08-06 21:18                       ` Ying Han
  2012-08-06 22:54                         ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Ying Han @ 2012-08-06 21:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On Mon, Aug 6, 2012 at 11:51 AM, Rik van Riel <riel@redhat.com> wrote:
> On 08/06/2012 11:11 AM, Michal Hocko wrote:
>>
>> On Mon 06-08-12 10:27:25, Rik van Riel wrote:
>
>
>>>> So you think we shouldn't do the full round over memcgs in shrink_zone a
>>>> and rather do it oom way to pick up a victim and hammer it?
>>>
>>>
>>> Not hammer it too far.  Only until its score ends up well
>>> below (25% lower?) than that of the second highest scoring
>>> list.
>>>
>>> That way all the lists get hammered a little bit, in turn.
>>
>>
>> How do we provide the soft limit guarantee then?
>>
>> [...]
>
>
> The easiest way would be to find the top 2 or 3 scoring memcgs
> when we reclaim memory. After reclaiming some pages, recalculate
> the scores of just these top lists, and see if the list we started
> out with now has a lower score than the second one.
>
> Once we have reclaimed some from each of the 2 or 3 lists, we can
> go back and find the highest priority lists again.

Sounds like quite a lot of calculation to pick which memcg to reclaim
from, and I wonder if that is necessary at all.
For most of the use cases, we don't need to pick the lowest-scoring
memcg to reclaim from first. My understanding is that if we can respect
(lru_size - softlimit) when calculating nr_to_scan, that is already a
good move from what we have today.

If so, can we still do the round-robin fashion in shrink_zone() and, for
each memcg, calculate nr_to_scan similarly to what get_scan_count() does
today but with the new formula? For a memcg under its softlimit, we
avoid reclaiming its pages unless no more pages can be reclaimed
elsewhere, and only then start reclaiming from groups under the
softlimit. That part can use the same logic, depending on
(softlimit - lru_size).

--Ying

Can we do something similar to the
>
> Direct reclaim only reclaims a little bit at a time, anyway.
>
> For kswapd, we could also remember the number of pages the group
> has in excess of its soft limit, and recalculate after that...
>
> --
> All rights reversed


* Re: [PATCH V7 2/2] mm: memcg detect no memcgs above softlimit under zone reclaim
  2012-08-06 21:18                       ` Ying Han
@ 2012-08-06 22:54                         ` Rik van Riel
  0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2012-08-06 22:54 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, KOSAKI Motohiro, Andrew Morton,
	linux-mm

On 08/06/2012 05:18 PM, Ying Han wrote:
> On Mon, Aug 6, 2012 at 11:51 AM, Rik van Riel<riel@redhat.com>  wrote:
>> On 08/06/2012 11:11 AM, Michal Hocko wrote:
>>>
>>> On Mon 06-08-12 10:27:25, Rik van Riel wrote:
>>
>>
>>>>> So you think we shouldn't do the full round over memcgs in shrink_zone a
>>>>> and rather do it oom way to pick up a victim and hammer it?
>>>>
>>>>
>>>> Not hammer it too far.  Only until its score ends up well
>>>> below (25% lower?) than that of the second highest scoring
>>>> list.
>>>>
>>>> That way all the lists get hammered a little bit, in turn.
>>>
>>>
>>> How do we provide the soft limit guarantee then?
>>>
>>> [...]
>>
>>
>> The easiest way would be to find the top 2 or 3 scoring memcgs
>> when we reclaim memory. After reclaiming some pages, recalculate
>> the scores of just these top lists, and see if the list we started
>> out with now has a lower score than the second one.
>>
>> Once we have reclaimed some from each of the 2 or 3 lists, we can
>> go back and find the highest priority lists again.
>
> Sounds like quite a lot of calculation to pick which memcg to reclaim
> from, and I wonder if that is necessary at all.
> For most of the use cases, we don't need to pick the lowest score
> memcg to reclaim from first. My understanding is that if we can
> respect the (lru_size - softlimit) to calculate the nr_to_scan, that
> is good move from what we have today.
>
> If so, can we just still do the round-robin fashion in shrink_zone()
> and for each memcg, we calculate the nr_to_scan similar to
> get_scan_count() what have today but w/ the new formula. For memcg
> under its softlimit, we avoid reclaim pages unless no more pages can
> be reclaimed, and then we start reclaiming under the softlimit. That
> part can use the same logic depending on (softlimit - lru_size)

If we do the round robin, we will not know in advance whether
or not there are memcgs over (or under) the softlimit.

Another thing to consider is that the round robin code will
always iterate over the cgroups AND try to reclaim a little
from every one of them.

The first version of my code will just iterate over them to
pick the highest priority cgroups, and will then reclaim from
that (or those) groups.  This is less work than what your
code does right now.

In the future, we can find a way to sort the cgroups (in a
tree?), so we do not have to walk over all of them.

Some workloads have thousands of cgroups on a system.
Iterating over all of them is not going to scale; it is an
inconvenience when just calculating the priority, and has the
potential to be a total disaster when doing a little bit of reclaim
from every one of them.

Lets look at this one step at a time.

-- 
All rights reversed

