From: Wu Fengguang <fengguang.wu@intel.com>
To: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Andrew Morton <akpm@linux-foundation.org>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Rik van Riel <riel@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>, Avi Kivity <avi@redhat.com>,
Andrea Arcangeli <aarcange@redhat.com>,
"Dike, Jeffrey G" <jeffrey.g.dike@intel.com>,
Hugh Dickins <hugh.dickins@tiscali.co.uk>,
Christoph Lameter <cl@linux-foundation.org>,
Mel Gorman <mel@csn.ul.ie>, LKML <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
"nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>,
"lizf@cn.fujitsu.com" <lizf@cn.fujitsu.com>,
"menage@google.com" <menage@google.com>
Subject: Re: [PATCH -v2] mm: do batched scans for mem_cgroup
Date: Fri, 21 Aug 2009 09:46:28 +0800 [thread overview]
Message-ID: <20090821014628.GA31483@localhost> (raw)
In-Reply-To: <20090821013926.GA30823@localhost>
On Fri, Aug 21, 2009 at 09:39:26AM +0800, Wu Fengguang wrote:
> On Thu, Aug 20, 2009 at 01:16:56PM +0800, Balbir Singh wrote:
> > * Wu Fengguang <fengguang.wu@intel.com> [2009-08-20 12:05:33]:
> >
> > > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> > > > On Thu, 20 Aug 2009 10:49:29 +0800
> > > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > >
> > > > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > > > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > > > > scan rate by up to 32 times.
> > > > >
> > > > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > > > >
> > > > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > > > accurate, however we can tolerate small errors and the resulted small
> > > > > imbalanced scan rates between zones.
> > > > >
> > > > > This batching won't blur up the cgroup limits, since it is driven by
> > > > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > > > decides to cancel (and save) one smallish scan, it may well be called
> > > > > again to accumulate up nr_saved_scan.
> > > > >
> > > > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > > > >
> > > > > CC: Rik van Riel <riel@redhat.com>
> > > > > CC: Minchan Kim <minchan.kim@gmail.com>
> > > > > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > > ---
> > > >
> > > > Hmm, how about this ?
> > > > ==
> > > > Now, nr_saved_scan is tied to zone's LRU.
> > > > But, considering how vmscan works, it should be tied to reclaim_stat.
> > > >
> > > > By this, memcg can make use of nr_saved_scan information seamlessly.
> > >
> > > Good idea, full patch updated with your signed-off-by :)
> > >
> > > Thanks,
> > > Fengguang
> > > ---
> > > mm: do batched scans for mem_cgroup
> > >
> > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > > scan rate by up to 32 times.
> > >
> > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > >
> > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > accurate, however we can tolerate small errors and the resulted small
> > > imbalanced scan rates between zones.
> > >
> > > This batching won't blur up the cgroup limits, since it is driven by
> > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > decides to cancel (and save) one smallish scan, it may well be called
> > > again to accumulate up nr_saved_scan.
> > >
> > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > >
> >
> > Looks good to me, how did you test it?
>
> I observed the shrink_inactive_list() calls with this patch:
>
> @@ -1043,6 +1043,13 @@ static unsigned long shrink_inactive_lis
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int lumpy_reclaim = 0;
>
> + if (!scanning_global_lru(sc))
> + printk("shrink inactive %s count=%lu scan=%lu\n",
> + file ? "file" : "anon",
> + mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone,
> + LRU_INACTIVE_ANON + 2 * !!file),
> + max_scan);
>
> and these commands:
>
> mkdir /cgroup/0
> echo 100M > /cgroup/0/memory.limit_in_bytes
> echo $$ > /cgroup/0/tasks
> cp /tmp/10G /dev/null
And I can reduce the limit to 1M and 500K without triggering OOM:
[ 963.329746] shrink inactive file count=201 scan=32
[ 963.335076] shrink inactive file count=177 scan=15
[ 963.350719] shrink inactive file count=201 scan=32
[ 963.356020] shrink inactive file count=177 scan=15
[ 963.371914] shrink inactive file count=201 scan=32
[ 963.377225] shrink inactive file count=177 scan=15
[ 963.393022] shrink inactive file count=201 scan=32
[ 963.398362] shrink inactive file count=177 scan=15
[ 1103.951251] shrink inactive file count=70 scan=32
[ 1104.054242] shrink inactive file count=46 scan=32
[ 1104.077381] shrink inactive file count=70 scan=32
[ 1104.083095] shrink inactive file count=73 scan=32
[ 1104.088513] shrink inactive file count=45 scan=2
[ 1104.113545] shrink inactive file count=70 scan=32
[ 1104.118915] shrink inactive file count=73 scan=32
[ 1104.124612] shrink inactive file count=45 scan=2
[ 1104.130093] shrink inactive file count=69 scan=32
So the patch is pretty safe for tiny mem cgroups.
Thanks,
Fengguang
> before patch:
>
> [ 3682.646008] shrink inactive file count=25535 scan=6
> [ 3682.661548] shrink inactive file count=25535 scan=6
> [ 3682.666933] shrink inactive file count=25535 scan=6
> [ 3682.682865] shrink inactive file count=25535 scan=6
> [ 3682.688572] shrink inactive file count=25535 scan=6
> [ 3682.703908] shrink inactive file count=25535 scan=6
> [ 3682.709431] shrink inactive file count=25535 scan=6
>
> after patch:
>
> [ 223.146544] shrink inactive file count=25531 scan=32
> [ 223.152060] shrink inactive file count=25507 scan=10
> [ 223.167503] shrink inactive file count=25531 scan=32
> [ 223.173426] shrink inactive file count=25507 scan=10
> [ 223.188764] shrink inactive file count=25531 scan=32
> [ 223.194270] shrink inactive file count=25507 scan=10
> [ 223.209885] shrink inactive file count=25531 scan=32
> [ 223.215388] shrink inactive file count=25507 scan=10
>
> Before patch, the inactive list is over scanned by 30/6=5 times;
> After patch, it is over scanned by 64/42=1.5 times. It's much better,
> and can be further improved if necessary.
>
> > Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Thanks!
WARNING: multiple messages have this Message-ID (diff)
From: Wu Fengguang <fengguang.wu@intel.com>
To: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Andrew Morton <akpm@linux-foundation.org>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Rik van Riel <riel@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>, Avi Kivity <avi@redhat.com>,
Andrea Arcangeli <aarcange@redhat.com>,
"Dike, Jeffrey G" <jeffrey.g.dike@intel.com>,
Hugh Dickins <hugh.dickins@tiscali.co.uk>,
Christoph Lameter <cl@linux-foundation.org>,
Mel Gorman <mel@csn.ul.ie>, LKML <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
"nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>,
"lizf@cn.fujitsu.com" <lizf@cn.fujitsu.com>,
"menage@google.com" <menage@google.com>
Subject: Re: [PATCH -v2] mm: do batched scans for mem_cgroup
Date: Fri, 21 Aug 2009 09:46:28 +0800 [thread overview]
Message-ID: <20090821014628.GA31483@localhost> (raw)
In-Reply-To: <20090821013926.GA30823@localhost>
On Fri, Aug 21, 2009 at 09:39:26AM +0800, Wu Fengguang wrote:
> On Thu, Aug 20, 2009 at 01:16:56PM +0800, Balbir Singh wrote:
> > * Wu Fengguang <fengguang.wu@intel.com> [2009-08-20 12:05:33]:
> >
> > > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> > > > On Thu, 20 Aug 2009 10:49:29 +0800
> > > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > >
> > > > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > > > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > > > > scan rate by up to 32 times.
> > > > >
> > > > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > > > >
> > > > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > > > accurate, however we can tolerate small errors and the resulted small
> > > > > imbalanced scan rates between zones.
> > > > >
> > > > > This batching won't blur up the cgroup limits, since it is driven by
> > > > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > > > decides to cancel (and save) one smallish scan, it may well be called
> > > > > again to accumulate up nr_saved_scan.
> > > > >
> > > > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > > > >
> > > > > CC: Rik van Riel <riel@redhat.com>
> > > > > CC: Minchan Kim <minchan.kim@gmail.com>
> > > > > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > > ---
> > > >
> > > > Hmm, how about this ?
> > > > ==
> > > > Now, nr_saved_scan is tied to zone's LRU.
> > > > But, considering how vmscan works, it should be tied to reclaim_stat.
> > > >
> > > > By this, memcg can make use of nr_saved_scan information seamlessly.
> > >
> > > Good idea, full patch updated with your signed-off-by :)
> > >
> > > Thanks,
> > > Fengguang
> > > ---
> > > mm: do batched scans for mem_cgroup
> > >
> > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > > scan rate by up to 32 times.
> > >
> > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > >
> > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > accurate, however we can tolerate small errors and the resulted small
> > > imbalanced scan rates between zones.
> > >
> > > This batching won't blur up the cgroup limits, since it is driven by
> > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > decides to cancel (and save) one smallish scan, it may well be called
> > > again to accumulate up nr_saved_scan.
> > >
> > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > >
> >
> > Looks good to me, how did you test it?
>
> I observed the shrink_inactive_list() calls with this patch:
>
> @@ -1043,6 +1043,13 @@ static unsigned long shrink_inactive_lis
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int lumpy_reclaim = 0;
>
> + if (!scanning_global_lru(sc))
> + printk("shrink inactive %s count=%lu scan=%lu\n",
> + file ? "file" : "anon",
> + mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone,
> + LRU_INACTIVE_ANON + 2 * !!file),
> + max_scan);
>
> and these commands:
>
> mkdir /cgroup/0
> echo 100M > /cgroup/0/memory.limit_in_bytes
> echo $$ > /cgroup/0/tasks
> cp /tmp/10G /dev/null
And I can reduce the limit to 1M and 500K without triggering OOM:
[ 963.329746] shrink inactive file count=201 scan=32
[ 963.335076] shrink inactive file count=177 scan=15
[ 963.350719] shrink inactive file count=201 scan=32
[ 963.356020] shrink inactive file count=177 scan=15
[ 963.371914] shrink inactive file count=201 scan=32
[ 963.377225] shrink inactive file count=177 scan=15
[ 963.393022] shrink inactive file count=201 scan=32
[ 963.398362] shrink inactive file count=177 scan=15
[ 1103.951251] shrink inactive file count=70 scan=32
[ 1104.054242] shrink inactive file count=46 scan=32
[ 1104.077381] shrink inactive file count=70 scan=32
[ 1104.083095] shrink inactive file count=73 scan=32
[ 1104.088513] shrink inactive file count=45 scan=2
[ 1104.113545] shrink inactive file count=70 scan=32
[ 1104.118915] shrink inactive file count=73 scan=32
[ 1104.124612] shrink inactive file count=45 scan=2
[ 1104.130093] shrink inactive file count=69 scan=32
So the patch is pretty safe for tiny mem cgroups.
Thanks,
Fengguang
> before patch:
>
> [ 3682.646008] shrink inactive file count=25535 scan=6
> [ 3682.661548] shrink inactive file count=25535 scan=6
> [ 3682.666933] shrink inactive file count=25535 scan=6
> [ 3682.682865] shrink inactive file count=25535 scan=6
> [ 3682.688572] shrink inactive file count=25535 scan=6
> [ 3682.703908] shrink inactive file count=25535 scan=6
> [ 3682.709431] shrink inactive file count=25535 scan=6
>
> after patch:
>
> [ 223.146544] shrink inactive file count=25531 scan=32
> [ 223.152060] shrink inactive file count=25507 scan=10
> [ 223.167503] shrink inactive file count=25531 scan=32
> [ 223.173426] shrink inactive file count=25507 scan=10
> [ 223.188764] shrink inactive file count=25531 scan=32
> [ 223.194270] shrink inactive file count=25507 scan=10
> [ 223.209885] shrink inactive file count=25531 scan=32
> [ 223.215388] shrink inactive file count=25507 scan=10
>
> Before patch, the inactive list is over scanned by 30/6=5 times;
> After patch, it is over scanned by 64/42=1.5 times. It's much better,
> and can be further improved if necessary.
>
> > Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Thanks!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2009-08-21 1:46 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-08-20 2:49 [PATCH] mm: do batched scans for mem_cgroup Wu Fengguang
2009-08-20 2:49 ` Wu Fengguang
2009-08-20 2:52 ` [PATCH] mm: make nr_scan_try_batch() more safe on races Wu Fengguang
2009-08-20 2:52 ` Wu Fengguang
2009-08-20 3:17 ` [RFC][PATCH] mm: remove unnecessary loop inside shrink_inactive_list() Wu Fengguang
2009-08-20 3:17 ` Wu Fengguang
2009-08-21 11:09 ` KOSAKI Motohiro
2009-08-21 11:09 ` KOSAKI Motohiro
2009-08-21 11:22 ` Wu Fengguang
2009-08-21 11:22 ` Wu Fengguang
2009-08-27 0:20 ` KOSAKI Motohiro
2009-08-27 0:20 ` KOSAKI Motohiro
2009-08-20 3:13 ` [PATCH] mm: do batched scans for mem_cgroup KAMEZAWA Hiroyuki
2009-08-20 3:13 ` KAMEZAWA Hiroyuki
2009-08-20 4:05 ` [PATCH -v2] " Wu Fengguang
2009-08-20 4:05 ` Wu Fengguang
2009-08-20 4:06 ` KAMEZAWA Hiroyuki
2009-08-20 4:06 ` KAMEZAWA Hiroyuki
2009-08-20 5:16 ` Balbir Singh
2009-08-20 5:16 ` Balbir Singh
2009-08-21 1:39 ` Wu Fengguang
2009-08-21 1:39 ` Wu Fengguang
2009-08-21 1:46 ` Wu Fengguang [this message]
2009-08-21 1:46 ` Wu Fengguang
2009-08-20 11:01 ` Minchan Kim
2009-08-20 11:01 ` Minchan Kim
2009-08-20 11:49 ` Wu Fengguang
2009-08-20 11:49 ` Wu Fengguang
2009-08-20 12:13 ` Minchan Kim
2009-08-20 12:13 ` Minchan Kim
2009-08-20 12:32 ` Wu Fengguang
2009-08-20 12:32 ` Wu Fengguang
2009-08-21 3:55 ` Minchan Kim
2009-08-21 3:55 ` Minchan Kim
2009-08-21 7:27 ` [PATCH -v2 changelog updated] " Wu Fengguang
2009-08-21 7:27 ` Wu Fengguang
2009-08-21 10:57 ` KOSAKI Motohiro
2009-08-21 10:57 ` KOSAKI Motohiro
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090821014628.GA31483@localhost \
--to=fengguang.wu@intel.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=avi@redhat.com \
--cc=balbir@linux.vnet.ibm.com \
--cc=cl@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=hugh.dickins@tiscali.co.uk \
--cc=jeffrey.g.dike@intel.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lizf@cn.fujitsu.com \
--cc=mel@csn.ul.ie \
--cc=menage@google.com \
--cc=nishimura@mxp.nes.nec.co.jp \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.