From: Ying Han <yinghan@google.com>
Date: Fri, 13 May 2011 17:29:43 -0700
Subject: Re: [RFC][PATCH 0/7] memcg async reclaim
To: KAMEZAWA Hiroyuki, Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner,
    Michal Hocko, balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
    Greg Thelen

Sorry, forgot to post the script I used to capture the result:

echo $$ >/dev/cgroup/memory/A/tasks
time cat /export/hdc3/dd_A/tf0 > /dev/zero &

sleep 10
echo $$ >/dev/cgroup/memory/tasks

(
while /root/getdelays -dip `pidof cat`;
do
        sleep 10;
done
)

--Ying

On Fri, May 13, 2011 at 5:25 PM, Ying Han <yinghan@google.com> wrote:

> Here are the tests I ran and the results.
>
> On a 32G machine, I created a memcg with a 4G hard limit (limit_in_bytes)
> and ran cat on a 20G file. Then I used getdelays to measure the ttfp
> (try_to_free_pages) "delay average" under RECLAIM. When the workload
> reaches its hard limit and background reclaim is off, each ttfp is
> triggered by a page fault. I would like to demonstrate the "delay average"
> for ttfp (and thus the page fault latency) on a streaming read/write
> workload, and compare it with per-memcg background reclaim enabled.
>
> Notes:
> 1. I applied a patch from Fengguang to getdelays.c which shows the
> average CPU/IO/SWAP/RECLAIM delays in ns.
>
> 2. I used my latest version of the per-memcg-per-kswapd patch for the
> following test. The patch may have improved since then, and I can rerun
> the same test once Kame has his patch ready.
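
Adding a note inline: for anyone who wants to reproduce the Configuration
below, this is roughly the setup behind it, assuming the v1 memory
controller mounted at /dev/cgroup/memory as in this thread.
memory.reclaim_wmarks only exists with the per-memcg-per-kswapd patch
applied, and the exact write syntax for the watermarks is an assumption on
my part, so treat this as a sketch rather than the exact commands:

# Mount the cgroup memory controller (mount point as used in this thread).
mkdir -p /dev/cgroup/memory
mount -t cgroup -o memory none /dev/cgroup/memory

# Create memcg A and set the 4G hard limit.
mkdir /dev/cgroup/memory/A
echo 4294967296 > /dev/cgroup/memory/A/memory.limit_in_bytes

# Background-reclaim watermarks (per-memcg-per-kswapd patch only; the
# "low high" write format here is assumed, not a mainline interface).
echo "4137680896 4085252096" > /dev/cgroup/memory/A/memory.reclaim_wmarks

With that in place, the script at the top of this mail runs unchanged.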
>
> Configuration:
> $ cat /proc/meminfo
> MemTotal:       33045832 kB
>
> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
> 4294967296
>
> $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> low_wmark  4137680896
> high_wmark 4085252096
>
> Test:
> $ echo $$ >/dev/cgroup/memory/A/tasks
> $ cat /export/hdc3/dd_A/tf0 > /dev/zero
>
> Without per-memcg background reclaim:
>
> CPU        count     real total     virtual total   delay total   delay average
>            176589    17248377848    27344548685     1093693318    6193.440ns
> IO         count     delay total    delay average
>            160704    242072632962   1506326ns
> SWAP       count     delay total    delay average
>            0         0              0ns
> RECLAIM    count     delay total    delay average
>            15944     3512140153     220279ns
> cat: read=20947877888, write=0, cancelled_write=0
>
> real    4m26.912s
> user    0m0.227s
> sys     0m27.823s
>
> With per-memcg background reclaim:
>
> $ ps -ef | grep memcg
> root      5803     2  2 13:56 ?        00:04:20 [memcg_4]
>
> CPU        count     real total     virtual total   delay total   delay average
>            161085    13185995424    23863858944     72902585      452.572ns
> IO         count     delay total    delay average
>            160915    246145533109   1529661ns
> SWAP       count     delay total    delay average
>            0         0              0ns
> RECLAIM    count     delay total    delay average
>            0         0              0ns
> cat: read=20974891008, write=0, cancelled_write=0
>
> real    4m26.572s
> user    0m0.246s
> sys     0m24.192s
>
> memcg_4 cputime: 2.86sec
>
> Observations:
> 1. Without background reclaim, cat hits ttfp heavily and the "delay
> average" goes above 220 microseconds.
>
> 2. With background reclaim, the ttfp delay average is always 0. Since
> ttfp happens synchronously, its delay adds directly to the application's
> latency over time.
>
> 3. The real time is slightly better with background reclaim, and the sys
> time is about the same (adding the memcg_4 time on top of cat's sys
> time). But I don't expect a big CPU benefit: async reclaim uses spare CPU
> time to proactively reclaim pages on the side, which guarantees less
> latency variation for the application over time.
>
> --Ying
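
A note on reading the numbers above: getdelays' "delay average" is just
delay total / count, so the RECLAIM row of the run without background
reclaim can be checked by hand:

# 3512140153 ns of RECLAIM delay spread over 15944 stalls:
$ echo $(( 3512140153 / 15944 ))
220279            # ns, i.e. ~220 us per direct-reclaim stall, as reported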
>
> On Thu, May 12, 2011 at 10:10 PM, Ying Han <yinghan@google.com> wrote:
>
>> On Thu, May 12, 2011 at 8:03 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>>> On Thu, 12 May 2011 17:17:25 +0900
>>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>
>>> > On Thu, 12 May 2011 13:22:37 +0900
>>> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> >
>>> > I'll check which code in vmscan.c or mm/ affects memcg and post the
>>> > required fixes step by step. I think I found some.
>>>
>>> After some tests, I suspect the 'automatic' one is unnecessary until
>>> memcg's dirty_ratio is supported. And, as Andrew pointed out, the total
>>> CPU consumption is unchanged, and I don't have workloads which show a
>>> meaningful speedup.
>>
>> Total CPU consumption is one way to measure background reclaim; another
>> thing I would like to measure is a histogram of page fault latency for a
>> heavy page-allocating application. I would expect that with background
>> reclaim we get less variation in page fault latency than without it.
>>
>> Sorry, I haven't had a chance to run tests to back that up. I will try
>> to get some data.
>>
>>> But I guess... with dirty_ratio, the amount of dirty pages in a memcg
>>> is limited, and background reclaim can work well enough without the
>>> noise of write_page() while applications are throttled by dirty_ratio.
>>
>> Definitely. I ran into this issue while debugging soft_limit reclaim:
>> background reclaim became very inefficient once the dirty pages grew
>> beyond the soft_limit. Talking with Greg about his per-memcg dirty page
>> limit effort, we should consider setting the dirty ratio such that dirty
>> pages cannot exceed the reclaim watermarks (here, the soft_limit).
>>
>> --Ying
>>
>>> Hmm, I'll study this for a while, but it seems better to start with an
>>> active soft limit (or some threshold users can set) first.
>>>
>>> Anyway, this work made me read vmscan.c carefully, and I think I can
>>> post some patches with fixes and tuning.
>>>
>>> Thanks,
>>> -Kame
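
One more thought on the dirty-limit point above: the constraint would keep
per-memcg dirty pages below the reclaim watermark, so the background
reclaimer always finds clean pages to drop instead of stalling on
writeback. A sketch against Greg's proposed per-memcg dirty-limit
interface; memory.dirty_limit_in_bytes is my assumption for the knob name
from that (unmerged) series:

# Hypothetical: cap dirty memory well below the high watermark (4085252096
# above) so per-memcg background reclaim never has to wait on writeback.
echo 2042626048 > /dev/cgroup/memory/A/memory.dirty_limit_in_bytes  # wmark/2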