linux-mm.kvack.org archive mirror
From: Andrea Arcangeli <aarcange@redhat.com>
To: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>, Shaohua Li <shaohua.li@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	"Chen, Tim C" <tim.c.chen@intel.com>
Subject: Re: too big min_free_kbytes
Date: Sat, 29 Jan 2011 20:45:34 +0100
Message-ID: <20110129194534.GX16981@random.random>
In-Reply-To: <4D432D2D.4020504@redhat.com>

On Fri, Jan 28, 2011 at 03:55:09PM -0500, Rik van Riel wrote:
> In that case, every zone will go down to the low watermark
> before kswapd is woken up.

That isn't what happens, though. If it were, we would see free memory
drop back to ~130M, climb to 700M, and drop to 130M again, not stay
stuck at ~700M the whole time as it does below. Example:

 r  b   swpd   free    buff   cache   si   so     bi    bo   in   cs us sy  id wa
 0  0  70512 134940  379408 2753936    0    0    118    71    5    3  2  1  97  1
 0  0  70512 134808  379408 2753936    0    0      0     0   54   48  0  0 100  0
 0  1  70512 131228  383448 2753928    0    0   4160    68  149  172  0  0  99  1
 0  1  70512 276548  502184 2495564    0    0 118784    36 1357 2084  0  5  73 21
 1  1  70512 507932  624128 2151616    0    0 121984     0 1521 2166  0  6  77 17
 0  1  70512 699264  746484 1860468    0    0 122368     4 1443 2242  0  5  74 20
 0  1  70512 727040  865936 1722716    0    0 119552     0 1344 2194  0  5  75 21
 0  1  70512 733116  984396 1610292    0    0 118528     0 1311 2139  0  4  76 20
 1  0  70512 724064 1102864 1510256    0    0 118528     0 1302 2132  0  4  75 21
 1  0  70512 728900 1224312 1394328    0    0 121472     0 1395 2168  0  4  77 19
 1  0  70512 733736 1337224 1286852    0    0 115840    40 1404 2074  0  4  74 22

> At that point, kswapd will reclaim until every zone is at
> the high watermark, and go back to sleep.
> 
> There is no "free up to high + gap" in your scenario.

Well, there clearly is, judging from vmstat... I think you should be
able to reproduce it if you boot with something like mem=4200m; the
workload is a simple "cp /dev/sda /dev/null".

Maybe we're waking kswapd too soon. But kswapd definitely goes to
sleep; in fact it sleeps most of the time and only runs every once in
a while, and it's unclear why free memory never gets back to the 130M
level where it usually sits when there's no intensive read I/O like
the one shown above. For now, given what I see, I have to assume
kswapd is woken too soon, and not only when all wmarks reach low, or
free memory wouldn't be stuck at ~700M the whole time cp runs.

If kswapd is woken up too soon, to me that is a separate problem, and
I still don't see a significant benefit in having any "gap" bigger
than "high-low" there...

Like you said, kswapd shouldn't run until we hit the low wmark again
on all zones, and I think that's more than enough without any more
"gap" than the default "high-low" gap already available for the lower
zones. If a zone is bigger (like the below4g zone above) its wmark
will be bigger relative to the other zones. So when kswapd is woken up
because all zones reach the low wmark (we agree this is what should
happen even if it doesn't look like it's working right with "cp"),
assuming all cache is clean and immediately freeable, kswapd will have
to invoke shrink_cache more times for the below4g zone. This "gap"
added to "high-low" will make the above4g lru rotate more times than
needed to reach the high wmark. But we allocated only a "high-low"
amount of cache in the above4g zone lru. So I'm not sure that
shrinking more than "high-low" from it is right even from a balancing
perspective, in the absolutely trivial case of just 1 wakeup every
time all zones hit the low wmark.
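
The arithmetic of that trivial case can be illustrated with a toy
model. This is only a sketch under stated assumptions, not kernel
code: simulate_wakeup() is a hypothetical name, the watermark values
are made up, and the model assumes kswapd shrinks every zone in fixed
batches while the zone is below "high + gap", and stops once every
zone has reached its high wmark.

```python
def simulate_wakeup(zones, gap, batch=32):
    """Model one kswapd wakeup with every zone starting at its low wmark.

    zones maps a zone name to (low, high) in pages. Returns the number
    of pages reclaimed from each zone. Toy model only: all cache is
    assumed clean and immediately freeable.
    """
    free = {name: low for name, (low, high) in zones.items()}
    reclaimed = {name: 0 for name in zones}
    # Keep going until every zone has reached its high watermark.
    while any(free[name] < zones[name][1] for name in zones):
        for name, (low, high) in zones.items():
            # A zone keeps being shrunk while it is below high + gap.
            if free[name] < high + gap:
                step = min(batch, high + gap - free[name])
                free[name] += step
                reclaimed[name] += step
    return reclaimed


# Hypothetical watermarks: a big below4g zone and a tiny above4g zone.
ZONES = {"below4g": (10000, 12000), "above4g": (1000, 1200)}

no_gap = simulate_wakeup(ZONES, gap=0)
with_gap = simulate_wakeup(ZONES, gap=1000)
```

With these made-up numbers and gap=0, each zone gives back exactly its
own "high-low" (2000 and 200 pages). With gap=1000 the above4g zone
loses 1200 pages, six times the "high-low" that was allocated into it,
simply because the below4g zone needs more passes to reach high.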

At the same time, if kswapd frees memory at the same rate that an
over4g allocator is allocating it, kswapd won't go to sleep and there
will be no rotation in the below4g lru at all. This is similar to what
we see above in fact, except that for me kswapd goes to sleep because
cp isn't fast enough, but a page fault could trigger it and prevent
the lru of the lower zones from ever rotating (simulating a too-early
kswapd wakeup, by just not letting kswapd go to sleep and keeping it
hitting on the high-low range of the over4g zone). So you see, there
is no really reliable way to get balancing guarantees from kswapd, and
for the trivial case where there is no concurrency between the
allocator and kswapd freeing, rotating the tiny above4g lru by more
than "high-low", even though we only allocated "high-low" cache into
it, doesn't sound obviously right either. A bigger gap looks to me
like it will do more harm than good, and if we need a real guarantee
of balancing we should rotate the allocations across the zones (a
bigger lru in a zone will require it to be hit more frequently because
it'll rotate slower than the other zones; the bias should not even
depend on the zone size but on the lru size).
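
What "rotating the allocations across the zones" with a bias on the
lru size could look like is sketched below. It is purely illustrative:
pick_zone() and spread_allocations() are hypothetical names, the lru
sizes are invented, and nothing here describes how the page allocator
picks zones today.

```python
from fractions import Fraction


def pick_zone(lru_size, allocated):
    """Pick the zone whose allocation count lags most behind its lru share.

    lru_size maps zone name -> lru pages (assumed > 0); allocated maps
    zone name -> allocations handed out so far. Exact fractions keep
    the comparison free of floating point rounding.
    """
    return min(
        lru_size,
        key=lambda zone: Fraction(allocated[zone] + 1, lru_size[zone]),
    )


def spread_allocations(lru_size, count):
    """Hand out `count` allocations, biased by lru size, not zone size."""
    allocated = {zone: 0 for zone in lru_size}
    for _ in range(count):
        allocated[pick_zone(lru_size, allocated)] += 1
    return allocated


# Hypothetical lru sizes in pages: the below4g lru is 9x the above4g one.
spread = spread_allocations({"below4g": 900000, "above4g": 100000}, 1000)
```

With a 9:1 lru ratio the 1000 allocations split 900/100, so each lru
turns over at the same relative rate without depending on how often
kswapd happens to run.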

So for now it's all statistical, but I doubt that the "gap" shrunk in
addition to the max "high-low" cache allocated is providing any
benefit.

Even in the non-racing case, all I can see is the smaller zones (which
satisfy the "high" wmark faster than the bigger zones, and which
statistically should get a smaller lru too) being lru-rotated way more
than their small "high-low". Smaller zones should be rotated in
proportion to their small "high-low" only, and not potentially by as
much as the biggest "high-low" of the biggest zone.


