From: Con Kolivas <kernel@kolivas.org>
To: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@osdl.org>,
ck@vds.kolivas.org, linux list <linux-kernel@vger.kernel.org>,
linux-mm@kvack.org
Subject: Re: [PATCH] mm: limit lowmem_reserve
Date: Thu, 18 May 2006 00:11:41 +1000
Message-ID: <200605180011.43216.kernel@kolivas.org>
In-Reply-To: <443710F7.3040201@yahoo.com.au>
Sorry to resuscitate this old thread, but I'm still not sure we resolved it,
and I want to check whether this issue really exists as I see it.
On Saturday 08 April 2006 11:25, Nick Piggin wrote:
> Con Kolivas wrote:
> > Ok. I think I presented enough information for why I thought
> > zone_watermark_ok would fail (for ZONE_DMA). With 16MB ZONE_DMA and a
> > vmsplit of 3GB we have a lowmem_reserve of 12MB. It's pretty hard to keep
> > that much ZONE_DMA free, I don't think I've ever seen that much free on
> > my ZONE_DMA on an ordinary desktop without any particular ZONE_DMA users.
> > Changing the tunable can make the lowmem_reserve larger than ZONE_DMA is
> > on any vmsplit too as far as I understand the ratio.
>
> Umm, for ZONE_DMA allocations, ZONE_DMA isn't a lower zone. So that
> 12MB protection should never come into it (unless it is buggy?).
An i386 PC with a 3GB split will have approx
4000 pages in ZONE_DMA
and the lowmem reserve setup will set ZONE_DMA's lowmem_reserve to approx
0 0 3000 3000
So if we call zone_watermark_ok with a zone of ZONE_DMA and a classzone_idx of
ZONE_NORMAL, the test will fail almost always, since it's almost impossible to
have 3000 free ZONE_DMA pages. I believe it can happen like this:
In balance_pgdat (vmscan.c:1116) if we end up with end_zone being a
ZONE_NORMAL zone, then during the scan below we (vmscan.c:1137) iterate over
all zones from 0 to end_zone and (vmscan.c:1147) we end up calling
if (!zone_watermark_ok(zone, order, zone->pages_high, end_zone, 0))
which would now call zone_watermark_ok with zone being a ZONE_DMA, and
end_zone being the idx of a ZONE_NORMAL.
So in summary if I'm not mistaken (and I'm good at being mistaken), if we
balance pgdat and find that ZONE_NORMAL or higher needs scanning, we'll end
up trying to flush the crap out of ZONE_DMA.
On my test case this indeed happens and my ZONE_DMA never goes below 3000
pages free. If I lower the reserve even further my pages free gets stuck at
3208 and can't free any more, and doesn't ever drop below that either.
Here is the patch I was proposing
---
It is possible with a low enough lowmem_reserve ratio to make
zone_watermark_ok fail repeatedly if the lower_zone is small enough.
Impose a lower limit on the ratio to only allow 1/4 of the lower_zone
size to be set as lowmem_reserve. This limit is hit in ZONE_DMA by changing
the default vmsplit on i386 even without changing the default sysctl values.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
---
mm/page_alloc.c | 24 +++++++++++++++++++++---
1 files changed, 21 insertions(+), 3 deletions(-)
Index: linux-2.6.17-rc1-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.17-rc1-mm1.orig/mm/page_alloc.c 2006-04-06 10:32:31.000000000 +1000
+++ linux-2.6.17-rc1-mm1/mm/page_alloc.c 2006-04-06 11:28:11.000000000 +1000
@@ -2566,14 +2566,32 @@ static void setup_per_zone_lowmem_reserv
 			zone->lowmem_reserve[j] = 0;
 
 			for (idx = j-1; idx >= 0; idx--) {
+				unsigned long max_reserve;
+				unsigned long reserve;
 				struct zone *lower_zone;
 
+				lower_zone = pgdat->node_zones + idx;
+				/*
+				 * Put an upper limit on the reserve at 1/4
+				 * the lower_zone size. This prevents large
+				 * zone size differences such as 3G VMSPLIT
+				 * or low sysctl values from making
+				 * zone_watermark_ok always fail. This
+				 * enforces a lower limit on the reserve_ratio
+				 */
+				max_reserve = lower_zone->present_pages / 4;
+
 				if (sysctl_lowmem_reserve_ratio[idx] < 1)
 					sysctl_lowmem_reserve_ratio[idx] = 1;
-
-				lower_zone = pgdat->node_zones + idx;
-				lower_zone->lowmem_reserve[j] = present_pages /
+				reserve = present_pages /
 					sysctl_lowmem_reserve_ratio[idx];
+				if (max_reserve && reserve > max_reserve) {
+					reserve = max_reserve;
+					sysctl_lowmem_reserve_ratio[idx] =
+						present_pages / max_reserve;
+				}
+
+				lower_zone->lowmem_reserve[j] = reserve;
 				present_pages += lower_zone->present_pages;
 			}
 		}
--
-ck