* 2.4.15pre6aa1 (fixes google VM problem)
@ 2001-11-18 8:24 Andrea Arcangeli
From: Andrea Arcangeli @ 2001-11-18 8:24 UTC (permalink / raw)
To: linux-kernel; +Cc: ben, brownfld, phillips, Linus Torvalds, Marcelo Tosatti
It would be interesting if people experiencing the VM problems
originally reported by google (but also trivially reproducible with
simple cache operations) could verify that this update fixes those
troubles. I wrote some documentation on the bug and the relevant fix in
the vm-14 section below. Thanks.
If all works right, on Monday I will port the fix to mainline (it's
basically only a matter of extracting a few bits from the vm-14 patch;
it's not really controversial, but I haven't had much time to extract it
yet, and the reason it's not in a self-contained patch in the first
place is the way it was written). Comments are welcome of course. I
don't think there's another way around it though: even if we generated a
logical swap cache that is not a function of the swap entry, that still
wouldn't solve the problem of mlocked highmem users [or very frequently
accessed ptes] in the lowmem zones. The lowmem ram wasted for this
purpose is very minor compared to the total waste of all the highmem
zones, and the algorithm I implemented adapts to the amount of highmem,
so the lowmem waste is proportional to the potential highmem waste.
However, the lower_zone_reserve defaults could be changed; I chose the
current defaults in a conservative manner.
URL:
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6aa1.bz2
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6aa1/
Only in 2.4.15pre1aa1: 00_lvm-1.0.1-rc4-3.bz2
Only in 2.4.15pre6aa1: 00_lvm-1.0.1-rc4-4.bz2
Rest of the rc4 diffs rediffed.
Only in 2.4.15pre1aa1: 00_rwsem-fair-23
Only in 2.4.15pre6aa1: 00_rwsem-fair-24
Only in 2.4.15pre1aa1: 00_rwsem-fair-23-recursive-4
Only in 2.4.15pre6aa1: 00_rwsem-fair-24-recursive-5
Rediffed.
Only in 2.4.15pre1aa1: 00_strnlen_user-x86-ret1-1
Merged in mainline.
Only in 2.4.15pre1aa1: 10_lvm-deadlock-fix-1
Now in mainline.
Only in 2.4.15pre1aa1: 10_lvm-incremental-1
Only in 2.4.15pre6aa1: 10_lvm-incremental-2
Part of it in mainline, rediffed the rest.
Only in 2.4.15pre1aa1: 10_vm-13
Only in 2.4.15pre6aa1: 10_vm-14
This should be the first kernel out there without the google VM
troubles (which affect more than just the google testcase). The
broken piece of the VM was this kind of loop in the allocator:
	for (;;) {
		zone_t *z = *(zone++);
		if (!z)
			break;

		if (zone_free_pages(z, order) > z->pages_low) {
			page = rmqueue(z, order);
			if (page)
				return page;
		}
	}
and the above logic is present in all 2.4 kernels out there (2.3 as well).
So the bug has nearly nothing to do with the memory balancing engine, as
most of us would expect; it's an allocator zone-balancing bug instead, in
a piece of code that one would assume to be obviously correct.
The problem comes from the fact that all of ZONE_NORMAL can be allocated
by unfreeable highmem users (like anon pages when no swap is available).
If that happens the machine runs out of memory no matter what (even if
there are 63G of clean cache ready to be freed). Mainline deadlocks
because of the infinite loop in the allocator; -aa was ""correctly""
just killing tasks as soon as the normal zone was filled with mlocked
cache or anon pages with no swap.
The fix is to have a per-classzone, per-zone set of watermarks (see the
zone->watermarks[class_idx] array). Seems to work fine here. Of course
this means potentially wasting some memory when the highmem zone is
huge, but there's no other way around it, and the potential waste of all
the highmem memory is huge compared to a very small waste of the normal
zone. (It could be more fine-grained of course; for example we don't
keep track of whether an allocation will generate a page freeable by the
VM or not, but those are minor issues and not easily solvable anyway [we
pin pages with a get_page and we certainly don't want to migrate pages
across zones within get_page], and the core problem should just be fixed.)
Since the logic is generic and applies also to zone dma vs zone
normal (not only zone normal vs zone highmem), this should be tested a
bit on the lowmem boxes too (I took care of the lowmem boxes in
theory, but I haven't tested them in practice).
In short, we now reserve a part of the lower zones for the lower
classzone allocations. The algorithm I wrote calculates the "reserved
portion" as a function of the size of the higher zone (higher zone means
the "zone" that matches the "classzone"). For example, a 1G machine will
reserve a very small part of zone_normal. A 64G machine is instead going
to reserve all the 800mbyte of zone normal for allocations from the
normal classzone (this is fine because it would be a total waste if a
64G machine risked running OOM because zone normal is all occupied by
unfreeable highmem users that would much better stay in the highmem zone
instead). The ratio between the higher zone size and the reserved lower
zone size is selectable via a boot option a la memfrac= (the new option
is called lower_zone_reserve=). The default values should work well (as
usual they don't need to be perfect, but they can be changed if you've
got suggestions); the boot option is there just in case.
Only in 2.4.15pre6aa1: 10_vm-14-no-anonlru-1
Only in 2.4.15pre6aa1: 10_vm-14-no-anonlru-1-simple-cache-1
Backed out the anon pages from the lru again, mainly to avoid
swapping out too easily and because this is going to be tested on the
big boxes with no swap at all anyway.
Only in 2.4.15pre1aa1: 50_uml-patch-2.4.13-5.bz2
Only in 2.4.15pre6aa1: 50_uml-patch-2.4.14-2.bz2
Jeff's latest uml update.
Only in 2.4.15pre1aa1: 60_tux-2.4.13-ac5-B0.bz2
Only in 2.4.15pre6aa1: 60_tux-2.4.13-ac5-B1.bz2
Ingo's latest tux update.
Andrea
* Re: 2.4.15pre6aa1 (fixes google VM problem)
@ 2001-11-19 17:40 Andrea Arcangeli
From: Andrea Arcangeli @ 2001-11-19 17:40 UTC (permalink / raw)
To: linux-kernel; +Cc: ben, brownfld, phillips, Linus Torvalds, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 460 bytes --]

On Sun, Nov 18, 2001 at 09:24:34AM +0100, Andrea Arcangeli wrote:
> If all works right on Monday I will port the fix to mainline (it's

Ok here it is against 2.4.15pre6:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.15pre6/zone-watermarks-1

(also attached to the email). Untested on top of mainline but should be
safe to apply. It also avoids GFP_ATOMIC allocations from interrupts
eating the PF_MEMALLOC reserve (longstanding fix from Manfred).

Andrea

[-- Attachment #2: zone-watermarks-1 --]
[-- Type: text/plain, Size: 7339 bytes --]

diff -urN 2.4.15pre6/include/linux/mmzone.h google/include/linux/mmzone.h
--- 2.4.15pre6/include/linux/mmzone.h	Tue Nov 13 05:18:58 2001
+++ google/include/linux/mmzone.h	Mon Nov 19 18:19:00 2001
@@ -18,6 +18,11 @@
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
 
+#define ZONE_DMA		0
+#define ZONE_NORMAL		1
+#define ZONE_HIGHMEM		2
+#define MAX_NR_ZONES		3
+
 typedef struct free_area_struct {
 	struct list_head	free_list;
 	unsigned long		*map;
@@ -25,6 +30,10 @@
 
 struct pglist_data;
 
+typedef struct zone_watermarks_s {
+	unsigned long min, low, high;
+} zone_watermarks_t;
+
 /*
  * On machines where it is needed (eg PCs) we divide physical memory
  * into multiple physical zones. On a PC we have 3 zones:
@@ -39,8 +48,17 @@
 	 */
 	spinlock_t		lock;
 	unsigned long		free_pages;
-	unsigned long		pages_min, pages_low, pages_high;
-	int			need_balance;
+
+	/*
+	 * We don't know if the memory that we're going to allocate will be freeable
+	 * or/and it will be released eventually, so to avoid totally wasting several
+	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
+	 * to run OOM on the lower zones despite there's tons of freeable ram
+	 * on the higher zones).
+	 */
+	zone_watermarks_t	watermarks[MAX_NR_ZONES];
+
+	unsigned long		need_balance;
 
 	/*
 	 * free areas of different sizes
@@ -60,6 +78,7 @@
 	 */
 	char			*name;
 	unsigned long		size;
+	unsigned long		realsize;
 } zone_t;
 
 #define ZONE_DMA		0
@@ -113,8 +132,8 @@
 extern int numnodes;
 extern pg_data_t *pgdat_list;
 
-#define memclass(pgzone, classzone)	(((pgzone)->zone_pgdat == (classzone)->zone_pgdat) \
-			&& ((pgzone) <= (classzone)))
+#define zone_idx(zone)			((zone) - (zone)->zone_pgdat->node_zones)
+#define memclass(pgzone, classzone)	(zone_idx(pgzone) <= zone_idx(classzone))
 
 /*
  * The following two are not meant for general usage. They are here as
diff -urN 2.4.15pre6/mm/page_alloc.c google/mm/page_alloc.c
--- 2.4.15pre6/mm/page_alloc.c	Sun Nov 18 06:04:47 2001
+++ google/mm/page_alloc.c	Mon Nov 19 18:09:15 2001
@@ -27,9 +27,10 @@
 pg_data_t *pgdat_list;
 
 static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
-static int zone_balance_ratio[MAX_NR_ZONES] __initdata = { 32, 128, 128, };
+static int zone_balance_ratio[MAX_NR_ZONES] __initdata = { 128, 128, 128, };
 static int zone_balance_min[MAX_NR_ZONES] __initdata = { 20 , 20, 20, };
 static int zone_balance_max[MAX_NR_ZONES] __initdata = { 255 , 255, 255, };
+static int lower_zone_reserve_ratio[MAX_NR_ZONES-1] = { 128, 8 };
 
 /*
  * Free_page() adds the page to the free lists. This is optimized for
@@ -312,16 +313,18 @@
 {
 	zone_t **zone, * classzone;
 	struct page * page;
-	int freed;
+	int freed, class_idx;
 
 	zone = zonelist->zones;
 	classzone = *zone;
+	class_idx = zone_idx(classzone);
+
 	for (;;) {
 		zone_t *z = *(zone++);
 		if (!z)
 			break;
 
-		if (zone_free_pages(z, order) > z->pages_low) {
+		if (zone_free_pages(z, order) > z->watermarks[class_idx].low) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;
@@ -340,7 +343,7 @@
 		if (!z)
 			break;
 
-		min = z->pages_min;
+		min = z->watermarks[class_idx].min;
 		if (!(gfp_mask & __GFP_WAIT))
 			min >>= 2;
 		if (zone_free_pages(z, order) > min) {
@@ -353,7 +356,7 @@
 	/* here we're in the low on memory slow path */
 
 rebalance:
-	if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) {
+	if (current->flags & (PF_MEMALLOC | PF_MEMDIE) && !in_interrupt()) {
 		zone = zonelist->zones;
 		for (;;) {
 			zone_t *z = *(zone++);
@@ -381,7 +384,7 @@
 		if (!z)
 			break;
 
-		if (zone_free_pages(z, order) > z->pages_min) {
+		if (zone_free_pages(z, order) > z->watermarks[class_idx].min) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;
@@ -473,13 +476,15 @@
 	unsigned int sum = 0;
 
 	do {
+		int class_idx;
 		zonelist_t *zonelist = pgdat->node_zonelists + (GFP_USER & GFP_ZONEMASK);
 		zone_t **zonep = zonelist->zones;
 		zone_t *zone;
 
+		class_idx = zone_idx(*zonep);
 		for (zone = *zonep++; zone; zone = *zonep++) {
 			unsigned long size = zone->size;
-			unsigned long high = zone->pages_high;
+			unsigned long high = zone->watermarks[class_idx].high;
 			if (size > high)
 				sum += size - high;
 		}
@@ -525,13 +530,9 @@
 		zone_t *zone;
 		for (zone = tmpdat->node_zones;
 			        zone < tmpdat->node_zones + MAX_NR_ZONES; zone++)
-			printk("Zone:%s freepages:%6lukB min:%6lukB low:%6lukB "
-			       "high:%6lukB\n",
+			printk("Zone:%s freepages:%6lukB\n",
 			       zone->name,
-			       K(zone->free_pages),
-			       K(zone->pages_min),
-			       K(zone->pages_low),
-			       K(zone->pages_high));
+			       K(zone->free_pages));
 		tmpdat = tmpdat->node_next;
 	}
 
@@ -697,6 +698,7 @@
 		zone_t *zone = pgdat->node_zones + j;
 		unsigned long mask;
 		unsigned long size, realsize;
+		int idx;
 
 		realsize = size = zones_size[j];
 		if (zholes_size)
@@ -704,6 +706,7 @@
 
 		printk("zone(%lu): %lu pages.\n", j, size);
 		zone->size = size;
+		zone->realsize = realsize;
 		zone->name = zone_names[j];
 		zone->lock = SPIN_LOCK_UNLOCKED;
 		zone->zone_pgdat = pgdat;
@@ -719,9 +722,29 @@
 			mask = zone_balance_min[j];
 		else if (mask > zone_balance_max[j])
 			mask = zone_balance_max[j];
-		zone->pages_min = mask;
-		zone->pages_low = mask*2;
-		zone->pages_high = mask*3;
+		zone->watermarks[j].min = mask;
+		zone->watermarks[j].low = mask*2;
+		zone->watermarks[j].high = mask*3;
+		/* now set the watermarks of the lower zones in the "j" classzone */
+		for (idx = j-1; idx >= 0; idx--) {
+			zone_t * lower_zone = pgdat->node_zones + idx;
+			unsigned long lower_zone_reserve;
+			if (!lower_zone->size)
+				continue;
+
+			mask = lower_zone->watermarks[idx].min;
+			lower_zone->watermarks[j].min = mask;
+			lower_zone->watermarks[j].low = mask*2;
+			lower_zone->watermarks[j].high = mask*3;
+
+			/* now the brainer part */
+			lower_zone_reserve = realsize / lower_zone_reserve_ratio[idx];
+			lower_zone->watermarks[j].min += lower_zone_reserve;
+			lower_zone->watermarks[j].low += lower_zone_reserve;
+			lower_zone->watermarks[j].high += lower_zone_reserve;
+
+			realsize += lower_zone->realsize;
+		}
 
 		zone->zone_mem_map = mem_map + offset;
 		zone->zone_start_mapnr = offset;
@@ -797,3 +820,16 @@
 }
 
 __setup("memfrac=", setup_mem_frac);
+
+static int __init setup_lower_zone_reserve(char *str)
+{
+	int j = 0;
+
+	while (get_option(&str, &lower_zone_reserve_ratio[j++]) == 2);
+	printk("setup_lower_zone_reserve: ");
+	for (j = 0; j < MAX_NR_ZONES-1; j++) printk("%d ", lower_zone_reserve_ratio[j]);
+	printk("\n");
+	return 1;
+}
+
+__setup("lower_zone_reserve=", setup_lower_zone_reserve);
diff -urN 2.4.15pre6/mm/vmscan.c google/mm/vmscan.c
--- 2.4.15pre6/mm/vmscan.c	Sun Nov 18 06:04:47 2001
+++ google/mm/vmscan.c	Mon Nov 19 18:10:19 2001
@@ -606,11 +606,12 @@
 
 static int check_classzone_need_balance(zone_t * classzone)
 {
-	zone_t * first_classzone;
+	zone_t * first_zone;
+	int class_idx = zone_idx(classzone);
 
-	first_classzone = classzone->zone_pgdat->node_zones;
-	while (classzone >= first_classzone) {
-		if (classzone->free_pages > classzone->pages_high)
+	first_zone = classzone->zone_pgdat->node_zones;
+	while (classzone >= first_zone) {
+		if (classzone->free_pages > classzone->watermarks[class_idx].high)
 			return 0;
 		classzone--;
 	}
* Re: 2.4.15pre6aa1 (fixes google VM problem)
@ 2001-11-19 18:57 Linus Torvalds
From: Linus Torvalds @ 2001-11-19 18:57 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel, ben, brownfld, phillips, Marcelo Tosatti

On Mon, 19 Nov 2001, Andrea Arcangeli wrote:
>
> Ok here it is against 2.4.15pre6:
>
> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.15pre6/zone-watermarks-1

Hmm.. I see what you're trying to do, but it seems overly complicated to
me.

Correct me if I'm wrong, but what you really want to say is basically: if
we're doing an allocation from many zones, we don't want to allow an
allocator that can use big zones to deplete the small zones.

You do this by building up this "per-zone-per-classzone" array, which
basically says that "if you had a big classzone, your minimum requirements
for the next zonelist are higher".

Now, I'd rather look at it from another angle: the fact is that the simple
for-loop that allows any allocator to allocate equal amounts of memory
from any zone it wants is kind of unfair. So the for-loop is arguably
broken.

So we currently have a for-loop that looks like

	for (;;) {
		zone_t *z = *(zone++);
		..
		min = z->pages_low;
		..
	}

and the basic problem is that the above loop doesn't have any "memory": we
really want it to remember the fact that it has had an earlier zone that
was perhaps large, and not just see each new zone as an independent
allocation decision.

So why not have the much simpler patch to just say:

	min = 0;
	for (;;) {
		zone_t *z = *(zone++);
		..
		min = (min >> 2) + z->pages_low;
		..
	}

or similar, that simply _ages_ the "min" according to previous zones that
we've already tried.

That makes the data structures much simpler, and shows much more clearly
what it is we are actually trying to do. We're trying to say that the size
of the previous zones in the allocation list _does_ matter. Basically now
we have a "history" of how much memory we have already looked at.

(The "(min >> 2) + new" is obviously just a first try; I'm not claiming
it's a particularly good aging function, but it's the standard kind of
exponential aging approach.)

With something like the above, the threshold of allocation in smaller
zones is much higher: let's say that your HIGHMEM zone is four times as
big as your NORMAL zone, then a HIGHMEM allocation will want to see twice
as much memory in the NORMAL zone than a NORMAL allocation would want to.

See what I'm saying? The above algorithm more closely follows what we
really want to do, and by doing so it makes the code much simpler to
follow (no "What does this 'z->watermarks[class_idx].low' thing mean?"
questions), not to mention causing simpler data structures etc.

The actual _behaviour_ should be pretty close to yours (modulo the
differences in calculating the watermarks - your "lower_zone_reserve_ratio"
setup is not quite the same thing as just shifting by 2 every time).

		Linus
* Re: 2.4.15pre6aa1 (fixes google VM problem)
@ 2001-11-19 20:38 Linus Torvalds
From: Linus Torvalds @ 2001-11-19 20:38 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel, ben, brownfld, phillips, Marcelo Tosatti

On Mon, 19 Nov 2001, Linus Torvalds wrote:
>
> So why not have the much simpler patch to just say:
>
>	min = 0;
>	for (;;) {
>		zone_t *z = *(zone++);
>		..
>		min = (min >> 2) + z->pages_low;
>		..

Actually, as we already limit "pages_low" (for _all_ zones) through the
use of zone_balance_max[], I don't think we need to even age the minimum
pages.

And instead of doing "zone->free_pages - (1UL << order)" in
zone_free_pages(), we can do it much more efficiently just once for the
for-loop by initializing "min" to "(1UL << order)" instead of zero.

So we'd just make the loop be

	min = (1UL << order);
	for (;;) {
		zone_t *z = *(zone++);
		..
		min += z->pages_low;
		...

instead, which is even simpler (and then just compare z->free_pages
against "min" directly).

		Linus