* 2.4.15pre6aa1 (fixes google VM problem)
@ 2001-11-18 8:24 Andrea Arcangeli
From: Andrea Arcangeli @ 2001-11-18 8:24 UTC (permalink / raw)
To: linux-kernel; +Cc: ben, brownfld, phillips, Linus Torvalds, Marcelo Tosatti
It would be interesting if people experiencing the VM problems
originally reported by google (but also trivially reproducible with
simple cache operations) could verify that this update fixes those
troubles. I wrote some documentation on the bug and the relevant fix in
the vm-14 section below. Thanks.
If all works right, on Monday I will port the fix to mainline (it's
basically only a matter of extracting a few bits from the vm-14 patch;
it's not really controversial, but I haven't had much time to extract it
yet, and the reason it's not in a self-contained patch in the first
place is the way it was written). Comments are welcome of course. I
don't think there's another way around it though: even if we generated a
logical swap cache that is not a function of the swap entry, that still
wouldn't solve the problem of mlocked highmem users [or very frequently
accessed ptes] in the lowmem zones. The lowmem ram wasted for this
purpose is very minor compared to the total waste of all the highmem
zones, and the algorithm I implemented adapts to the amount of highmem,
so the lowmem waste is proportional to the potential highmem waste.
However, the lower_zone_reserve defaults could be changed; I chose the
current defaults in a conservative manner.
URL:
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6aa1.bz2
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6aa1/
Only in 2.4.15pre1aa1: 00_lvm-1.0.1-rc4-3.bz2
Only in 2.4.15pre6aa1: 00_lvm-1.0.1-rc4-4.bz2
Rest of the rc4 diffs rediffed.
Only in 2.4.15pre1aa1: 00_rwsem-fair-23
Only in 2.4.15pre6aa1: 00_rwsem-fair-24
Only in 2.4.15pre1aa1: 00_rwsem-fair-23-recursive-4
Only in 2.4.15pre6aa1: 00_rwsem-fair-24-recursive-5
Rediffed.
Only in 2.4.15pre1aa1: 00_strnlen_user-x86-ret1-1
Merged in mainline.
Only in 2.4.15pre1aa1: 10_lvm-deadlock-fix-1
Now in mainline.
Only in 2.4.15pre1aa1: 10_lvm-incremental-1
Only in 2.4.15pre6aa1: 10_lvm-incremental-2
Part of it in mainline, rediffed the rest.
Only in 2.4.15pre1aa1: 10_vm-13
Only in 2.4.15pre6aa1: 10_vm-14
This should be the first kernel out there without the google VM
troubles (which affect more than just the google testcase). The
broken piece of the VM was this kind of loop in the allocator:
	for (;;) {
		zone_t *z = *(zone++);
		if (!z)
			break;

		if (zone_free_pages(z, order) > z->pages_low) {
			page = rmqueue(z, order);
			if (page)
				return page;
		}
	}
and the above logic is present in all 2.4 kernels out there (2.3 as well).
So the bug has nearly nothing to do with the memory balancing engine, as
most of us would expect; it's an allocator zone-balancing bug instead, in
a piece of code that one would assume to be obviously correct.
The problem comes from the fact that all of ZONE_NORMAL can be allocated
by unfreeable highmem users (like anon pages when no swap is available).
If that happens the machine runs out of memory no matter what (even if
there are 63G of clean cache ready to be freed). Mainline deadlocks
because of the infinite loop in the allocator; -aa was ""correctly""
just killing tasks as soon as the normal zone was filled with mlocked
cache or anon pages with no swap.
The fix is to have a per-classzone, per-zone set of watermarks (see the
zone->watermarks[class_idx] array). Seems to work fine here. Of course
this means potentially wasting some memory when the highmem zone is
huge, but there's no other way around it, and the potential waste of all
the highmem memory is huge compared to a very small waste of the normal
zone. (It could be more fine-grained of course; for example we don't
keep track of whether an allocation will generate a page freeable by the
VM or not, but those are minor issues and not easily solvable anyway [we
pin pages with a get_page and we certainly don't want to migrate pages
across zones within get_page], and the core problem should just be fixed.)
Since the logic is generic and applies also to zone dma vs zone
normal (not only zone normal vs zone highmem), this should be tested a
bit on the lowmem boxes too (I took care of the lowmem boxes in
theory, but I haven't tested them in practice).
In short, we now reserve a part of the lower zones for the lower
classzone allocations. The algorithm I wrote calculates the "reserved
portion" as a function of the size of the higher zone (higher zone means
the "zone" that matches the "classzone"). For example, a 1G machine will
reserve a very small part of zone_normal. A 64G machine is instead going
to reserve all the 800mbyte of zone normal for allocations from the
normal classzone (this is fine because it would be a total waste if a
64G machine risked running OOM because zone normal is all occupied by
unfreeable highmem users that would much better stay in the highmem zone
instead). The ratio between the higher zone size and the reserved lower
zone size is selectable via a boot option a la memfrac= (the new option
is called lower_zone_reserve=). The default values should work well (as
usual they don't need to be perfect, but they can be changed if you've
got suggestions); the boot option is there just in case.
Only in 2.4.15pre6aa1: 10_vm-14-no-anonlru-1
Only in 2.4.15pre6aa1: 10_vm-14-no-anonlru-1-simple-cache-1
Backed out the anon pages from the lru again, mainly to avoid
swapping out too easily and because this is going to be tested on the
big boxes with no swap at all anyway.
Only in 2.4.15pre1aa1: 50_uml-patch-2.4.13-5.bz2
Only in 2.4.15pre6aa1: 50_uml-patch-2.4.14-2.bz2
Jeff's latest uml update.
Only in 2.4.15pre1aa1: 60_tux-2.4.13-ac5-B0.bz2
Only in 2.4.15pre6aa1: 60_tux-2.4.13-ac5-B1.bz2
Ingo's latest tux update.
Andrea
* Re: 2.4.15pre6aa1 (fixes google VM problem)
@ 2001-11-19 17:40 Andrea Arcangeli
From: Andrea Arcangeli @ 2001-11-19 17:40 UTC (permalink / raw)
To: linux-kernel; +Cc: ben, brownfld, phillips, Linus Torvalds, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 460 bytes --]

On Sun, Nov 18, 2001 at 09:24:34AM +0100, Andrea Arcangeli wrote:
> If all works right on Monday I will port the fix to mainline (it's

Ok here it is against 2.4.15pre6:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.15pre6/zone-watermarks-1

(also attached to the email). Untested on top of mainline but should be
safe to apply. It also avoids GFP_ATOMIC allocations from interrupts
eating the PF_MEMALLOC reserve (longstanding fix from Manfred).

Andrea

[-- Attachment #2: zone-watermarks-1 --]
[-- Type: text/plain, Size: 7339 bytes --]

diff -urN 2.4.15pre6/include/linux/mmzone.h google/include/linux/mmzone.h
--- 2.4.15pre6/include/linux/mmzone.h	Tue Nov 13 05:18:58 2001
+++ google/include/linux/mmzone.h	Mon Nov 19 18:19:00 2001
@@ -18,6 +18,11 @@
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
 
+#define ZONE_DMA		0
+#define ZONE_NORMAL		1
+#define ZONE_HIGHMEM		2
+#define MAX_NR_ZONES		3
+
 typedef struct free_area_struct {
 	struct list_head	free_list;
 	unsigned long		*map;
@@ -25,6 +30,10 @@
 
 struct pglist_data;
 
+typedef struct zone_watermarks_s {
+	unsigned long min, low, high;
+} zone_watermarks_t;
+
 /*
  * On machines where it is needed (eg PCs) we divide physical memory
  * into multiple physical zones. On a PC we have 3 zones:
@@ -39,8 +48,17 @@
 	 */
 	spinlock_t		lock;
 	unsigned long		free_pages;
-	unsigned long		pages_min, pages_low, pages_high;
-	int			need_balance;
+
+	/*
+	 * We don't know if the memory that we're going to allocate will be freeable
+	 * or/and it will be released eventually, so to avoid totally wasting several
+	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
+	 * to run OOM on the lower zones despite there's tons of freeable ram
+	 * on the higher zones).
+	 */
+	zone_watermarks_t	watermarks[MAX_NR_ZONES];
+
+	unsigned long		need_balance;
 
 	/*
 	 * free areas of different sizes
@@ -60,6 +78,7 @@
 	 */
 	char			*name;
 	unsigned long		size;
+	unsigned long		realsize;
 } zone_t;
 
 #define ZONE_DMA		0
@@ -113,8 +132,8 @@
 extern int numnodes;
 extern pg_data_t *pgdat_list;
 
-#define memclass(pgzone, classzone)	(((pgzone)->zone_pgdat == (classzone)->zone_pgdat) \
-			&& ((pgzone) <= (classzone)))
+#define zone_idx(zone)			((zone) - (zone)->zone_pgdat->node_zones)
+#define memclass(pgzone, classzone)	(zone_idx(pgzone) <= zone_idx(classzone))
 
 /*
  * The following two are not meant for general usage. They are here as
diff -urN 2.4.15pre6/mm/page_alloc.c google/mm/page_alloc.c
--- 2.4.15pre6/mm/page_alloc.c	Sun Nov 18 06:04:47 2001
+++ google/mm/page_alloc.c	Mon Nov 19 18:09:15 2001
@@ -27,9 +27,10 @@
 pg_data_t *pgdat_list;
 
 static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
-static int zone_balance_ratio[MAX_NR_ZONES] __initdata = { 32, 128, 128, };
+static int zone_balance_ratio[MAX_NR_ZONES] __initdata = { 128, 128, 128, };
 static int zone_balance_min[MAX_NR_ZONES] __initdata = { 20 , 20, 20, };
 static int zone_balance_max[MAX_NR_ZONES] __initdata = { 255 , 255, 255, };
+static int lower_zone_reserve_ratio[MAX_NR_ZONES-1] = { 128, 8 };
 
 /*
  * Free_page() adds the page to the free lists. This is optimized for
@@ -312,16 +313,18 @@
 {
 	zone_t **zone, * classzone;
 	struct page * page;
-	int freed;
+	int freed, class_idx;
 
 	zone = zonelist->zones;
 	classzone = *zone;
+	class_idx = zone_idx(classzone);
+
 	for (;;) {
 		zone_t *z = *(zone++);
 		if (!z)
 			break;
 
-		if (zone_free_pages(z, order) > z->pages_low) {
+		if (zone_free_pages(z, order) > z->watermarks[class_idx].low) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;
@@ -340,7 +343,7 @@
 		if (!z)
 			break;
 
-		min = z->pages_min;
+		min = z->watermarks[class_idx].min;
 		if (!(gfp_mask & __GFP_WAIT))
 			min >>= 2;
 		if (zone_free_pages(z, order) > min) {
@@ -353,7 +356,7 @@
 	/* here we're in the low on memory slow path */
 
 rebalance:
-	if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) {
+	if (current->flags & (PF_MEMALLOC | PF_MEMDIE) && !in_interrupt()) {
 		zone = zonelist->zones;
 		for (;;) {
 			zone_t *z = *(zone++);
@@ -381,7 +384,7 @@
 		if (!z)
 			break;
 
-		if (zone_free_pages(z, order) > z->pages_min) {
+		if (zone_free_pages(z, order) > z->watermarks[class_idx].min) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;
@@ -473,13 +476,15 @@
 	unsigned int sum = 0;
 
 	do {
+		int class_idx;
 		zonelist_t *zonelist = pgdat->node_zonelists + (GFP_USER & GFP_ZONEMASK);
 		zone_t **zonep = zonelist->zones;
 		zone_t *zone;
 
+		class_idx = zone_idx(*zonep);
 		for (zone = *zonep++; zone; zone = *zonep++) {
 			unsigned long size = zone->size;
-			unsigned long high = zone->pages_high;
+			unsigned long high = zone->watermarks[class_idx].high;
 			if (size > high)
 				sum += size - high;
 		}
@@ -525,13 +530,9 @@
 		zone_t *zone;
 		for (zone = tmpdat->node_zones;
 			        zone < tmpdat->node_zones + MAX_NR_ZONES; zone++)
-			printk("Zone:%s freepages:%6lukB min:%6lukB low:%6lukB "
-			       "high:%6lukB\n",
+			printk("Zone:%s freepages:%6lukB\n",
 			       zone->name,
-			       K(zone->free_pages),
-			       K(zone->pages_min),
-			       K(zone->pages_low),
-			       K(zone->pages_high));
+			       K(zone->free_pages));
 		tmpdat = tmpdat->node_next;
 	}
 
@@ -697,6 +698,7 @@
 		zone_t *zone = pgdat->node_zones + j;
 		unsigned long mask;
 		unsigned long size, realsize;
+		int idx;
 
 		realsize = size = zones_size[j];
 		if (zholes_size)
@@ -704,6 +706,7 @@
 
 		printk("zone(%lu): %lu pages.\n", j, size);
 		zone->size = size;
+		zone->realsize = realsize;
 		zone->name = zone_names[j];
 		zone->lock = SPIN_LOCK_UNLOCKED;
 		zone->zone_pgdat = pgdat;
@@ -719,9 +722,29 @@
 			mask = zone_balance_min[j];
 		else if (mask > zone_balance_max[j])
 			mask = zone_balance_max[j];
-		zone->pages_min = mask;
-		zone->pages_low = mask*2;
-		zone->pages_high = mask*3;
+		zone->watermarks[j].min = mask;
+		zone->watermarks[j].low = mask*2;
+		zone->watermarks[j].high = mask*3;
+		/* now set the watermarks of the lower zones in the "j" classzone */
+		for (idx = j-1; idx >= 0; idx--) {
+			zone_t * lower_zone = pgdat->node_zones + idx;
+			unsigned long lower_zone_reserve;
+			if (!lower_zone->size)
+				continue;
+
+			mask = lower_zone->watermarks[idx].min;
+			lower_zone->watermarks[j].min = mask;
+			lower_zone->watermarks[j].low = mask*2;
+			lower_zone->watermarks[j].high = mask*3;
+
+			/* now the brainer part */
+			lower_zone_reserve = realsize / lower_zone_reserve_ratio[idx];
+			lower_zone->watermarks[j].min += lower_zone_reserve;
+			lower_zone->watermarks[j].low += lower_zone_reserve;
+			lower_zone->watermarks[j].high += lower_zone_reserve;
+
+			realsize += lower_zone->realsize;
+		}
 
 		zone->zone_mem_map = mem_map + offset;
 		zone->zone_start_mapnr = offset;
@@ -797,3 +820,16 @@
 }
 
 __setup("memfrac=", setup_mem_frac);
+
+static int __init setup_lower_zone_reserve(char *str)
+{
+	int j = 0;
+
+	while (get_option(&str, &lower_zone_reserve_ratio[j++]) == 2);
+	printk("setup_lower_zone_reserve: ");
+	for (j = 0; j < MAX_NR_ZONES-1; j++) printk("%d ", lower_zone_reserve_ratio[j]);
+	printk("\n");
+	return 1;
+}
+
+__setup("lower_zone_reserve=", setup_lower_zone_reserve);
diff -urN 2.4.15pre6/mm/vmscan.c google/mm/vmscan.c
--- 2.4.15pre6/mm/vmscan.c	Sun Nov 18 06:04:47 2001
+++ google/mm/vmscan.c	Mon Nov 19 18:10:19 2001
@@ -606,11 +606,12 @@
 
 static int check_classzone_need_balance(zone_t * classzone)
 {
-	zone_t * first_classzone;
+	zone_t * first_zone;
+	int class_idx = zone_idx(classzone);
 
-	first_classzone = classzone->zone_pgdat->node_zones;
-	while (classzone >= first_classzone) {
-		if (classzone->free_pages > classzone->pages_high)
+	first_zone = classzone->zone_pgdat->node_zones;
+	while (classzone >= first_zone) {
+		if (classzone->free_pages > classzone->watermarks[class_idx].high)
 			return 0;
 		classzone--;
 	}
* Re: 2.4.15pre6aa1 (fixes google VM problem)
@ 2001-11-19 18:57 Linus Torvalds
From: Linus Torvalds @ 2001-11-19 18:57 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel, ben, brownfld, phillips, Marcelo Tosatti

On Mon, 19 Nov 2001, Andrea Arcangeli wrote:
>
> Ok here it is against 2.4.15pre6:
>
> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.15pre6/zone-watermarks-1

Hmm.. I see what you're trying to do, but it seems overly complicated to
me.

Correct me if I'm wrong, but what you really want to say is basically: if
we're doing an allocation from many zones, we don't want to allow an
allocator that can use big zones to deplete the small zones.

You do this by building up this "per-zone-per-classzone" array, which
basically says that "if you had a big classzone, your minimum requirements
for the next zonelist are higher".

Now, I'd rather look at it from another angle: the fact is that the simple
for-loop that allows any allocator to allocate equal amounts of memory
from any zone it wants is kind of unfair. So the for-loop is arguably
broken.

So we currently have a for-loop that looks like

	for (;;) {
		zone_t *z = *(zone++);
		..
		min = z->pages_low;
		..
	}

and the basic problem is that the above loop doesn't have any "memory": we
really want it to remember the fact that it has had an earlier zone that
was perhaps large, and not just see each new zone as an independent
allocation decision.

So why not have the much simpler patch to just say:

	min = 0;
	for (;;) {
		zone_t *z = *(zone++);
		..
		min = (min >> 2) + z->pages_low;
		..
	}

or similar, that simply _ages_ the "min" according to previous zones that
we've already tried.

That makes the data structures much simpler, and shows much more clearly
what it is we are actually trying to do. We're trying to say that the size
of the previous zones in the allocation list _does_ matter. Basically now
we have a "history" of how much memory we have already looked at.

(The "(min >> 2) + new" is obviously just a first try; I'm not claiming
it's a particularly good aging function, but it's the standard kind of
exponential aging approach.)

With something like the above, the threshold of allocation in smaller
zones is much higher: let's say that your HIGHMEM zone is four times as
big as your NORMAL zone, then a HIGHMEM allocation will want to see twice
as much memory in the NORMAL zone than a NORMAL allocation would want to.

See what I'm saying? The above algorithm more closely follows what we
really want to do, and by doing so it makes the code much simpler to
follow (no "What does this 'z->watermarks[class_idx].low' thing mean?"
questions), not to mention causing simpler data structures etc.

The actual _behaviour_ should be pretty close to yours (modulo the
differences in calculating the watermarks - your "lower_zone_reserve_ratio"
setup is not quite the same thing as just shifting by 2 every time).

		Linus
* Re: 2.4.15pre6aa1 (fixes google VM problem)
@ 2001-11-19 20:38 Linus Torvalds
From: Linus Torvalds @ 2001-11-19 20:38 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel, ben, brownfld, phillips, Marcelo Tosatti

On Mon, 19 Nov 2001, Linus Torvalds wrote:
>
> So why not have the much simpler patch to just say:
>
>	min = 0;
>	for (;;) {
>		zone_t *z = *(zone++);
>		..
>		min = (min >> 2) + z->pages_low;
>		..

Actually, as we already limit "pages_low" (for _all_ zones) through the
use of zone_balance_max[], I don't think we need to even age the minimum
pages.

And instead of doing "zone->free_pages - (1UL << order)" in
zone_free_pages(), we can do it much more efficiently just once for the
for-loop by initializing "min" to "(1UL << order)" instead of zero.

So we'd just make the loop be

	min = (1UL << order);
	for (;;) {
		zone_t *z = *(zone++);
		..
		min += z->pages_low;
		...

instead, which is even simpler (and then just compare z->free_pages
against "min" directly).

		Linus