* [with-PATCH-really] highmem deadlock removal, balancing & cleanup
@ 2001-05-25 20:00 Rik van Riel
2001-05-25 21:12 ` Linus Torvalds
0 siblings, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2001-05-25 20:00 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
OK, shoot me. Here it is again, this time _with_ patch...
---------- Forwarded message ----------
Date: Fri, 25 May 2001 16:53:38 -0300 (BRST)
From: Rik van Riel <riel@conectiva.com.br>
Hi Linus,
the following patch does:
1) Remove GFP_BUFFER and HIGHMEM related deadlocks, by letting
these allocations fail instead of looping forever in
__alloc_pages() when they cannot make any progress there.
Now Linux no longer hangs on highmem machines with heavy
write loads.
2) Clean up the __alloc_pages() / __alloc_pages_limit() code
a bit, moving the direct reclaim condition from the latter
function into the former so we run it less often ;)
3) Remove the superfluous wakeups from __alloc_pages(): not
only are the tests a real CPU eater, they also have the
potential of waking up bdflush in a situation where it
shouldn't run in the first place. The kswapd wakeup didn't
seem to have any effect either.
4) Do make sure GFP_BUFFER allocations NEVER eat into the
very last pages of the system. It is important to preserve
the following ordering:
- normal allocations
- GFP_BUFFER
- atomic allocations
- other recursive allocations
Using this ordering, we can be pretty sure that eg. a
GFP_BUFFER allocation to swap something out to an
encrypted device won't eat the memory the device driver
will need to perform its functions. It also means that
a gigabit network flood won't eat those pages...
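The ordering in point 4 amounts to a set of nested watermarks: each class of allocation may dig deeper into the zone's reserve than the one before it, so GFP_BUFFER fails before the pages that atomic and recursive (PF_MEMALLOC) allocations depend on are touched. A minimal userspace sketch of the idea (the class names are illustrative; only the GFP_BUFFER 3/4 cutoff mirrors the patch below):

```c
#include <assert.h>

/* Illustrative allocation classes, ordered by how deep into the
 * zone's reserve each one may dig. Userspace model, not kernel code. */
enum alloc_class { ALLOC_NORMAL, ALLOC_BUFFER, ALLOC_ATOMIC, ALLOC_MEMALLOC };

/* pages_min plays the role of the zone's reserve watermark. */
static int may_allocate(enum alloc_class class, int free_pages, int pages_min)
{
	switch (class) {
	case ALLOC_NORMAL:     /* ordinary allocations stop first */
		return free_pages >= pages_min;
	case ALLOC_BUFFER:     /* GFP_BUFFER may not eat the last pages */
		return free_pages >= pages_min * 3 / 4;
	case ALLOC_ATOMIC:     /* atomic allocations may dig deeper */
		return free_pages >= pages_min / 4;
	case ALLOC_MEMALLOC:   /* recursive allocations get the very last pages */
		return free_pages > 0;
	}
	return 0;
}
```

With pages_min at 32 and only 16 pages free, normal and GFP_BUFFER allocations are refused while atomic and PF_MEMALLOC allocations still succeed, which is exactly the ordering that keeps an encrypted-device driver or an interrupt path from being starved by buffer flushing.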
5) Change nr_free_buffer_pages() a bit to not return pages
which cannot be used as buffer pages, this makes a BIG
difference on highmem machines (which now DO have a working
write throttling again).
6) Simplify the refill_inactive() loop enough that it actually
works again. Calling page_launder() and shrink_i/d_memory()
under the same if condition means that the different caches
get balanced against each other again.
The "illogical, but true" argument for not shrinking the slab
cache while we're under a free shortage turned out to be just
plain illogical. All needed buffer heads will have been
allocated in page_launder() and shrink_i/d_memory() before
we get here, and we can be pretty sure that these functions
will keep re-using those same buffer heads as soon as the
IO finishes.
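Point 1 changes the contract for GFP_BUFFER callers: an allocation that previously looped forever inside __alloc_pages() can now return NULL, so the caller has to cope with failure itself. A hedged userspace sketch of that caller-side pattern, with a toy fixed pool standing in for the page allocator (all names here are illustrative, not kernel functions):

```c
#include <assert.h>
#include <stddef.h>

/* Toy fallible allocator backed by a tiny static pool, standing in
 * for a GFP_BUFFER allocation that may now return NULL. */
#define POOL_SIZE 2
static char pool[POOL_SIZE][64];
static int pool_used;

static void *alloc_buffer_page(void)
{
	if (pool_used >= POOL_SIZE)
		return NULL;            /* fail instead of looping forever */
	return pool[pool_used++];
}

/* Caller-side pattern: on failure, fall back to reusing a page whose
 * IO has completed (modeled here as the first pool slot) instead of
 * recursing into more page flushing. */
static void *get_buffer_page(void)
{
	void *page = alloc_buffer_page();
	if (page)
		return page;
	return pool[0];                 /* recycle after IO completion */
}
```

In the toy model the third request recycles slot 0 once the pool is exhausted; in the kernel the analogous fallback is re-using buffer heads freed by completed IO, as the cover letter argues.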
regards,
Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
--- linux-2.4.5-pre6/mm/page_alloc.c.orig Fri May 25 16:13:39 2001
+++ linux-2.4.5-pre6/mm/page_alloc.c Fri May 25 16:35:50 2001
@@ -251,10 +251,10 @@
water_mark = z->pages_high;
}
- if (z->free_pages + z->inactive_clean_pages > water_mark) {
+ if (z->free_pages + z->inactive_clean_pages >= water_mark) {
struct page *page = NULL;
/* If possible, reclaim a page directly. */
- if (direct_reclaim && z->free_pages < z->pages_min + 8)
+ if (direct_reclaim)
page = reclaim_page(z);
/* If that fails, fall back to rmqueue. */
if (!page)
@@ -299,21 +299,6 @@
if (order == 0 && (gfp_mask & __GFP_WAIT))
direct_reclaim = 1;
- /*
- * If we are about to get low on free pages and we also have
- * an inactive page shortage, wake up kswapd.
- */
- if (inactive_shortage() > inactive_target / 2 && free_shortage())
- wakeup_kswapd();
- /*
- * If we are about to get low on free pages and cleaning
- * the inactive_dirty pages would fix the situation,
- * wake up bdflush.
- */
- else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
- && nr_inactive_dirty_pages >= freepages.high)
- wakeup_bdflush(0);
-
try_again:
/*
* First, see if we have any zones with lots of free memory.
@@ -329,7 +314,7 @@
if (!z->size)
BUG();
- if (z->free_pages >= z->pages_low) {
+ if (z->free_pages >= z->pages_min + 8) {
page = rmqueue(z, order);
if (page)
return page;
@@ -443,18 +428,26 @@
}
/*
* When we arrive here, we are really tight on memory.
+ * Since kswapd didn't succeed in freeing pages for us,
+ * we try to help it.
+ *
+ * Single page allocs loop until the allocation succeeds.
+ * Multi-page allocs can fail due to memory fragmentation;
+ * in that case we bail out to prevent infinite loops and
+ * hanging device drivers ...
*
- * We try to free pages ourselves by:
- * - shrinking the i/d caches.
- * - reclaiming unused memory from the slab caches.
- * - swapping/syncing pages to disk (done by page_launder)
- * - moving clean pages from the inactive dirty list to
- * the inactive clean list. (done by page_launder)
+ * Another issue are GFP_BUFFER allocations; because they
+ * do not have __GFP_IO set it's possible we cannot make
+ * any progress freeing pages, in that case it's better
+ * to give up than to deadlock the kernel looping here.
*/
if (gfp_mask & __GFP_WAIT) {
memory_pressure++;
- try_to_free_pages(gfp_mask);
- goto try_again;
+ if (!order || free_shortage()) {
+ int progress = try_to_free_pages(gfp_mask);
+ if (progress || gfp_mask & __GFP_IO)
+ goto try_again;
+ }
}
}
@@ -489,6 +482,10 @@
return page;
}
+ /* Don't let GFP_BUFFER allocations eat all the memory. */
+ if (gfp_mask==GFP_BUFFER && z->free_pages < z->pages_min * 3/4)
+ continue;
+
/* XXX: is pages_min/4 a good amount to reserve for this? */
if (z->free_pages < z->pages_min / 4 &&
!(current->flags & PF_MEMALLOC))
@@ -499,7 +496,7 @@
}
/* No luck.. */
- printk(KERN_ERR "__alloc_pages: %lu-order allocation failed.\n", order);
+// printk(KERN_ERR "__alloc_pages: %lu-order allocation failed.\n", order);
return NULL;
}
@@ -578,34 +575,66 @@
}
/*
+ * Total amount of inactive_clean (allocatable) RAM in a given zone.
+ */
+#ifdef CONFIG_HIGHMEM
+unsigned int nr_free_buffer_pages_zone (int zone_type)
+{
+ pg_data_t *pgdat;
+ unsigned int sum;
+
+ sum = 0;
+ pgdat = pgdat_list;
+ while (pgdat) {
+ sum += (pgdat->node_zones+zone_type)->free_pages;
+ sum += (pgdat->node_zones+zone_type)->inactive_clean_pages;
+ sum += (pgdat->node_zones+zone_type)->inactive_dirty_pages;
+ pgdat = pgdat->node_next;
+ }
+ return sum;
+}
+#endif
+
+/*
* Amount of free RAM allocatable as buffer memory:
+ *
+ * For HIGHMEM systems don't count HIGHMEM pages.
+ * This function is still far from perfect for HIGHMEM systems, but
+ * it is close enough for the time being.
*/
unsigned int nr_free_buffer_pages (void)
{
unsigned int sum;
- sum = nr_free_pages();
- sum += nr_inactive_clean_pages();
+#ifdef CONFIG_HIGHMEM
+ sum = nr_free_buffer_pages_zone(ZONE_NORMAL) +
+ nr_free_buffer_pages_zone(ZONE_DMA);
+#else
+ sum = nr_free_pages() +
+ nr_inactive_clean_pages();
sum += nr_inactive_dirty_pages;
+#endif
/*
* Keep our write behind queue filled, even if
- * kswapd lags a bit right now.
+ * kswapd lags a bit right now. Make sure not
+ * to clog up the whole inactive_dirty list with
+ * dirty pages, though.
*/
- if (sum < freepages.high + inactive_target)
- sum = freepages.high + inactive_target;
+ if (sum < freepages.high + inactive_target / 2)
+ sum = freepages.high + inactive_target / 2;
/*
* We don't want dirty page writebehind to put too
* much pressure on the working set, but we want it
* to be possible to have some dirty pages in the
* working set without upsetting the writebehind logic.
*/
- sum += nr_active_pages >> 4;
+ sum += nr_active_pages >> 5;
return sum;
}
-#if CONFIG_HIGHMEM
+#ifdef CONFIG_HIGHMEM
unsigned int nr_free_highpages (void)
{
pg_data_t *pgdat = pgdat_list;
--- linux-2.4.5-pre6/mm/vmscan.c.orig Fri May 25 16:13:40 2001
+++ linux-2.4.5-pre6/mm/vmscan.c Fri May 25 16:13:52 2001
@@ -865,14 +865,18 @@
/*
* If we're low on free pages, move pages from the
- * inactive_dirty list to the inactive_clean list.
+ * inactive_dirty list to the inactive_clean list
+ * and shrink the inode and dentry caches.
*
* Usually bdflush will have pre-cleaned the pages
* before we get around to moving them to the other
* list, so this is a relatively cheap operation.
*/
- if (free_shortage())
+ if (free_shortage()) {
ret += page_launder(gfp_mask, user);
+ shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
+ shrink_icache_memory(DEF_PRIORITY, gfp_mask);
+ }
/*
* If needed, we move pages from the active list
@@ -882,21 +886,10 @@
ret += refill_inactive(gfp_mask, user);
/*
- * Delete pages from the inode and dentry caches and
- * reclaim unused slab cache if memory is low.
+ * If we're still short on free pages, reclaim unused
+ * slab cache memory.
*/
if (free_shortage()) {
- shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
- shrink_icache_memory(DEF_PRIORITY, gfp_mask);
- } else {
- /*
- * Illogical, but true. At least for now.
- *
- * If we're _not_ under shortage any more, we
- * reap the caches. Why? Because a noticeable
- * part of the caches are the buffer-heads,
- * which we'll want to keep if under shortage.
- */
kmem_cache_reap(gfp_mask);
}
^ permalink raw reply [flat|nested] 64+ messages in thread

* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 20:00 [with-PATCH-really] highmem deadlock removal, balancing & cleanup Rik van Riel
@ 2001-05-25 21:12 ` Linus Torvalds
2001-05-25 22:20 ` Rik van Riel
2001-05-25 22:35 ` Rik van Riel
0 siblings, 2 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-25 21:12 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel

On Fri, 25 May 2001, Rik van Riel wrote:
>
> OK, shoot me. Here it is again, this time _with_ patch...

I'm not going to apply this as long as it plays experimental games with
"shrink_icache()" and friends. I haven't seen anybody comment on the
performance on this, and I can well imagine that it would potentially
shrink the dcache too aggressively when there are lots of inactive-dirty
pages around, where page_launder is the right thing to do, and shrinking
icache/dcache might not be.

I'd really like to avoid having the MM stuff fluctuate too much. Have
people tested this under different loads etc?

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 21:12 ` Linus Torvalds
@ 2001-05-25 22:20 ` Rik van Riel
2001-05-25 23:05 ` Linus Torvalds
2001-05-25 23:13 ` Alan Cox
2001-05-25 22:35 ` Rik van Riel
1 sibling, 2 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-25 22:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:
> On Fri, 25 May 2001, Rik van Riel wrote:
> >
> > OK, shoot me. Here it is again, this time _with_ patch...
>
> I'm not going to apply this as long as it plays experimental games with
> "shrink_icache()" and friends. I haven't seen anybody comment on the
> performance on this,

Yeah, I guess the way Linux 2.2 balances things is way too
experimental ;)

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 22:20 ` Rik van Riel
@ 2001-05-25 23:05 ` Linus Torvalds
2001-05-25 23:13 ` Alan Cox
1 sibling, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-25 23:05 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel

On Fri, 25 May 2001, Rik van Riel wrote:
>
> Yeah, I guess the way Linux 2.2 balances things is way too
> experimental ;)

Ehh.. Take a look at the other differences between the VM's. Which may
make a 2.2.x approach completely bogus.

And take a look at how long the 2.2.x VM took to stabilize, and how
INCREDIBLY BAD some of those kernels were.

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 22:20 ` Rik van Riel
2001-05-25 23:05 ` Linus Torvalds
@ 2001-05-25 23:13 ` Alan Cox
2001-05-25 23:19 ` Rik van Riel
2001-05-26 0:02 ` Linus Torvalds
1 sibling, 2 replies; 64+ messages in thread
From: Alan Cox @ 2001-05-25 23:13 UTC (permalink / raw)
To: Rik van Riel; +Cc: Linus Torvalds, linux-kernel

> On Fri, 25 May 2001, Linus Torvalds wrote:
> > On Fri, 25 May 2001, Rik van Riel wrote:
> > >
> > > OK, shoot me. Here it is again, this time _with_ patch...
> >
> > I'm not going to apply this as long as it plays experimental games with
> > "shrink_icache()" and friends. I haven't seen anybody comment on the
> > performance on this,
>
> Yeah, I guess the way Linux 2.2 balances things is way too
> experimental ;)

Compared to the 2.0 performance - yes. 2.0 is faster than 2.2 is twice the
speed of 2.4 starting X and the session apps on my MediaGX box with 64Mb

But Linus is right I think - VM changes often prove 'interesting'. Test it
in -ac , gets some figures for real world use then plan further

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 23:13 ` Alan Cox
@ 2001-05-25 23:19 ` Rik van Riel
2001-05-26 0:02 ` Linus Torvalds
1 sibling, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-25 23:19 UTC (permalink / raw)
To: Alan Cox; +Cc: Linus Torvalds, linux-kernel

On Sat, 26 May 2001, Alan Cox wrote:

> But Linus is right I think - VM changes often prove
> 'interesting'. Test it in -ac , gets some figures for real world
> use then plan further

Oh well. As long as he takes the patch to page_alloc.c, otherwise
everybody _will_ have to "experiment" with the -ac kernels just to
have a system with highmem which doesn't deadlock ;)

cheers,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 23:13 ` Alan Cox
2001-05-25 23:19 ` Rik van Riel
@ 2001-05-26 0:02 ` Linus Torvalds
2001-05-26 0:07 ` Rik van Riel
2001-05-26 0:29 ` Ben LaHaise
1 sibling, 2 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 0:02 UTC (permalink / raw)
To: Alan Cox; +Cc: Rik van Riel, linux-kernel

On Sat, 26 May 2001, Alan Cox wrote:
>
> But Linus is right I think - VM changes often prove 'interesting'. Test it in
> -ac , gets some figures for real world use then plan further

.. on the other hand, thinking more about this, I'd rather be called
"stupid" than "stodgy". So I think I'll buy some experimentation.

That HIGHMEM change is too ugly to live, though, I'd really like to hear
more about why something that ugly is necessary.

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:02 ` Linus Torvalds
@ 2001-05-26 0:07 ` Rik van Riel
2001-05-26 0:16 ` Linus Torvalds
2001-05-26 0:23 ` Linus Torvalds
1 sibling, 2 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-26 0:07 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:

> So I think I'll buy some experimentation. That HIGHMEM change is
> too ugly to live, though, I'd really like to hear more about why
> something that ugly is necessary.

If you mean the "GFP_BUFFER allocations should fail instead
of looping forever" thing, it is because:

1) GFP_BUFFER allocations are made in order to try to flush
(and free) pages and allocate highmem pages.

2) This is the page flushing equivalent of PF_MEMALLOC, in the
sense that we should not go and try to "recursively" flush more
random pages until we find one that succeeds without an
allocation; instead, we should just break the loop and let the
caller deal with it.

If you mean the change to nr_free_buffer_pages(), that one is
needed because that function should return what the name implies
... returning "2 GB of memory available for dirty buffers" makes
for a completely filled up ZONE_NORMAL and processes which never
get throttled for writing (and instead, end up looping for more
memory and killing performance for the rest of the system).

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 64+ messages in thread
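Rik's second point can be illustrated with a small userspace model: if nr_free_buffer_pages() counts highmem pages, the "memory available for dirty buffers" figure is inflated far beyond what ZONE_NORMAL can actually hold, so write throttling never triggers. A hedged sketch with purely illustrative numbers (the struct and ratio below are not kernel code):

```c
#include <assert.h>

/* Userspace model of per-zone page counts; numbers are illustrative. */
struct zone_model {
	int free_pages;
	int inactive_clean_pages;
	int inactive_dirty_pages;
	int buffer_capable;       /* 0 for highmem-style zones */
};

/* Count only pages that can actually hold buffer memory, as the
 * patched nr_free_buffer_pages() does on CONFIG_HIGHMEM systems. */
static int nr_free_buffer_pages_model(const struct zone_model *zones, int n)
{
	int sum = 0;
	for (int i = 0; i < n; i++) {
		if (!zones[i].buffer_capable)
			continue;
		sum += zones[i].free_pages;
		sum += zones[i].inactive_clean_pages;
		sum += zones[i].inactive_dirty_pages;
	}
	return sum;
}

/* Throttle writers once dirty buffers approach what buffer-capable
 * memory can hold; the 1/2 ratio is an illustrative stand-in for
 * bdflush's threshold logic. */
static int should_throttle(int dirty_pages, int buffer_pages)
{
	return dirty_pages * 2 > buffer_pages;
}
```

With a 200-page buffer-capable zone and a 4000-page highmem-style zone, counting only the former throttles a writer at 150 dirty pages, while counting both would let dirty data fill ZONE_NORMAL completely before throttling ever started.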
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:07 ` Rik van Riel
@ 2001-05-26 0:16 ` Linus Torvalds
2001-05-26 0:23 ` Linus Torvalds
1 sibling, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 0:16 UTC (permalink / raw)
To: Rik van Riel; +Cc: Alan Cox, linux-kernel

On Fri, 25 May 2001, Rik van Riel wrote:
> On Fri, 25 May 2001, Linus Torvalds wrote:
>
> > So I think I'll buy some experimentation. That HIGHMEM change is
> > too ugly to live, though, I'd really like to hear more about why
> > something that ugly is necessary.
>
> If you mean the "GFP_BUFFER allocations should fail instead
> of looping forever" thing, it is because:

No, I was thinking more of the dirty buffer balancing thing. It seems to
have this hardcoded notion of "DMA + NORMAL", which is just wrong. There
could be more zones that are acceptable to buffers, so what it _should_
do is to just walk the zone list that GFP_BUFFER points to.

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:07 ` Rik van Riel
2001-05-26 0:16 ` Linus Torvalds
@ 2001-05-26 0:23 ` Linus Torvalds
2001-05-26 0:26 ` Rik van Riel
1 sibling, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 0:23 UTC (permalink / raw)
To: Rik van Riel; +Cc: Alan Cox, linux-kernel

Oh, also: the logic behind the change of the kmem_cache_reap() - instead
of making it conditional on the _reverse_ test of what it has historically
been, why isn't it just completely unconditional? You've basically
dismissed the only valid reason for it to have been (illogically)
conditional, so I'd have expected that just _removing_ the test is better
than reversing it like your patch does..

No?

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:23 ` Linus Torvalds
@ 2001-05-26 0:26 ` Rik van Riel
2001-05-26 0:30 ` Linus Torvalds
0 siblings, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2001-05-26 0:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:

> Oh, also: the logic behind the change of the kmem_cache_reap() - instead
> of making it conditional on the _reverse_ test of what it has historically
> been, why isn't it just completely unconditional? You've basically
> dismissed the only valid reason for it to have been (illogically)
> conditional, so I'd have expected that just _removing_ the test is better
> than reversing it like your patch does..
>
> No?

The function do_try_to_free_pages() also gets called when we're
only short on inactive pages, but we still have TONS of free
memory. In that case, I don't think we'd actually want to steal
free memory from anyone.

Moving it into the same if() conditional the other memory freeers
are in would make sense, though ...

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:26 ` Rik van Riel
@ 2001-05-26 0:30 ` Linus Torvalds
0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 0:30 UTC (permalink / raw)
To: Rik van Riel; +Cc: Alan Cox, linux-kernel

On Fri, 25 May 2001, Rik van Riel wrote:
>
> The function do_try_to_free_pages() also gets called when we're
> only short on inactive pages, but we still have TONS of free
> memory. In that case, I don't think we'd actually want to steal
> free memory from anyone.

Well, kmem_cache_reap() doesn't even steal memory from anybody: it just
makes this "tagged for xxx" memory be available to "non-xxx" users too.

And the fact that we're calling do_try_to_free_pages() at all obviously
implies that even if we have memory free, it isn't in the right hands..

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:02 ` Linus Torvalds
2001-05-26 0:07 ` Rik van Riel
@ 2001-05-26 0:29 ` Ben LaHaise
2001-05-26 0:34 ` Linus Torvalds
2001-05-26 0:42 ` [with-PATCH-really] highmem deadlock removal, balancing & cleanup Andrea Arcangeli
1 sibling, 2 replies; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26 0:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, Rik van Riel, linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:

> So I think I'll buy some experimentation. That HIGHMEM change is too ugly
> to live, though, I'd really like to hear more about why something that
> ugly is necessary.

Highmem systems currently manage to hang themselves quite completely upon
running out of memory in the normal zone. One of the failure modes is
looping in __alloc_pages from get_unused_buffer_head to map a dirty page.
Another results in looping on allocation of a bounce page for writing a
dirty highmem page back to disk.

Also, note that the current highmem design for bounce buffers is
inherently unstable. Because all highmem pages that are currently in
flight must have bounce buffers allocated for them, we require a huge
amount of bounce buffers to guarentee progress while submitting io. The
-ac kernels have a patch from Ingo that provides private pools for bounce
buffers and buffer_heads. I went a step further and have a memory
reservation patch that provides for memory pools being reserved against a
particular zone. This is needed to prevent the starvation that irq
allocations can cause.

Some of these cleanups are 2.5 fodder, but we really need something in
2.4 right now, so...

-ben

^ permalink raw reply [flat|nested] 64+ messages in thread
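The private-pool approach Ben describes (which resembles what later became mempool_t) can be sketched in userspace: allocations first try the ordinary allocator and dip into a pre-reserved set of objects only when it fails, so bounce-buffer IO can always make forward progress. Everything below is an illustrative model, not the -ac implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal reservation-pool sketch: a fixed number of objects set
 * aside up front so allocation can always make progress. */
#define RESERVE 4

struct reserve_pool {
	void *obj[RESERVE];
	int navail;
};

static int pool_init(struct reserve_pool *p, size_t objsize)
{
	p->navail = 0;
	for (int i = 0; i < RESERVE; i++) {
		p->obj[i] = malloc(objsize);
		if (!p->obj[i])
			return -1;
		p->navail++;
	}
	return 0;
}

/* Try the ordinary allocator first; dip into the reserve only when
 * it fails, mirroring how a bounce-buffer pool backstops the page
 * allocator. try_alloc may be NULL to model total allocator failure. */
static void *pool_alloc(struct reserve_pool *p,
			void *(*try_alloc)(size_t), size_t objsize)
{
	void *obj = try_alloc ? try_alloc(objsize) : NULL;
	if (obj)
		return obj;
	if (p->navail > 0)
		return p->obj[--p->navail];
	return NULL;            /* caller must wait for a free */
}

static void pool_free(struct reserve_pool *p, void *obj)
{
	/* Refill the reserve first; only then give memory back. */
	if (p->navail < RESERVE)
		p->obj[p->navail++] = obj;
	else
		free(obj);
}
```

The key property is that frees refill the reserve before returning memory to the general allocator, so as long as in-flight IO eventually completes, the pool never stays empty.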
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:29 ` Ben LaHaise
@ 2001-05-26 0:34 ` Linus Torvalds
2001-05-26 0:38 ` Rik van Riel
2001-05-26 1:28 ` Linux-2.4.5 Linus Torvalds
1 sibling, 2 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 0:34 UTC (permalink / raw)
To: Ben LaHaise; +Cc: Alan Cox, Rik van Riel, linux-kernel

On Fri, 25 May 2001, Ben LaHaise wrote:
>
> Highmem systems currently manage to hang themselves quite completely upon
> running out of memory in the normal zone. One of the failure modes is
> looping in __alloc_pages from get_unused_buffer_head to map a dirty page.
> Another results in looping on allocation of a bounce page for writing a
> dirty highmem page back to disk.

That's not the part of the patch I object to - fixing that is fine.

What I object to it that it special-cases the zone names, even though that
doesn't necessarily make any sense at all. What about architectures that
have other zones? THAT is the kind of fundmanental design mistake that
special-casing DMA and NORMAL is horrible for.

alloc_pages() doesn't have that kind of problem. To alloc_pages(),
GFP_BUFFER is not "oh, DMA or NORMAL". There, it is simply "oh, use the
zonelist pointed to by GFP_BUFFER". No special casing, no stupid #ifdef
CONFIG_HIGHMEM.

THAT is what I object to.

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:34 ` Linus Torvalds
@ 2001-05-26 0:38 ` Rik van Riel
0 siblings, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-26 0:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Ben LaHaise, Alan Cox, linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:

> That's not the part of the patch I object to - fixing that is fine.
>
> What I object to it that it special-cases the zone names, even
> though that doesn't necessarily make any sense at all.

OK, I'll fix that part. Maybe even this weekend, before I
go on holidays ;)

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 64+ messages in thread
* Linux-2.4.5
2001-05-26 0:34 ` Linus Torvalds
2001-05-26 0:38 ` Rik van Riel
@ 2001-05-26 1:28 ` Linus Torvalds
2001-05-26 1:35 ` Linux-2.4.5 Rik van Riel
2001-05-26 1:39 ` Linux-2.4.5 Ben LaHaise
1 sibling, 2 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 1:28 UTC (permalink / raw)
To: Alan Cox, Andrea Arcangeli, Ben LaHaise; +Cc: Rik van Riel, linux-kernel

Ok, I applied Andrea's (nee Ingo's) version, as that one most clearly
attacked the real deadlock cause. It's there as 2.4.5 now.

I'm going to be gone in Japan for the next week (leaving tomorrow
morning), so please don't send me patches - I won't be able to react to
them anyway. Consider the -ac series and the kernel mailing list the
regular communications channels..

Thanks,

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5
2001-05-26 1:28 ` Linux-2.4.5 Linus Torvalds
@ 2001-05-26 1:35 ` Rik van Riel
2001-05-26 1:39 ` Linux-2.4.5 Ben LaHaise
0 siblings, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-26 1:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, Andrea Arcangeli, Ben LaHaise, linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:

> Ok, I applied Andrea's (nee Ingo's) version, as that one most clearly
> attacked the real deadlock cause. It's there as 2.4.5 now.

But only for highmem bounce buffers. Normal GFP_BUFFER
allocations can still deadlock.

> I'm going to be gone in Japan for the next week

Oh well, I guess people can always run the -ac kernel ;)

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5
2001-05-26 1:28 ` Linux-2.4.5 Linus Torvalds
2001-05-26 1:35 ` Linux-2.4.5 Rik van Riel
@ 2001-05-26 1:39 ` Ben LaHaise
2001-05-26 1:59 ` Linux-2.4.5 Andrea Arcangeli
1 sibling, 1 reply; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26 1:39 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, Andrea Arcangeli, Rik van Riel, linux-kernel

Sorry, this doesn't fix the problem. It still hangs on highmem machines.
Try running cerberus on a PAE kernel sometime.

-ben

On Fri, 25 May 2001, Linus Torvalds wrote:

> Ok, I applied Andrea's (nee Ingo's) version, as that one most clearly
> attacked the real deadlock cause. It's there as 2.4.5 now.
>
> I'm going to be gone in Japan for the next week (leaving tomorrow
> morning), so please don't send me patches - I won't be able to react to
> them anyway. Consider the -ac series and the kernel mailing list the
> regular communications channels..
>
> Thanks,
>
> Linus
>

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5
2001-05-26 1:39 ` Linux-2.4.5 Ben LaHaise
@ 2001-05-26 1:59 ` Andrea Arcangeli
2001-05-26 2:11 ` Linux-2.4.5 Ben LaHaise
0 siblings, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26 1:59 UTC (permalink / raw)
To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 09:39:36PM -0400, Ben LaHaise wrote:
> Sorry, this doesn't fix the problem. It still hangs on highmem machines.
> Try running cerberus on a PAE kernel sometime.

There can be more bugs of course, two patches I posted are only meant to
fix deadlocks in the allocation fail path of alloc_bounces() and the
second patch in getblk() allocation fail path, nothing more. Those are
strictly necessary fixes as far I can tell, and their implementation was
quite obviously right to my eyes.

Now if you send some debugging info with deadlocks you gets with 2.4.5
vanilla I will be certainly interested to found their source. Also Rik
just said to have a fix for other bugs in that area, I didn't checked
that part.

What I can say is that with my tree I didn't reproduced deadlocks highmem
related with cerberus but I'm using highmem emulation not real highmem.

Andrea

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5
2001-05-26 1:59 ` Linux-2.4.5 Andrea Arcangeli
@ 2001-05-26 2:11 ` Ben LaHaise
2001-05-26 2:38 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 15:41 ` Linux-2.4.5 Rik van Riel
0 siblings, 2 replies; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26 2:11 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> Now if you send some debugging info with deadlocks you gets with 2.4.5
> vanilla I will be certainly interested to found their source. Also Rik
> just said to have a fix for other bugs in that area, I didn't checked
> that part.

In the red hat tree, we fixed the problem by adding __GFP_FAIL to
GFP_BUFFER, as well as adding a yield to alloc_pages. Think of what
happens when you try to allocate a buffer_head to swap out a page when
you're out of memory. See below.

-ben

diff -ur v2.4.4-ac9/include/linux/mm.h work/include/linux/mm.h
--- v2.4.4-ac9/include/linux/mm.h Mon May 14 15:22:17 2001
+++ work/include/linux/mm.h Mon May 14 18:33:21 2001
@@ -528,7 +528,7 @@

 #define GFP_BOUNCE (__GFP_HIGH | __GFP_FAIL)
-#define GFP_BUFFER (__GFP_HIGH | __GFP_WAIT)
+#define GFP_BUFFER (__GFP_HIGH | __GFP_FAIL | __GFP_WAIT)
 #define GFP_ATOMIC (__GFP_HIGH)
 #define GFP_USER ( __GFP_WAIT | __GFP_IO)
 #define GFP_HIGHUSER ( __GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
diff -ur v2.4.4-ac9/mm/vmscan.c work/mm/vmscan.c
--- v2.4.4-ac9/mm/vmscan.c Mon May 14 14:57:00 2001
+++ work/mm/vmscan.c Mon May 14 16:43:05 2001
@@ -636,6 +636,12 @@
 	 */
 	shortage = free_shortage();
 	if (can_get_io_locks && !launder_loop && shortage) {
+		if (gfp_mask & __GFP_WAIT) {
+			__set_current_state(TASK_RUNNING);
+			current->policy |= SCHED_YIELD;
+			schedule();
+		}
+
 		launder_loop = 1;

 		/*

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 2:11 ` Linux-2.4.5 Ben LaHaise @ 2001-05-26 2:38 ` Andrea Arcangeli 2001-05-26 2:49 ` Linux-2.4.5 Ben LaHaise 2001-05-26 15:41 ` Linux-2.4.5 Rik van Riel 1 sibling, 1 reply; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 2:38 UTC (permalink / raw) To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel On Fri, May 25, 2001 at 10:11:40PM -0400, Ben LaHaise wrote: > On Sat, 26 May 2001, Andrea Arcangeli wrote: > > > Now if you send some debugging info with deadlocks you gets with 2.4.5 > > vanilla I will be certainly interested to found their source. Also Rik > > just said to have a fix for other bugs in that area, I didn't checked > > that part. > > In the red hat tree, we fixed the problem by adding __GFP_FAIL to > GFP_BUFFER, as well as adding a yield to alloc_pages. Think of what > happens when you try to allocate a buffer_head to swap out a page when > you're out of memory. See below. Allocating a buffer head during out of memory is always been deadlock prone, 2.2 always had the same problem too. I'm not sure why you are telling me this, I didn't changed anything that happens to be in the swapout path (besides removing the deadlock in create_bounces with evolution of first Ingo's patch but that is not specific to the swapout). I only changed the getblk path (which is not used by the swapout, at least unless you swapout on a file not on a blkdev, but even in that case the change is fine). About yelding interally to alloc_pages it would make sense if we had a private pool internally to the allocator, otherwise with current code you definitely want to return a faliure fast to the caller so the caller will try the private pool before the yield. btw in the below patch __GFP_FAIL is a noop. 
>
> 		-ben
>
> diff -ur v2.4.4-ac9/include/linux/mm.h work/include/linux/mm.h
> --- v2.4.4-ac9/include/linux/mm.h	Mon May 14 15:22:17 2001
> +++ work/include/linux/mm.h	Mon May 14 18:33:21 2001
> @@ -528,7 +528,7 @@
>
>  #define GFP_BOUNCE	(__GFP_HIGH | __GFP_FAIL)
> -#define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT)
> +#define GFP_BUFFER	(__GFP_HIGH | __GFP_FAIL | __GFP_WAIT)
>  #define GFP_ATOMIC	(__GFP_HIGH)
>  #define GFP_USER	( __GFP_WAIT | __GFP_IO)
>  #define GFP_HIGHUSER	( __GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
> diff -ur v2.4.4-ac9/mm/vmscan.c work/mm/vmscan.c
> --- v2.4.4-ac9/mm/vmscan.c	Mon May 14 14:57:00 2001
> +++ work/mm/vmscan.c	Mon May 14 16:43:05 2001
> @@ -636,6 +636,12 @@
>  	 */
>  	shortage = free_shortage();
>  	if (can_get_io_locks && !launder_loop && shortage) {
> +		if (gfp_mask & __GFP_WAIT) {
> +			__set_current_state(TASK_RUNNING);
> +			current->policy |= SCHED_YIELD;
> +			schedule();
> +		}
> +
>  		launder_loop = 1;
>
>  		/*

Andrea

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 2:38 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 2:49 ` Ben LaHaise 2001-05-26 3:11 ` Linux-2.4.5 Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26 2:49 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> Allocating a buffer head while out of memory has always been deadlock
> prone; 2.2 always had the same problem too.  I'm not sure why you are
> telling me this: I didn't change anything that happens to be in the
> swapout path (besides removing the deadlock in create_bounces with an
> evolution of Ingo's first patch, but that is not specific to the
> swapout).  I only changed the getblk path (which is not used by the
> swapout, at least unless you swap out to a file rather than a blkdev,
> but even in that case the change is fine).

Highmem.  0 free pages in ZONE_NORMAL.  Now try to allocate a
buffer_head.  Running under heavy load runs into this even after there
is a highmem bounce buffer pool.

> btw in the below patch __GFP_FAIL is a noop.

<shrug> Merge that patch from -ac then.

		-ben

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 2:49 ` Linux-2.4.5 Ben LaHaise @ 2001-05-26 3:11 ` Andrea Arcangeli 2001-05-26 4:22 ` Linux-2.4.5 Linus Torvalds 2001-05-26 4:45 ` Linux-2.4.5 Rik van Riel 0 siblings, 2 replies; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26 3:11 UTC (permalink / raw)
To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 10:49:38PM -0400, Ben LaHaise wrote:
> Highmem.  0 free pages in ZONE_NORMAL.  Now try to allocate a buffer_head.

That's a longstanding deadlock; it was there the first time I read
fs/buffer.c.  It has nothing to do with highmem: we have it in 2.2 too.
getblk is deadlock prone in a similar manner as well.

Can you try to simply change NR_RESERVED to, say, 200*MAX_BUF_PER_PAGE
and see if it makes a difference?  The unused_list logic doesn't give a
guarantee either, and it's one of the "hiding" logics, but it usually
worked pretty well.  Maybe something changed that needs more than
2 pages (16 bh) reserved?

Andrea

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 3:11 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 4:22 ` Linus Torvalds 2001-05-26 4:31 ` Linux-2.4.5 Rik van Riel ` (2 more replies) 2001-05-26 4:45 ` Linux-2.4.5 Rik van Riel 1 sibling, 3 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 4:22 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Ben LaHaise, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:
>
> On Fri, May 25, 2001 at 10:49:38PM -0400, Ben LaHaise wrote:
> > Highmem.  0 free pages in ZONE_NORMAL.  Now try to allocate a buffer_head.
>
> That's a longstanding deadlock, it was there the first time I read
> fs/buffer.c, nothing related to highmem, we have it in 2.2 too.  Also
> getblk is deadlock prone in a similar manner.

Indeed.

This is why we always leave a few pages free: exactly to allow nested
page allocators to steal the reserved pages that we keep around.  If
that deadlocks, then that's a separate issue altogether.

If people are able to trigger the "we run out of reserved pages"
behaviour under any load, that indicates that we either have too few
reserved pages per zone, or that we have a real thinko somewhere that
allows eating up the reserves we're supposed to have.

And yes, things like spraying the box really hard with network packets
can temporarily eat up the reserves, but that's why kswapd and friends
exist: to get them back.  But I could easily imagine that there are
more schedule points missing that could cause user mode to not get to
run much.  Andrea fixed one, I think.

If there are more situations like this, please get a stack trace on the
deadlock (fairly easy these days with ctrl-scrolllock), and let's think
about what makes for the nasty pattern in the first place instead of
trying to add more "reserved" pages.

For example, maybe we can use HIGHMEM pages more aggressively in some
places, to avoid the case where we're only freeing HIGHMEM pages and
making the non-HIGHMEM case just worse.
For example, we used to have logic in swapout_process to _not_ swap out zones that don't need it. We changed swapout to happen in "page_launder()", but that logic got lost. It's entirely possible that we should just say "don't bother writing out dirty pages that are in zones that have no memory pressure", so that we don't use up pages from the _precious_ zones to free pages in zones that don't need freeing. So don't try to paper over this by making up new rules. We should think about WHY the problem happens in the first place, not about trying to fix it once it has happened (and see the above as an example of why it might be happening). But sometimes the right solution is just to have more reserves. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:22 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 4:31 ` Rik van Riel 2001-05-26 8:10 ` Linux-2.4.5 Linus Torvalds 2001-05-26 9:18 ` Linux-2.4.5 arjan 2001-05-26 14:18 ` Linux-2.4.5 Andrea Arcangeli 2 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 4:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel On Fri, 25 May 2001, Linus Torvalds wrote: > This is why we always leave a few pages free, exactly to allow nested > page allocators to steal the reserved pages that we keep around. If > that deadlocks, then that's a separate issue altogether. > > If people are able to trigger the "we run out of reserved pages" > behaviour under any load, that indicates that we either have too few > reserved pages per zone, or that we have a real thinko somewhere that > allows eating up the reserves we're supposed to have. This is exactly what gets fixed in the patch I sent you. Feel free to reimplement it in a complex way if you want ;) > But sometimes the right solution is just to have more reserves. Won't work if the ethernet card is allocating memory at gigabit speed. And no, my patch won't protect against this thing either, only memory reservations can. All my patch does is give us a 2.4 kernel now which doesn't hang immediately as soon as you run on highmem machines with a heavy swapping load. regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:31 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 8:10 ` Linus Torvalds 2001-05-26 9:01 ` Linux-2.4.5 Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 8:10 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Rik van Riel wrote:
>
> This is exactly what gets fixed in the patch I sent you.

This is not AT ALL what your patch does.  99% of your patch does other
things, including playing games with the "balance dirty" threshold
etc., in ways that make it look very much like the standard dbench kind
of load cannot use HIGHMEM for dirty buffers very effectively.

There's _one_ line of your patch that special-cases GFP_BUFFER in
page_alloc and protects the reserves, but the point is that they
shouldn't need protecting: there's something else wrong that makes them
get depleted in the first place.

Did you read my email?

		Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 8:10 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 9:01 ` Linus Torvalds 0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 9:01 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Linus Torvalds wrote:
>
> There's _one_ line of your patch that special-cases GFP_BUFFER in
> page_alloc and protects the reserves, but the point is that they
> shouldn't need protecting: there's something else wrong that makes them
> get depleted in the first place.

In fact, there seems to be a major confusion about the use of
GFP_BUFFER around the kernel.  The kernel uses GFP_BUFFER in a few
places:

 - grow_buffers(), as called from bread() and friends.  Here it is
   correct: we use GFP_BUFFER because we must not cause a deadlock on
   various filesystem data structures, and when a filesystem does a
   bread() we'd better _never_ cause a writeout that could cause some
   nasty FS lock self-deadlock.

 - creating buffer heads on pages for the page cache uses SLAB_BUFFER.
   This is bogus.  It should use SLAB_KERNEL here, because this is not
   called by low-level filesystems, it's called by the VM layer, and we
   don't hold any magic locks or anything like that.

The code actually has some comments to this effect, but it confuses the
issue of "async" and "sync", and that confusion makes the code (a) not
dare do the right thing (there's an attempt that is #ifdef'ed out), and
(b) even the right thing is kind of confused.

I'm leaving for Japan, so this is the last I'll write on this, but as a
request-for-discussion, what about this patch?  Does it help?  It
should decrease the GFP_BUFFER pressure noticeably (yeah, I removed too
much of the error handling code, but it shouldn't be all that
noticeable when using SLAB_KERNEL, as the allocations should succeed
until the machine is _so_ low on memory that it is truly dead).
Now, this is obviously untested and does a bit too much surgery, but I
really think that the reason for the deadlock is that buffer allocation
itself does things wrong, not so much that we should try to keep
infinite reserves.

Anybody want to play with these kinds of approaches?  Fixing the
underlying problems instead of trying to hide them with special reserve
cases...

		Linus

-----
--- v2.4.5/linux/fs/buffer.c	Fri May 25 18:28:55 2001
+++ linux/fs/buffer.c	Sat May 26 01:52:43 2001
@@ -1206,7 +1206,7 @@
  * no-buffer-head deadlock.  Return NULL on failure; waiting for
  * buffer heads is now handled in create_buffers().
  */
-static struct buffer_head * get_unused_buffer_head(int async)
+static struct buffer_head * get_unused_buffer_head(int can_do_io)
 {
 	struct buffer_head * bh;

@@ -1220,46 +1220,7 @@
 	}
 	spin_unlock(&unused_list_lock);

-	/* This is critical.  We can't swap out pages to get
-	 * more buffer heads, because the swap-out may need
-	 * more buffer-heads itself.  Thus SLAB_BUFFER.
-	 */
-	if((bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER)) != NULL) {
-		bh->b_blocknr = -1;
-		bh->b_this_page = NULL;
-		return bh;
-	}
-
-	/*
-	 * If we need an async buffer, use the reserved buffer heads.
-	 */
-	if (async) {
-		spin_lock(&unused_list_lock);
-		if (unused_list) {
-			bh = unused_list;
-			unused_list = bh->b_next_free;
-			nr_unused_buffer_heads--;
-			spin_unlock(&unused_list_lock);
-			return bh;
-		}
-		spin_unlock(&unused_list_lock);
-	}
-#if 0
-	/*
-	 * (Pending further analysis ...)
-	 * Ordinary (non-async) requests can use a different memory priority
-	 * to free up pages. Any swapping thus generated will use async
-	 * buffer heads.
-	 */
-	if(!async &&
-	   (bh = kmem_cache_alloc(bh_cachep, SLAB_KERNEL)) != NULL) {
-		memset(bh, 0, sizeof(*bh));
-		init_waitqueue_head(&bh->b_wait);
-		return bh;
-	}
-#endif
-
-	return NULL;
+	return kmem_cache_alloc(bh_cachep, can_do_io ? SLAB_KERNEL : SLAB_BUFFER);
 }

 void set_bh_page (struct buffer_head *bh, struct page *page, unsigned long offset)
@@ -1285,16 +1246,16 @@
  * from ordinary buffer allocations, and only async requests are allowed
  * to sleep waiting for buffer heads.
  */
-static struct buffer_head * create_buffers(struct page * page, unsigned long size, int async)
+static struct buffer_head * create_buffers(struct page * page, unsigned long size, int can_do_io)
 {
 	struct buffer_head *bh, *head;
 	long offset;

-try_again:
 	head = NULL;
 	offset = PAGE_SIZE;
 	while ((offset -= size) >= 0) {
-		bh = get_unused_buffer_head(async);
+		bh = get_unused_buffer_head(can_do_io);
+		/* Can return failure in the "!can_do_io" case */
 		if (!bh)
 			goto no_grow;

@@ -1331,29 +1292,7 @@
 		wake_up(&buffer_wait);
 	}

-	/*
-	 * Return failure for non-async IO requests.  Async IO requests
-	 * are not allowed to fail, so we have to wait until buffer heads
-	 * become available.  But we don't want tasks sleeping with
-	 * partially complete buffers, so all were released above.
-	 */
-	if (!async)
-		return NULL;
-
-	/* We're _really_ low on memory. Now we just
-	 * wait for old buffer heads to become free due to
-	 * finishing IO.  Since this is an async request and
-	 * the reserve list is empty, we're sure there are
-	 * async buffer heads in use.
-	 */
-	run_task_queue(&tq_disk);
-
-	/*
-	 * Set our state for sleeping, then check again for buffer heads.
-	 * This ensures we won't miss a wake_up from an interrupt.
-	 */
-	wait_event(buffer_wait, nr_unused_buffer_heads >= MAX_BUF_PER_PAGE);
-	goto try_again;
+	return NULL;
 }

 static void unmap_buffer(struct buffer_head * bh)
@@ -1425,7 +1364,6 @@
 {
 	struct buffer_head *bh, *head, *tail;

-	/* FIXME: create_buffers should fail if there's no enough memory */
 	head = create_buffers(page, blocksize, 1);
 	if (page->buffers)
 		BUG();

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:22 ` Linux-2.4.5 Linus Torvalds 2001-05-26 4:31 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 9:18 ` arjan 2001-05-26 14:18 ` Linux-2.4.5 Andrea Arcangeli 2 siblings, 0 replies; 64+ messages in thread From: arjan @ 2001-05-26 9:18 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel In article <Pine.LNX.4.21.0105252107010.1520-100000@penguin.transmeta.com> you wrote: > If there are more situations like this, please get a stack trace on the > deadlock (fairly easy these days with ctrl-scrolllock), and let's think > about what make for the nasty pattern in the first place instead of trying > to add more "reserved" pages. What I've been seeing so far is that, say, 16 processes all need memory, and start to write out stuff. All only half succeed, eg they allocate memory for the writeout but don't actually get enough and sleep for more. Now that isn't too much of a problem UNTIL kswapd is one of those..... ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:22 ` Linux-2.4.5 Linus Torvalds 2001-05-26 4:31 ` Linux-2.4.5 Rik van Riel 2001-05-26 9:18 ` Linux-2.4.5 arjan @ 2001-05-26 14:18 ` Andrea Arcangeli 2001-05-26 14:21 ` Linux-2.4.5 Rik van Riel 2 siblings, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26 14:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Ben LaHaise, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 09:22:05PM -0700, Linus Torvalds wrote:
> per zone, or that we have a real thinko somewhere that allows eating up

I think I found the real thinko behind the other create_buffers
deadlock experienced by Ben.  I suspect this fix is enough to address
the deadlock; this was a real bug, not just a hiding factor (reserved
bh's aren't released by irqs, the bounces free memory, memory balancing
stops, no bh is released, and we deadlock in wait_event).  I also
increased the number of reserved bh's a bit, just in case:

--- 2.4.5pre6aa1/fs/buffer.c.~1~	Fri May 25 04:57:46 2001
+++ 2.4.5pre6aa1/fs/buffer.c	Sat May 26 16:15:03 2001
@@ -61,7 +61,7 @@
 #define BUFSIZE_INDEX(X) ((int) buffersize_index[(X)>>9])
 #define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)

-#define NR_RESERVED (2*MAX_BUF_PER_PAGE)
+#define NR_RESERVED (10*MAX_BUF_PER_PAGE)
 #define MAX_UNUSED_BUFFERS NR_RESERVED+20 /* don't ever have more than this
					     number of unused buffer heads */
@@ -1416,11 +1416,9 @@
 	 */
 	run_task_queue(&tq_disk);

-	/*
-	 * Set our state for sleeping, then check again for buffer heads.
-	 * This ensures we won't miss a wake_up from an interrupt.
-	 */
-	wait_event(buffer_wait, nr_unused_buffer_heads >= MAX_BUF_PER_PAGE);
+	current->policy |= SCHED_YIELD;
+	__set_current_state(TASK_RUNNING);
+	schedule();
 	goto try_again;
 }

Please, people with >1G machines: try to reproduce the deadlock with
cerberus on top of 2.4.5 plus the above patch.
I didn't check the other alloc_pages() thing mentioned by Ben; if
alloc_pages() deadlocks internally, that's yet another completely
orthogonal bug, and it will be addressed by another patch if it
persists.

Andrea

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:18 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 14:21 ` Rik van Riel 2001-05-26 14:38 ` Linux-2.4.5 Andrea Arcangeli 2001-05-26 15:09 ` Linux-2.4.5 Linus Torvalds 0 siblings, 2 replies; 64+ messages in thread From: Rik van Riel @ 2001-05-26 14:21 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > I didn't checked the alloc_pages() other thing mentioned by Ben, if > alloc_pages() deadlocks internally that's yet another completly > orthogonal bug and that will be addressed by another patch if it > persists. O, that part is fixed by the patch that Linus threw away yesterday ;) Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:21 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 14:38 ` Andrea Arcangeli 2001-05-26 14:40 ` Linux-2.4.5 Rik van Riel 2001-05-26 15:09 ` Linux-2.4.5 Linus Torvalds 1 sibling, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26 14:38 UTC (permalink / raw)
To: Rik van Riel; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel

On Sat, May 26, 2001 at 11:21:18AM -0300, Rik van Riel wrote:
> On Sat, 26 May 2001, Andrea Arcangeli wrote:
>
> > I didn't check the other alloc_pages() thing mentioned by Ben; if
> > alloc_pages() deadlocks internally, that's yet another completely
> > orthogonal bug, and it will be addressed by another patch if it
> > persists.
>
> O, that part is fixed by the patch that Linus threw away
> yesterday ;)

What are you smoking?  I read your patch and there's nothing related to
such a fix in your patch.

I won't have much more time to follow up on this thread.  From now on,
on the topic of the highmem deadlocks, I will only listen to real world
bugreports of 2.4.5 plus the bugfix I posted a few minutes ago.  If
anybody is able to reproduce the deadlocks, press SYSRQ+T and send me
the output along with the System.map.  Thanks in advance for your
cooperation.

Andrea

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:38 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 14:40 ` Rik van Riel 2001-05-26 15:17 ` Linux-2.4.5 Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 14:40 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > On Sat, May 26, 2001 at 11:21:18AM -0300, Rik van Riel wrote: > > On Sat, 26 May 2001, Andrea Arcangeli wrote: > > > > > I didn't checked the alloc_pages() other thing mentioned by Ben, if > > > alloc_pages() deadlocks internally that's yet another completly > > > orthogonal bug and that will be addressed by another patch if it > > > persists. > > > > O, that part is fixed by the patch that Linus threw away > > yesterday ;) > > what are you smoking? I am smoking the "tested the patch and wasn't able to reproduce a deadlock" stuff. > I read your patch and there's nothing related to > such fix in your patch. I explained the thing to you about 5 times now. If you don't have the time to understand the deadlock, please don't try to confuse the issue by sending out non-fixes. Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:40 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:17 ` Linus Torvalds 2001-05-26 15:28 ` Linux-2.4.5 Rik van Riel 0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 15:17 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Rik van Riel wrote:
>
> I am smoking the "tested the patch and wasn't able to reproduce
> a deadlock" stuff.

I'd be happier if _anybody_ was smoking the "thought about the problem
some more" stuff as well.

I can easily imagine that this part of your patch

	if (gfp_mask & __GFP_WAIT) {
		memory_pressure++;
-		try_to_free_pages(gfp_mask);
-		goto try_again;
+		if (!order || free_shortage()) {
+			int progress = try_to_free_pages(gfp_mask);
+			if (progress || gfp_mask & __GFP_IO)
+				goto try_again;
+		}
	}

is fine.  The fact that other parts of your patch were NOT fine should
make you go "Hmm, maybe Linus dismissed it for a reason" instead.

Testing is good.  But I want to understand how we get into the
situation in the first place, and whether there are ways to alleviate
those problems too.  Testing and finding that the bug went away under
your workload is good.  But thinking that that should stop discussion
about the _proper_ fix is stupid, Rik.

		Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:17 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 15:28 ` Rik van Riel 2001-05-26 15:59 ` Linux-2.4.5 Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2001-05-26 15:28 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Linus Torvalds wrote:

> 	if (gfp_mask & __GFP_WAIT) {
> 		memory_pressure++;
> -		try_to_free_pages(gfp_mask);
> -		goto try_again;
> +		if (!order || free_shortage()) {
> +			int progress = try_to_free_pages(gfp_mask);
> +			if (progress || gfp_mask & __GFP_IO)
> +				goto try_again;
> +		}
> 	}

Yes, this is it.

> Testing is good. But I want to understand how we get into the
> situation in the first place, and whether there are ways to alleviate
> those problems too.

As I said: create_buffers() -> get_unused_buffer_head()
-> __alloc_pages() -> loop infinitely.

Your simplification of get_unused_buffer_head() fits in nicely with
this.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:28 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:59 ` Linus Torvalds 2001-05-26 22:12 ` Linux-2.4.5 Marcelo Tosatti 0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 15:59 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Rik van Riel wrote:
>
> > Testing is good. But I want to understand how we get into the
> > situation in the first place, and whether there are ways to alleviate
> > those problems too.
>
> As I said create_buffers() -> get_unused_buffer_head()
> -> __alloc_pages() -> loop infinitely.

No no no.

Get outside the small world of where it's looping.  Think more about
the problem of "how did we get into such a state that we're so low on
memory for swapouts in the FIRST PLACE?"

That's the problem I want to fix.  And I suspect part of the equation
is exactly the fact that we use SLAB_BUFFER for the buffer heads.

Example scenario:
 - we're low on memory, but not disastrously so.  We have lots of
   highmem pages, but not that much NORMAL.  But we're not
   uncomfortable yet.
 - somebody starts writing out to a file.  He nicely allocates HIGHMEM
   pages for the actual data (no memory pressure on HIGHMEM, so no
   "try_to_free_pages()" called at all).
 - the writer _also_ needs the buffer heads for those written pages (ie
   "generic_block_prepare_write()"), and here he allocates them with
   SLAB_BUFFER.  The NORMAL zone starts getting low, and we start
   calling "try_to_free_pages()".
 - we deplete "try_to_free_pages()" and start swapping.  Except
   everybody is calling "try_to_free_pages()" with GFP_BUFFER, so
   nobody can actually do any IO, and if we're unlucky there's not much
   to be free'd in the NORMAL zone.

Now do you start to see a pattern here?

You're trying to fix the symptoms, by attacking the final end.  And
what I've been trying to say is that this problem likely has a
higher-level _cause_, and I want that _cause_ fixed.  Not the symptoms.
Now, I suspect that part of the cause for this is the fact that we're using GFP_BUFFER where we shouldn't, which is why we cannot free stuff up properly in the NORMAL zone. Now, there could be other things going on too. I'm not saying that it is necessarily _just_ the experimental silly test-patch I sent out yesterday. But I _am_ saying that we should try to fix the root cause first, and then apply the page_alloc.c thing as a last-ditch effort (but I'd like that "last effort" to be something that nobody can trigger under any reasonable load). Do you NOW see my argument? The argument of "we shouldn't have gotten us into this situation in the first place". See? Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:59 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 22:12 ` Marcelo Tosatti 2001-05-27 6:53 ` Linux-2.4.5 Marcelo Tosatti 0 siblings, 1 reply; 64+ messages in thread
From: Marcelo Tosatti @ 2001-05-26 22:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rik van Riel, Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Linus Torvalds wrote:
>
> On Sat, 26 May 2001, Rik van Riel wrote:
> >
> > > Testing is good. But I want to understand how we get into the
> > > situation in the first place, and whether there are ways to alleviate
> > > those problems too.
> >
> > As I said create_buffers() -> get_unused_buffer_head()
> > -> __alloc_pages() -> loop infinitely.
>
> No no no.
>
> Get outside the small world of where it's looping.

There are two problems here.

It is senseless to make GFP_BUFFER allocations loop inside
__alloc_pages() (calling try_to_free_pages()) because these allocations
_can't_ block.  If there is no free memory, they are going to loop.

This is "independent" from the highmem deadlock.  See?

Now let's talk about the highmem deadlock.

> Think more on the problem of "how did we get into such a state that we're
> so low on memory for swapouts in the FIRST PLACE?"
>
> That's the problem I want to fix. And I suspect part of the equation is
> exactly the fact that we use SLAB_BUFFER for the buffer heads.
>
> Example scenario:
> - we're low on memory, but not disastrously so. We have lots of highmem
>   pages, but not that much NORMAL. But we're not uncomfortable yet.
> - somebody starts writing out to a file. He nicely allocates HIGHMEM
>   pages for the actual data (no memory pressure on HIGHMEM, so no
>   "try_to_free_pages()" called at all)
> - the writer _also_ needs the buffer heads for those written pages (ie
>   "generic_block_prepare_write()"), and here he allocates them with
>   SLAB_BUFFER. The NORMAL zone starts getting low, and we start calling
>   "try_to_free_pages()".
> - we deplete "try_to_free_pages()" and start swapping. Except everybody
>   is calling "try_to_free_pages()" with GFP_BUFFER, so nobody can
>   actually do any IO, and if we're unlucky there's not much to be free'd
>   in the NORMAL zone.
>
> Now do you start to see a pattern here?
>
> You're trying to fix the symptoms, by attacking the final end. And what
> I've been trying to say is that this problem likely has a higher-level
> _cause_, and I want that _cause_ fixed. Not the symptoms.

You are not going to fix the problem by _only_ doing this bh allocation
change.

Even if some bh allocators _can_ block on IO, there is no guarantee
that they are going to free low memory --- they may start more IO on
highmem pages.

We cannot treat highmem as "yet another zone".  All highmem data goes
through lowmem before reaching the disk, so it's clear to me that we
should not try to write out highmem pages in case we have a lowmem
shortage.

Well, IMO.

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 22:12 ` Linux-2.4.5 Marcelo Tosatti @ 2001-05-27 6:53 ` Marcelo Tosatti 2001-06-03 23:32 ` Linux-2.4.5 Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread From: Marcelo Tosatti @ 2001-05-27 6:53 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, Ben LaHaise, Alan Cox, lkml On Sat, 26 May 2001, Marcelo Tosatti wrote: > > You're trying to fix the symptoms, by attacking the final end. And what > > I've been trying to say is that this problem likely has a higher-level > > _cause_, and I want that _cause_ fixed. Not the symptoms. > > You are not going to fix the problem by _only_ doing this bh allocation > change. > > Even if some bh allocators _can_ block on IO, there is no guarantee that > they are going to free low memory --- they may start more IO on highmem > pages. > > We cannot treat highmem as "yet another zone" zone. All highmem data goes > through the lowmem before reaching the disk, so its clear for me that we > should not try to write out highmem pages in case we have a lowmem > shortage. > > Well, IMO. I've just tried something similar to the attached patch, which is a "more aggressive" version of your suggestion to use SLAB_KERNEL for bh allocations when possible. This one makes all bh allocators block on IO. I've tested multiple "dd if=/dev/zero of=bigfile..." calls. (8GB machine, different amounts of data being written) The patch makes the kernel handle heavier loads much better than .5 and .5aa, but it does not fix the problem for this specific test. 
(this one is not against 2.4.5, but what I tested under .5 is basically
the same patch)

diff -Nur linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c	Mon Apr  2 23:41:08 2001
+++ linux/mm/vmscan.c	Tue Apr  3 01:53:13 2001
@@ -422,7 +422,7 @@
 int page_launder(int gfp_mask, int sync)
 {
 	int launder_loop, maxscan, cleaned_pages, maxlaunder;
-	int can_get_io_locks;
+	int can_get_io_locks, can_queue_buffers;
 	struct list_head * page_lru;
 	struct page * page;

@@ -431,6 +431,7 @@
 	 * buffers to disk) if __GFP_IO is set.
 	 */
 	can_get_io_locks = gfp_mask & __GFP_IO;
+	can_queue_buffers = gfp_mask & __GFP_PAGE_IO;

 	launder_loop = 0;
 	maxlaunder = 0;
@@ -482,7 +483,7 @@
 			goto page_active;

 		/* First time through? Move it to the back of the list */
-		if (!launder_loop) {
+		if (!launder_loop || can_queue_buffers) {
 			list_del(page_lru);
 			list_add(page_lru, &inactive_dirty_list);
 			UnlockPage(page);
@@ -612,7 +613,8 @@
 	 * loads, flush out the dirty pages before we have to wait on
 	 * IO.
 	 */
-	if (can_get_io_locks && !launder_loop && free_shortage()) {
+	if ((can_queue_buffers || can_get_io_locks) && !launder_loop
+			&& free_shortage()) {
 		launder_loop = 1;
 		/* If we cleaned pages, never do synchronous IO. */
 		if (cleaned_pages)
diff -Nur linux.orig/fs/buffer.c linux/fs/buffer.c
--- linux.orig/fs/buffer.c	Mon Apr  2 23:40:59 2001
+++ linux/fs/buffer.c	Tue Apr  3 01:54:26 2001
@@ -1231,7 +1231,7 @@
 	 * more buffer heads, because the swap-out may need
 	 * more buffer-heads itself.  Thus SLAB_BUFFER.
 	 */
-	if((bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER)) != NULL) {
+	if((bh = kmem_cache_alloc(bh_cachep, SLAB_PAGE_IO)) != NULL) {
 		memset(bh, 0, sizeof(*bh));
 		init_waitqueue_head(&bh->b_wait);
 		return bh;
@@ -2261,7 +2261,7 @@
 		return 0;
 	}

-	page = alloc_page(GFP_BUFFER);
+	page = alloc_page(GFP_PAGE_IO);
 	if (!page)
 		goto out;
 	LockPage(page);
diff -Nur linux.orig/include/linux/mm.h linux/include/linux/mm.h
--- linux.orig/include/linux/mm.h	Mon Apr  2 23:41:09 2001
+++ linux/include/linux/mm.h	Tue Apr  3 01:49:29 2001
@@ -480,8 +480,9 @@
 #define __GFP_HIGHMEM	0x0 /* noop */
 #endif
 #define __GFP_VM	0x20
+#define __GFP_PAGE_IO	0x40

-
+#define GFP_PAGE_IO	(__GFP_HIGH | __GFP_WAIT | __GFP_PAGE_IO)
 #define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT)
 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_USER	( __GFP_WAIT | __GFP_IO)
diff -Nur linux.orig/include/linux/slab.h linux/include/linux/slab.h
--- linux.orig/include/linux/slab.h	Mon Apr  2 23:41:11 2001
+++ linux/include/linux/slab.h	Tue Apr  3 01:50:01 2001
@@ -15,6 +15,7 @@
 #include <linux/cache.h>

 /* flags for kmem_cache_alloc() */
+#define SLAB_PAGE_IO	GFP_PAGE_IO
 #define SLAB_BUFFER	GFP_BUFFER
 #define SLAB_ATOMIC	GFP_ATOMIC
 #define SLAB_USER	GFP_USER
@@ -22,7 +23,7 @@
 #define SLAB_NFS	GFP_NFS
 #define SLAB_DMA	GFP_DMA

-#define SLAB_LEVEL_MASK	(__GFP_WAIT|__GFP_HIGH|__GFP_IO)
+#define SLAB_LEVEL_MASK	(__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_PAGE_IO)
 #define SLAB_NO_GROW	0x00001000UL	/* don't grow a cache */

 /* flags to pass to kmem_cache_create(). */

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-27 6:53 ` Linux-2.4.5 Marcelo Tosatti @ 2001-06-03 23:32 ` Linus Torvalds 2001-06-05 2:21 ` Linux-2.4.5 Marcelo Tosatti 0 siblings, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2001-06-03 23:32 UTC (permalink / raw) To: Marcelo Tosatti Cc: Rik van Riel, Andrea Arcangeli, Ben LaHaise, Alan Cox, lkml [ Back from Japan - don't start sending me tons of emails yet, as you can see I'm only picking up last week's discussion where it ended right now.. ] On Sat-Sun, 26-27 May 2001, Marcelo Tosatti wrote: > > You are not going to fix the problem by _only_ doing this bh allocation > change. I would obviously not disagree with that statement. There are multiple users of the low-memory zone, and they are all likely to have some of the same problems. > Even if some bh allocators _can_ block on IO, there is no guarantee that > they are going to free low memory --- they may start more IO on highmem > pages. Now, this was actually something I already referred to earlier in this same thread, see one of my first mails about this: Fri, 25 May 2001 21:22:05 Linus Torvalds wrote: >> >> For example, we used to have logic in swapout_process to _not_ swap >> out zones that don't need it. We changed swapout to happen in >> "page_launder()", but that logic got lost. It's entirely possible that >> we should just say "don't bother writing out dirty pages that are in >> zones that have no memory pressure", so that we don't use up pages >> from the _precious_ zones to free pages in zones that don't need >> freeing. So note how there are multiple facets to this problem. Marcelo goes on to write: > > I've just tried something similar to the attached patch, which is a "more > aggressive" version of your suggestion to use SLAB_KERNEL for bh > allocations when possible. This one makes all bh allocators block on IO. The patch looks fine. Has anybody else tried it? 
Along with, for example, something like this [warning: whitespace damage, I just cut-and-pasted this as a test-patch], we might actually _fix_ the problem: --- v2.4.5/linux/mm/vmscan.c Fri May 25 18:28:55 2001 +++ linux/mm/vmscan.c Sun Jun 3 16:26:20 2001 @@ -463,6 +463,7 @@ /* Page is or was in use? Move it to the active list. */ if (PageReferenced(page) || page->age > 0 || + page->zone->free_pages > page->zone->pages_max || (!page->buffers && page_count(page) > 1) || page_ramdisk(page)) { del_page_from_inactive_dirty_list(page); What the above does is fairly obvious: it considers all pages in zones that don't need to be free'd to be "young", and doesn't even try to write them out. Because we have absolutely no reason to do so. This is similar to what we used to do in "try_to_swap_out()" before, and it still makes sense. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-06-03 23:32 ` Linux-2.4.5 Linus Torvalds @ 2001-06-05 2:21 ` Marcelo Tosatti 0 siblings, 0 replies; 64+ messages in thread From: Marcelo Tosatti @ 2001-06-05 2:21 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, Ben LaHaise, Alan Cox, lkml (ugh, just found this mail lost around here) On Sun, 3 Jun 2001, Linus Torvalds wrote: > > [ Back from Japan - don't start sending me tons of emails yet, as you can > see I'm only picking up last week's discussion where it ended right > now.. ] > > On Sat-Sun, 26-27 May 2001, Marcelo Tosatti wrote: > > > > You are not going to fix the problem by _only_ doing this bh allocation > > change. > > I would obviously not disagree with that statement. There are multiple > users of the low-memory zone, and they are all likely to have some of the > same problems. > > > Even if some bh allocators _can_ block on IO, there is no guarantee that > > they are going to free low memory --- they may start more IO on highmem > > pages. > > Now, this was actually something I already referred to earlier in this > same thread, see one of my first mails about this: > > Fri, 25 May 2001 21:22:05 Linus Torvalds wrote: > >> > >> For example, we used to have logic in swapout_process to _not_ swap > >> out zones that don't need it. We changed swapout to happen in > >> "page_launder()", but that logic got lost. It's entirely possible that > >> we should just say "don't bother writing out dirty pages that are in > >> zones that have no memory pressure", so that we don't use up pages > >> from the _precious_ zones to free pages in zones that don't need > >> freeing. > > So note how there are multiple facets to this problem. > > > Marcelo goes on to write: > > > > I've just tried something similar to the attached patch, which is a "more > > aggressive" version of your suggestion to use SLAB_KERNEL for bh > > allocations when possible. This one makes all bh allocators block on IO. > > The patch looks fine. 
Has anybody else tried it? The XFS people have been using GFP_PAGE_IO for some time in their CVS. (getblk is not using GFP_PAGE_IO there, though). > Along with, for example, something like this [warning: whitespace damage, > I just cut-and-pasted this as a test-patch], we might actually _fix_ the > problem: > > --- v2.4.5/linux/mm/vmscan.c Fri May 25 18:28:55 2001 > +++ linux/mm/vmscan.c Sun Jun 3 16:26:20 2001 > @@ -463,6 +463,7 @@ > > /* Page is or was in use? Move it to the active list. */ > if (PageReferenced(page) || page->age > 0 || > + page->zone->free_pages > page->zone->pages_max || > (!page->buffers && page_count(page) > 1) || > page_ramdisk(page)) { > del_page_from_inactive_dirty_list(page); > > What the above does is fairly obvious: it considers all pages in zones > that don't need to be free'd to be "young", and doesn't even try to write > them out. Because we have absolutely no reason to do so. This patch makes perfect sense, but it does not stop us from writing out highmem pages (_even_ if the highmem zone has a shortage) when lowmem is under shortage. We _need_ low memory to write out high memory, that's my point. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:21 ` Linux-2.4.5 Rik van Riel 2001-05-26 14:38 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 15:09 ` Linus Torvalds 2001-05-26 15:18 ` Linux-2.4.5 Rik van Riel 1 sibling, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2001-05-26 15:09 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel On Sat, 26 May 2001, Rik van Riel wrote: > > O, that part is fixed by the patch that Linus threw away > yesterday ;) Rik, I threw away the parts of the patch that were bad and obvious band-aids, and it was hard to tell whether any of your patch was a "real" fix as opposed to just making more reservations. And you have not replied to any of the real arguments for fixing the _real_ bugs. You just repeat "my patch, my patch, my patch", without even acknowledging that others are finding the real _underlying_ problems. Please take a moment to realize that you're not exactly being helpful. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:09 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 15:18 ` Rik van Riel 2001-05-26 15:24 ` Linux-2.4.5 Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 15:18 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel On Sat, 26 May 2001, Linus Torvalds wrote: > On Sat, 26 May 2001, Rik van Riel wrote: > > > > O, that part is fixed by the patch that Linus threw away > > yesterday ;) > > Rik, I threw away the parts of the patch that were bad and obvious > band-aids, and it was hard to tell whether any of your patch was a > "real" fix as opposed to just making more reservations. 1) Remove GFP_BUFFER and HIGHMEM related deadlocks, by letting these allocations fail instead of looping forever in __alloc_pages() when they cannot make any progress there. It's the changes to __alloc_pages(), where we don't loop forever but fail the allocation. Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:18 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:24 ` Andrea Arcangeli 2001-05-26 15:26 ` Linux-2.4.5 Rik van Riel 0 siblings, 1 reply; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 15:24 UTC (permalink / raw) To: Rik van Riel; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel On Sat, May 26, 2001 at 12:18:07PM -0300, Rik van Riel wrote: > It's the changes to __alloc_pages(), where we don't loop forever > but fail the allocation. __alloc_pages() should definitely not loop forever but it should fail the allocation instead. If it doesn't right now, that is yet another bug, at least with the 2.4 memory management design where we don't know if there is memory available or not when we do the allocation. I'd really appreciate it if you could re-post this strictly necessary localized fix; I will definitely agree with that one. Andrea ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:24 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 15:26 ` Rik van Riel 2001-05-26 15:40 ` Linux-2.4.5 Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 15:26 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > On Sat, May 26, 2001 at 12:18:07PM -0300, Rik van Riel wrote: > > It's the changes to __alloc_pages(), where we don't loop forever > > but fail the allocation. > > __alloc_pages() should definitely not loop forever but it should > fail the allocation instead. Guess what's in my patch? Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:26 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:40 ` Andrea Arcangeli 0 siblings, 0 replies; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 15:40 UTC (permalink / raw) To: Rik van Riel; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel On Sat, May 26, 2001 at 12:26:38PM -0300, Rik van Riel wrote: > Guess what's in my patch? that part is fine indeed; I have been saying for ages that alloc_pages must always be allowed to fail or things will deadlock, go back to the discussion with Ingo of a few months ago. Finally you seem convinced about that. Andrea ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 3:11 ` Linux-2.4.5 Andrea Arcangeli 2001-05-26 4:22 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 4:45 ` Rik van Riel 2001-05-26 4:47 ` Linux-2.4.5 Rik van Riel 2001-05-26 14:32 ` Linux-2.4.5 Andrea Arcangeli 1 sibling, 2 replies; 64+ messages in thread From: Rik van Riel @ 2001-05-26 4:45 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > On Fri, May 25, 2001 at 10:49:38PM -0400, Ben LaHaise wrote: > > Highmem. 0 free pages in ZONE_NORMAL. Now try to allocate a buffer_head. > > That's a longstanding deadlock, it was there the first time I read > fs/buffer.c, nothing related to highmem, we have it in 2.2 too. Also > getblk is deadlock prone in a similar manner. Not only this, but the fix is _surprisingly_ simple... All we need to make sure of is that the following order of allocations is possible and that we back out instead of deadlocking when we don't succeed at any step. 1) normal user allocation 2) buffer allocation (bounce buffer + bufferhead) 3) allocation from interrupt (for device driver) This is fixed by the patch I sent because: 1) user allocations stop when we reach zone->pages_min and keep looping until we freed some memory ... well, these don't just loop because we can guarantee that freeing memory with GFP_USER or GFP_KERNEL is possible 2) GFP_BUFFER allocations can allocate down to the point where free pages go to zone->pages_min * 3 / 4, so we can continue to swapout highmem pages when userland allocations have stopped ... 
this is needed because otherwise we cannot always make progress on highmem pages and we could have the effective amount of RAM in the system reduced to less than 1GB, in really bad cases not having this could even cause a deadlock 3) If the device driver needs to allocate something, it has from zone->pages_min*3/4 down to zone->pages_min/4 space to allocate stuff, this should be very useful for swap or mmap() over the network, or to encrypted block devices, etc... > Can you try to simply change NR_RESERVED to say 200*MAX_BUF_PER_PAGE > and see if it makes a difference? No Comment(tm) *grin* cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:45 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 4:47 ` Rik van Riel 2001-05-26 6:07 ` Linux-2.4.5 Ben LaHaise 2001-05-26 14:32 ` Linux-2.4.5 Andrea Arcangeli 1 sibling, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 4:47 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, 26 May 2001, Rik van Riel wrote: > 1) normal user allocation > 2) buffer allocation (bounce buffer + bufferhead) > 3) allocation from interrupt (for device driver) Hmmmm, now that I think of it, we always need to be able to guarantee _both_ 2) and 3). For different allocators and interrupts. I guess the only long-term solution is to have real memory reservation strategies, like the one in Ben's patch. That might be a 2.5 thing, though ... Ben? regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:47 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 6:07 ` Ben LaHaise 0 siblings, 0 replies; 64+ messages in thread From: Ben LaHaise @ 2001-05-26 6:07 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-kernel On Sat, 26 May 2001, Rik van Riel wrote: > That might be a 2.5 thing, though ... Ben? It isn't an immediate priority for me at this moment, but in a few weeks I hope to have it back at the top of my concerns. If anyone considers it truly important, I might be persuaded to give it a bit more nudging along, but if people try to explain to me that the concept is completely unnecessary, I'll redirect their notes to /dev/null. -ben ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:45 ` Linux-2.4.5 Rik van Riel 2001-05-26 4:47 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 14:32 ` Andrea Arcangeli 2001-05-26 14:36 ` Linux-2.4.5 Rik van Riel 1 sibling, 1 reply; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 14:32 UTC (permalink / raw) To: Rik van Riel; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, May 26, 2001 at 01:45:27AM -0300, Rik van Riel wrote: > 3) If the device driver needs to allocate something, it > has from zone->pages_min*3/4 down to zone->pages_min/4 > space to allocate stuff, this should be very useful > for swap or mmap() over the network, or to encrypted > block devices, etc... Anything supposed to work because there's enough memory between zone->pages_min*3/4 and zone->pages_min/4 is just obviously broken period. > > Can you try to simply change NR_RESERVED to say 200*MAX_BUF_PER_PAGE > > and see if it makes a difference? > > No Comment(tm) *grin* I'm having lots of fun, thanks. Andrea ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:32 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 14:36 ` Rik van Riel 2001-05-26 15:03 ` Linux-2.4.5 Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 14:36 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > On Sat, May 26, 2001 at 01:45:27AM -0300, Rik van Riel wrote: > > 3) If the device driver needs to allocate something, it > > has from zone->pages_min*3/4 down to zone->pages_min/4 > > space to allocate stuff, this should be very useful > > for swap or mmap() over the network, or to encrypted > > block devices, etc... > > Anything supposed to work because there's enough memory between > zone->pages_min*3/4 and zone->pages_min/4 is just obviously broken > period. No. It's not about having enough memory between those levels. It's about _failing_ the allocation when you reach a limit. > > > Can you try to simply change NR_RESERVED to say 200*MAX_BUF_PER_PAGE > > > and see if it makes a difference? > > > > No Comment(tm) *grin* > > I'm having lots of fun, thanks. Now _this_ is tweaking magic limits ;) cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:36 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:03 ` Andrea Arcangeli 2001-05-26 15:08 ` Linux-2.4.5 Rik van Riel 0 siblings, 1 reply; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 15:03 UTC (permalink / raw) To: Rik van Riel; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, May 26, 2001 at 11:36:22AM -0300, Rik van Riel wrote: > On Sat, 26 May 2001, Andrea Arcangeli wrote: > > > No Comment(tm) *grin* > > > > I'm having lots of fun, thanks. > > Now _this_ is tweaking magic limits ;) Others agreed that the real source of the create_buffers deadlock could be just too few reserved pages in the unused_list; the unused_list is the only "deadlock avoidance logic" designed to keep create_buffers from deadlocking, so I really don't see why you don't listen to me and just assume my idea is obviously wrong while it obviously isn't. The only thing I can do is laugh while reading your illogical "No Comment(tm) *grin*"; if I took such comments seriously I would just ignore the VM and stay in other subsystems where those funny things never happen, as far as I can tell, which I'm not going to do just because of your new funny "No Comment(tm) *grin*". If for any reason you notice I somehow invite those things to happen, please let me know and I will certainly try to do better on my part. Andrea ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:03 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 15:08 ` Rik van Riel 2001-05-26 15:20 ` Linux-2.4.5 Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 15:08 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > Others agreed that the real source of the create_buffers could be just > too few reserved pages in the unused_list To quote you, from the message to which I replied with the "No Comment" comment: ------> Andrea Arcangeli wrote: Anything supposed to work because there's enough memory between zone->pages_min*3/4 and zone->pages_min/4 is just obviously broken period. ------ And not 10 lines lower you decide to raise some magic limit yourself. I guess my irony threshold must be lower than yours, or something. cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:08 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:20 ` Andrea Arcangeli 0 siblings, 0 replies; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 15:20 UTC (permalink / raw) To: Rik van Riel; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel [-- Attachment #1: Type: text/plain, Size: 637 bytes --] On Sat, May 26, 2001 at 12:08:34PM -0300, Rik van Riel wrote: > On Sat, 26 May 2001, Andrea Arcangeli wrote: > > > Others agreed that the real source of the create_buffers could be just > > too few reserved pages in the unused_list > > To quote you, from the message to which I replied with the > "No Comment" comment: > > ------> Andrea Arcangeli wrote: > Anything supposed to work because there's enough memory between > zone->pages_min*3/4 and zone->pages_min/4 is just obviously broken > period. > ------ What are you smoking again? You said "No Comment(tm) *grin*" to my suggestion to increase NR_RESERVED (attached). Andrea [-- Attachment #2: Type: message/rfc822, Size: 3674 bytes --] From: Rik van Riel <riel@conectiva.com.br> To: Andrea Arcangeli <andrea@suse.de> Cc: Ben LaHaise <bcrl@redhat.com>, Linus Torvalds <torvalds@transmeta.com>, Alan Cox <alan@lxorguk.ukuu.org.uk>, linux-kernel@vger.kernel.org Subject: Re: Linux-2.4.5 Date: Sat, 26 May 2001 01:45:27 -0300 (BRST) Message-ID: <Pine.LNX.4.21.0105260137140.30264-100000@imladris.rielhome.conectiva> On Sat, 26 May 2001, Andrea Arcangeli wrote: > On Fri, May 25, 2001 at 10:49:38PM -0400, Ben LaHaise wrote: > > Highmem. 0 free pages in ZONE_NORMAL. Now try to allocate a buffer_head. > > That's a longstanding deadlock, it was there the first time I read > fs/buffer.c, nothing related to highmem, we have it in 2.2 too. Also > getblk is deadlock prone in a smiliar manner. Not only this, but the fix is _surprisingly_ simple... All we need to make sure of is that the following order of allocations is possible and that we back out instead of deadlock when we don't succeed at any step. 
1) normal user allocation 2) buffer allocation (bounce buffer + bufferhead) 3) allocation from interrupt (for device driver) This is fixed by the patch I sent because: 1) user allocations stop when we reach zone->pages_min and keep looping until we freed some memory ... well, these don't just loop because we can guarantee that freeing memory with GFP_USER or GFP_KERNEL is possible 2) GFP_BUFFER allocations can allocate down to the point where free pages go to zone->pages_min * 3 / 4, so we can continue to swapout highmem pages when userland allocations have stopped ... this is needed because otherwise we cannot always make progress on highmem pages and we could have the effective amount of RAM in the system reduced to less than 1GB, in really bad cases not having this could even cause a deadlock 3) If the device driver needs to allocate something, it has from zone->pages_min*3/4 down to zone->pages_min/4 space to allocate stuff, this should be very useful for swap or mmap() over the network, or to encrypted block devices, etc... > Can you try to simply change NR_RESERVED to say 200*MAX_BUF_PER_PAGE > and see if it makes a difference? No Comment(tm) *grin* cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 2:11 ` Linux-2.4.5 Ben LaHaise 2001-05-26 2:38 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 15:41 ` Rik van Riel 1 sibling, 0 replies; 64+ messages in thread From: Rik van Riel @ 2001-05-26 15:41 UTC (permalink / raw) To: Ben LaHaise; +Cc: Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-kernel On Fri, 25 May 2001, Ben LaHaise wrote: > +++ work/mm/vmscan.c Mon May 14 16:43:05 2001 > @@ -636,6 +636,12 @@ > + if (gfp_mask & __GFP_WAIT) { Without __GFP_WAIT, we will never call do_try_to_free_pages() Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup 2001-05-26 0:29 ` Ben LaHaise 2001-05-26 0:34 ` Linus Torvalds @ 2001-05-26 0:42 ` Andrea Arcangeli 2001-05-26 0:52 ` Ben LaHaise 2001-05-26 1:43 ` Rik van Riel 1 sibling, 2 replies; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 0:42 UTC (permalink / raw) To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel On Fri, May 25, 2001 at 08:29:38PM -0400, Ben LaHaise wrote: > amount of bounce buffers to guarentee progress while submitting io. The > -ac kernels have a patch from Ingo that provides private pools for bounce > buffers and buffer_heads. I went a step further and have a memory > reservation patch that provides for memory pools being reserved against a > particular zone. This is needed to prevent the starvation that irq > allocations can cause. > > Some of these cleanups are 2.5 fodder, but we really need something in 2.4 > right now, so... Please merge this one in 2.4 for now (originally from Ingo, I only improved it), this is a real definitive fix and there's no nicer way to handle that unless you want to generalize an API for people to generate private anti-deadlock ("make sure to always make a progress") memory pools: diff -urN 2.4.4/mm/highmem.c highmem-deadlock/mm/highmem.c --- 2.4.4/mm/highmem.c Sat Apr 28 05:24:48 2001 +++ highmem-deadlock/mm/highmem.c Sat Apr 28 18:21:24 2001 @@ -159,6 +159,19 @@ spin_unlock(&kmap_lock); } +#define POOL_SIZE 32 + +/* + * This lock gets no contention at all, normally. + */ +static spinlock_t emergency_lock = SPIN_LOCK_UNLOCKED; + +int nr_emergency_pages; +static LIST_HEAD(emergency_pages); + +int nr_emergency_bhs; +static LIST_HEAD(emergency_bhs); + /* * Simple bounce buffer support for highmem pages. * This will be moved to the block layer in 2.5. 
@@ -203,17 +216,72 @@ static inline void bounce_end_io (struct buffer_head *bh, int uptodate) { + struct page *page; struct buffer_head *bh_orig = (struct buffer_head *)(bh->b_private); + unsigned long flags; bh_orig->b_end_io(bh_orig, uptodate); - __free_page(bh->b_page); + + page = bh->b_page; + + spin_lock_irqsave(&emergency_lock, flags); + if (nr_emergency_pages >= POOL_SIZE) + __free_page(page); + else { + /* + * We are abusing page->list to manage + * the highmem emergency pool: + */ + list_add(&page->list, &emergency_pages); + nr_emergency_pages++; + } + + if (nr_emergency_bhs >= POOL_SIZE) { #ifdef HIGHMEM_DEBUG - /* Don't clobber the constructed slab cache */ - init_waitqueue_head(&bh->b_wait); + /* Don't clobber the constructed slab cache */ + init_waitqueue_head(&bh->b_wait); #endif - kmem_cache_free(bh_cachep, bh); + kmem_cache_free(bh_cachep, bh); + } else { + /* + * Ditto in the bh case, here we abuse b_inode_buffers: + */ + list_add(&bh->b_inode_buffers, &emergency_bhs); + nr_emergency_bhs++; + } + spin_unlock_irqrestore(&emergency_lock, flags); } +static __init int init_emergency_pool(void) +{ + spin_lock_irq(&emergency_lock); + while (nr_emergency_pages < POOL_SIZE) { + struct page * page = alloc_page(GFP_ATOMIC); + if (!page) { + printk("couldn't refill highmem emergency pages"); + break; + } + list_add(&page->list, &emergency_pages); + nr_emergency_pages++; + } + while (nr_emergency_bhs < POOL_SIZE) { + struct buffer_head * bh = kmem_cache_alloc(bh_cachep, SLAB_ATOMIC); + if (!bh) { + printk("couldn't refill highmem emergency bhs"); + break; + } + list_add(&bh->b_inode_buffers, &emergency_bhs); + nr_emergency_bhs++; + } + spin_unlock_irq(&emergency_lock); + printk("allocated %d pages and %d bhs reserved for the highmem bounces\n", + nr_emergency_pages, nr_emergency_bhs); + + return 0; +} + +__initcall(init_emergency_pool); + static void bounce_end_io_write (struct buffer_head *bh, int uptodate) { bounce_end_io(bh, uptodate); @@ -228,6 +296,82 @@ 
 	bounce_end_io(bh, uptodate);
 }
 
+struct page *alloc_bounce_page (void)
+{
+	struct list_head *tmp;
+	struct page *page;
+
+repeat_alloc:
+	page = alloc_page(GFP_BUFFER);
+	if (page)
+		return page;
+	/*
+	 * No luck. First, kick the VM so it doesnt idle around while
+	 * we are using up our emergency rations.
+	 */
+	wakeup_bdflush(0);
+
+	/*
+	 * Try to allocate from the emergency pool.
+	 */
+	tmp = &emergency_pages;
+	spin_lock_irq(&emergency_lock);
+	if (!list_empty(tmp)) {
+		page = list_entry(tmp->next, struct page, list);
+		list_del(tmp->next);
+		nr_emergency_pages--;
+	}
+	spin_unlock_irq(&emergency_lock);
+	if (page)
+		return page;
+
+	/* we need to wait I/O completion */
+	run_task_queue(&tq_disk);
+
+	current->policy |= SCHED_YIELD;
+	__set_current_state(TASK_RUNNING);
+	schedule();
+	goto repeat_alloc;
+}
+
+struct buffer_head *alloc_bounce_bh (void)
+{
+	struct list_head *tmp;
+	struct buffer_head *bh;
+
+repeat_alloc:
+	bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER);
+	if (bh)
+		return bh;
+	/*
+	 * No luck. First, kick the VM so it doesnt idle around while
+	 * we are using up our emergency rations.
+	 */
+	wakeup_bdflush(0);
+
+	/*
+	 * Try to allocate from the emergency pool.
+	 */
+	tmp = &emergency_bhs;
+	spin_lock_irq(&emergency_lock);
+	if (!list_empty(tmp)) {
+		bh = list_entry(tmp->next, struct buffer_head, b_inode_buffers);
+		list_del(tmp->next);
+		nr_emergency_bhs--;
+	}
+	spin_unlock_irq(&emergency_lock);
+	if (bh)
+		return bh;
+
+	/* we need to wait I/O completion */
+	run_task_queue(&tq_disk);
+
+	current->policy |= SCHED_YIELD;
+	__set_current_state(TASK_RUNNING);
+	schedule();
+	goto repeat_alloc;
+}
+
 struct buffer_head * create_bounce(int rw, struct buffer_head * bh_orig)
 {
 	struct page *page;
@@ -236,24 +380,15 @@
 	if (!PageHighMem(bh_orig->b_page))
 		return bh_orig;
 
-repeat_bh:
-	bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER);
-	if (!bh) {
-		wakeup_bdflush(1); /* Sets task->state to TASK_RUNNING */
-		goto repeat_bh;
-	}
+	bh = alloc_bounce_bh();
 	/*
 	 * This is wasteful for 1k buffers, but this is a stopgap measure
 	 * and we are being ineffective anyway. This approach simplifies
 	 * things immensly. On boxes with more than 4GB RAM this should
 	 * not be an issue anyway.
 	 */
-repeat_page:
-	page = alloc_page(GFP_BUFFER);
-	if (!page) {
-		wakeup_bdflush(1); /* Sets task->state to TASK_RUNNING */
-		goto repeat_page;
-	}
+	page = alloc_bounce_page();
+
 	set_bh_page(bh, page, 0);
 
 	bh->b_next = NULL;

And this one as well to avoid tight loops in getblk without reschedules
in between when normal zone is empty:

diff -urN 2.4.4pre1/fs/buffer.c 2.4.4pre1-blkdev/fs/buffer.c
--- 2.4.4pre1/fs/buffer.c	Sun Apr  1 01:17:30 2001
+++ 2.4.4pre1-blkdev/fs/buffer.c	Mon Apr  9 15:37:20 2001
@@ -628,7 +622,7 @@
    to do in order to release the ramdisk memory is to destroy dirty buffers.
 
    These are two special cases. Normal usage imply the device driver
-   to issue a sync on the device (without waiting I/O completation) and
+   to issue a sync on the device (without waiting I/O completion) and
    then an invalidate_buffers call that doesn't trash dirty buffers.
 */
 void __invalidate_buffers(kdev_t dev, int destroy_dirty_buffers)
 {
@@ -762,7 +756,12 @@
 		balance_dirty(NODEV);
 	if (free_shortage())
 		page_launder(GFP_BUFFER, 0);
-	grow_buffers(size);
+	if (!grow_buffers(size)) {
+		wakeup_bdflush(1);
+		current->policy |= SCHED_YIELD;
+		__set_current_state(TASK_RUNNING);
+		schedule();
+	}
 }
 
 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
@@ -1027,12 +1026,13 @@
 		write_unlock(&hash_table_lock);
 		spin_unlock(&lru_list_lock);
 		refill_freelist(size);
+		/* FIXME: getblk should fail if there's no enough memory */
 		goto repeat;
 	}
 
 /* -1 -> no need to flush
     0 -> async flush
-    1 -> sync flush (wait for I/O completation) */
+    1 -> sync flush (wait for I/O completion) */
 int balance_dirty_state(kdev_t dev)
 {
 	unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;
@@ -1431,6 +1431,7 @@
 {
 	struct buffer_head *bh, *head, *tail;
 
+	/* FIXME: create_buffers should fail if there's no enough memory */
 	head = create_buffers(page, blocksize, 1);
 	if (page->buffers)
 		BUG();
@@ -2367,11 +2368,9 @@
 	spin_lock(&free_list[index].lock);
 	tmp = bh;
 	do {
-		struct buffer_head *p = tmp;
-
-		tmp = tmp->b_this_page;
-		if (buffer_busy(p))
+		if (buffer_busy(tmp))
 			goto busy_buffer_page;
+		tmp = tmp->b_this_page;
 	} while (tmp != bh);
 	spin_lock(&unused_list_lock);

Andrea
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  0:42     ` [with-PATCH-really] highmem deadlock removal, balancing & cleanup Andrea Arcangeli
@ 2001-05-26  0:52       ` Ben LaHaise
  2001-05-26  1:27         ` Andrea Arcangeli
  2001-05-26  1:43       ` Rik van Riel
  1 sibling, 1 reply; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26  0:52 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> Please merge this one in 2.4 for now (originally from Ingo, I only
> improved it), this is a real definitive fix and there's no nicer way to
> handle that unless you want to generalize an API for people to generate
> private anti-deadlock ("make sure to always make a progress") memory
> pools:

Alternatively, the following might be more interesting...

		-ben

diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/arch/i386/kernel/irq.c linux-toolbox-current/arch/i386/kernel/irq.c
--- kernel-2.4.3-works/linux.orig/arch/i386/kernel/irq.c	Thu May 10 16:04:39 2001
+++ linux-toolbox-current/arch/i386/kernel/irq.c	Thu May 10 12:16:21 2001
@@ -32,6 +32,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/irq.h>
 #include <linux/proc_fs.h>
+#include <linux/mm/reservation.h>
 
 #include <asm/atomic.h>
 #include <asm/io.h>
@@ -576,7 +577,10 @@
 	irq_desc_t *desc = irq_desc + irq;
 	struct irqaction * action;
 	unsigned int status;
+	struct page_reservation *saved_irq_rsv;
 
+	saved_irq_rsv = current->page_reservations;
+	current->page_reservations = &irq_rsv;
 	kstat.irqs[cpu][irq]++;
 	spin_lock(&desc->lock);
 	desc->handler->ack(irq);
@@ -638,6 +642,7 @@
 	if (softirq_active(cpu) & softirq_mask(cpu))
 		do_softirq();
 
+	current->page_reservations = saved_irq_rsv;
 	return 1;
 }
 
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/arch/i386/mm/fault.c linux-toolbox-current/arch/i386/mm/fault.c
--- kernel-2.4.3-works/linux.orig/arch/i386/mm/fault.c	Thu May 10 16:04:40 2001
+++ linux-toolbox-current/arch/i386/mm/fault.c	Mon May 14 13:26:57 2001
@@ -196,6 +196,7 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
+	atomic_inc(&mm->in_fault_count);
 	down_read(&mm->mmap_sem);
 
 	vma = find_vma(mm, address);
@@ -269,6 +270,7 @@
 		if (bit < 32)
 			tsk->thread.screen_bitmap |= 1 << bit;
 	}
+	atomic_dec(&mm->in_fault_count);
 	up_read(&mm->mmap_sem);
 	return;
 
@@ -277,6 +279,7 @@
 * Fix it, but check if it's kernel or user first..
 */
 bad_area:
+	atomic_dec(&mm->in_fault_count);
 	up_read(&mm->mmap_sem);
 
 bad_area_nosemaphore:
@@ -339,6 +342,7 @@
 * us unable to handle the page fault gracefully.
 */
 out_of_memory:
+	atomic_dec(&mm->in_fault_count);
 	up_read(&mm->mmap_sem);
 	printk("VM: killing process %s\n", tsk->comm);
 	if (error_code & 4)
@@ -346,6 +350,7 @@
 		goto no_context;
 
 do_sigbus:
+	atomic_dec(&mm->in_fault_count);
 	up_read(&mm->mmap_sem);
 
 	/*
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/fs/buffer.c linux-toolbox-current/fs/buffer.c
--- kernel-2.4.3-works/linux.orig/fs/buffer.c	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/fs/buffer.c	Fri May 11 16:42:19 2001
@@ -45,6 +45,7 @@
 #include <linux/quotaops.h>
 #include <linux/iobuf.h>
 #include <linux/highmem.h>
+#include <linux/mm/reservation.h>
 
 #include <asm/uaccess.h>
 #include <asm/io.h>
@@ -2735,6 +2736,7 @@
 */
 int bdflush(void *sem)
 {
+	static struct page_reservation rsv;
 	struct task_struct *tsk = current;
 	int flushed;
 	/*
@@ -2748,6 +2750,12 @@
 	strcpy(tsk->comm, "bdflush");
 	bdflush_tsk = tsk;
 
+	init_page_reservation(&rsv, RSV_MULTISHOT, ZONE_NORMAL);
+	if (reserve_pages(&rsv, GFP_KERNEL, 64))
+		panic("bdflush unable to reserve emergency pages!\n");
+	tsk->page_reservations = &rsv;
+
+	/* avoid getting signals */
 	spin_lock_irq(&tsk->sigmask_lock);
 	flush_signals(tsk);
@@ -2778,6 +2786,8 @@
 			the next schedule will block.
 		*/
 			__set_current_state(TASK_RUNNING);
 		}
+
+	destroy_page_reservation(&rsv);
 }
 
 /*
@@ -2788,6 +2798,7 @@
 */
 int kupdate(void *sem)
 {
+	static struct page_reservation rsv;
 	struct task_struct * tsk = current;
 	int interval;
 
@@ -2795,6 +2806,11 @@
 	tsk->pgrp = 1;
 	strcpy(tsk->comm, "kupdated");
 
+	init_page_reservation(&rsv, RSV_MULTISHOT, ZONE_NORMAL);
+	if (reserve_pages(&rsv, GFP_KERNEL, 32))
+		panic("bdflush unable to reserve emergency pages!\n");
+	tsk->page_reservations = &rsv;
+
 	/* sigstop and sigcont will stop and wakeup kupdate */
 	spin_lock_irq(&tsk->sigmask_lock);
 	sigfillset(&tsk->blocked);
@@ -2833,6 +2849,7 @@
 #endif
 		sync_old_buffers();
 	}
+	destroy_page_reservation(&rsv);
 }
 
 static int __init bdflush_init(void)
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/include/linux/mm/reservation.h linux-toolbox-current/include/linux/mm/reservation.h
--- kernel-2.4.3-works/linux.orig/include/linux/mm/reservation.h	Wed Dec 31 19:00:00 1969
+++ linux-toolbox-current/include/linux/mm/reservation.h	Thu May 10 12:16:21 2001
@@ -0,0 +1,48 @@
+#ifndef __LINUX__MM__RESERVATION_H
+#define __LINUX__MM__RESERVATION_H
+/* inclinde/linux/mm/reservation.h
+ * written by Benjamin LaHaise
+ *
+ * Copyright 2001 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * based in part on ideas/code from Arjan Van de Ven and Stephen Tweedie.
+ */
+
+#define RSV_ONESHOT	0x00
+#define RSV_MULTISHOT	0x01	/* reservation will replenish itself */
+
+struct page_reservation {
+	struct list_head	list;
+	unsigned		avail, used;
+	int			flags;
+	zone_t			*zone;
+};
+
+extern struct page_reservation irq_rsv;
+
+extern void init_page_reservation(struct page_reservation *rsv, int flags, int zone);
+extern void destroy_page_reservation(struct page_reservation *rsv);
+
+/* Reservation is an all or nothing thing.  A successful reservation
+ * returns 0.  Anything else is a failure.
+ */
+extern int reserve_pages(struct page_reservation *rsv, int gfp_mask, unsigned count);
+
+/* Release a previously reserved amount of memory. */
+extern void put_reserved_pages(struct page_reservation *rsv, unsigned count);
+
+#endif
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/include/linux/mm.h linux-toolbox-current/include/linux/mm.h
--- kernel-2.4.3-works/linux.orig/include/linux/mm.h	Thu May 10 16:04:40 2001
+++ linux-toolbox-current/include/linux/mm.h	Mon May 14 13:33:43 2001
@@ -528,7 +528,7 @@
 
 #define GFP_BOUNCE	(__GFP_HIGH | __GFP_FAIL)
-#define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT)
+#define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT | __GFP_FAIL)
 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_USER	( __GFP_WAIT | __GFP_IO)
 #define GFP_HIGHUSER	( __GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/include/linux/mmzone.h linux-toolbox-current/include/linux/mmzone.h
--- kernel-2.4.3-works/linux.orig/include/linux/mmzone.h	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/include/linux/mmzone.h	Fri May 11 20:46:13 2001
@@ -50,6 +50,10 @@
 	unsigned long		inactive_dirty_pages;
 	unsigned long		pages_min, pages_low, pages_high;
 
+	/* Page reservation */
+	unsigned long		reserved_pages;
+	struct list_head	depleted_rsv_list;
+
 	/*
 	 * free areas of different sizes
 	 */
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/include/linux/sched.h linux-toolbox-current/include/linux/sched.h
--- kernel-2.4.3-works/linux.orig/include/linux/sched.h	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/include/linux/sched.h	Mon May 14 13:23:55 2001
@@ -210,6 +210,7 @@
 	pgd_t * pgd;
 	atomic_t mm_users;	/* How many users with user space? */
 	atomic_t mm_count;	/* How many references to "struct mm_struct" (users count as 1) */
+	atomic_t in_fault_count;	/* number of in progress page faults */
 	int map_count;		/* number of VMAs */
 	struct rw_semaphore mmap_sem;
 	spinlock_t page_table_lock;	/* Protects task page tables and mm->rss */
@@ -241,6 +242,7 @@
 	pgd:		swapper_pg_dir,		\
 	mm_users:	ATOMIC_INIT(2),		\
 	mm_count:	ATOMIC_INIT(1),		\
+	in_fault_count:	ATOMIC_INIT(0),		\
 	map_count:	1,			\
 	mmap_sem:	__RWSEM_INITIALIZER(name.mmap_sem), \
 	page_table_lock: SPIN_LOCK_UNLOCKED,	\
@@ -406,6 +408,8 @@
 	u32 self_exec_id;
 /* Protection of (de-)allocation: mm, files, fs, tty */
 	spinlock_t alloc_lock;
+
+	struct page_reservation	*page_reservations;
 };
 
 /*
@@ -486,7 +490,8 @@
 	sig:		&init_signals,					\
 	pending:	{ NULL, &tsk.pending.head, {{0}}},		\
 	blocked:	{{0}},						\
-	alloc_lock:	SPIN_LOCK_UNLOCKED				\
+	alloc_lock:	SPIN_LOCK_UNLOCKED,				\
+	page_reservations: NULL,					\
 }
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/init/main.c linux-toolbox-current/init/main.c
--- kernel-2.4.3-works/linux.orig/init/main.c	Thu May 10 16:04:39 2001
+++ linux-toolbox-current/init/main.c	Thu May 10 21:00:20 2001
@@ -28,6 +28,7 @@
 #include <linux/iobuf.h>
 #include <linux/bootmem.h>
 #include <linux/tty.h>
+#include <linux/mm/reservation.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -88,6 +89,9 @@
 
 static int init(void *);
 
+extern struct page_reservation atomic_rsv;
+extern struct page_reservation swap_rsv;
+
 extern void init_IRQ(void);
 extern void init_modules(void);
 extern void sock_init(void);
@@ -654,6 +658,13 @@
 	proc_root_init();
 #endif
 	mempages = num_physpages;
+
+	if (reserve_pages(&irq_rsv, GFP_KERNEL, mempages >> 9))
+		panic("unable to reserve irq memory.\n");
+	if (reserve_pages(&swap_rsv, GFP_KERNEL, mempages >> 9))
+		panic("unable to reserve swap memory.\n");
+	if (reserve_pages(&atomic_rsv, GFP_KERNEL, mempages >> 10))
+		panic("unable to reserve atomic memory.\n");
 
 	fork_init(mempages);
 	proc_caches_init();
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/kernel/exit.c linux-toolbox-current/kernel/exit.c
--- kernel-2.4.3-works/linux.orig/kernel/exit.c	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/kernel/exit.c	Thu May 10 12:15:34 2001
@@ -10,6 +10,7 @@
 #include <linux/smp_lock.h>
 #include <linux/module.h>
 #include <linux/tty.h>
+#include <linux/mm/reservation.h>
 #ifdef CONFIG_BSD_PROCESS_ACCT
 #include <linux/acct.h>
 #endif
@@ -422,6 +423,11 @@
 NORET_TYPE void do_exit(long code)
 {
 	struct task_struct *tsk = current;
+
+	if (tsk->page_reservations) {
+		destroy_page_reservation(tsk->page_reservations);
+		tsk->page_reservations = NULL;
+	}
 
 	if (in_interrupt())
 		panic("Aiee, killing interrupt handler!");
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/kernel/fork.c linux-toolbox-current/kernel/fork.c
--- kernel-2.4.3-works/linux.orig/kernel/fork.c	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/kernel/fork.c	Mon May 14 13:24:45 2001
@@ -203,6 +203,7 @@
 {
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
+	atomic_set(&mm->in_fault_count, 0);
 	init_rwsem(&mm->mmap_sem);
 	mm->page_table_lock = SPIN_LOCK_UNLOCKED;
 	mm->pgd = pgd_alloc();
@@ -630,6 +631,7 @@
 	p->tty_old_pgrp = 0;
 	p->times.tms_utime = p->times.tms_stime = 0;
 	p->times.tms_cutime = p->times.tms_cstime = 0;
+	p->page_reservations = 0;
 #ifdef CONFIG_SMP
 	{
 		int i;
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/mm/page_alloc.c linux-toolbox-current/mm/page_alloc.c
--- kernel-2.4.3-works/linux.orig/mm/page_alloc.c	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/mm/page_alloc.c	Fri May 11 15:52:54 2001
@@ -18,7 +18,31 @@
 #include <linux/pagemap.h>
 #include <linux/bootmem.h>
 #include <linux/slab.h>
+#include <linux/mm/reservation.h>
 
+extern int
+	page_launder_calls,
+	page_launder_scans,
+	page_launder_scanned_pages,
+	page_launder_scanned_not_dirty,
+	page_launder_scanned_active,
+	page_launder_scanned_skipped,
+	page_launder_scanned_dirty,
+	page_launder_scanned_dirty_skip,
+	page_launder_scanned_dirty_nolaunder,
+	page_launder_scanned_dirty_swapcache,
+	page_launder_scanned_buffers,
+	page_launder_scanned_buffers_refiled,
+	page_launder_scanned_buffers_freed_buffer,
+	page_launder_scanned_buffers_active,
+	page_launder_scanned_buffers_cleaned,
+	page_launder_scanned_buffers_released,
+	page_launder_scanned_mapped_clean,
+	page_launder_scanned_still_active
+;
+struct page_reservation atomic_rsv;
+struct page_reservation swap_rsv;
+struct page_reservation irq_rsv;
 int nr_swap_pages;
 int nr_active_pages;
 int nr_inactive_dirty_pages;
@@ -99,7 +123,7 @@
 	page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty));
 	page->age = PAGE_AGE_START;
-	
+
 	zone = page->zone;
 
 	mask = (~0UL) << order;
@@ -115,7 +139,8 @@
 	__save_flags(flags);
 	__cli();
-	if (!order && (per_cpu->nr_pages < per_cpu->max_nr_pages)) {
+	if (!order && (per_cpu->nr_pages < per_cpu->max_nr_pages) && list_empty(&zone->depleted_rsv_list)) {
+static int foo; if (foo++ < 5) printk("freeing per-cpu page\n");
 		list_add(&page->list, &per_cpu->head);
 		per_cpu->nr_pages++;
 		__restore_flags(flags);
@@ -124,6 +149,20 @@
 
 	spin_lock(&zone->lock);
 
+	/* Check if we need to replenish any of this zone's reservations.
+	 */
+	if (!list_empty(&zone->depleted_rsv_list)) {
+		struct page_reservation *rsv = list_entry(zone->depleted_rsv_list.next, struct page_reservation, list);
+static int foo; if (foo++ < 5) printk("updating reserve: %p %u %u\n", rsv, rsv->avail, rsv->used);
+		if (!rsv->used)
+			BUG();
+		rsv->avail++;
+		rsv->used--;
+
+		list_del_init(&rsv->list);
+		if (rsv->used)
+			list_add(&rsv->list, zone->depleted_rsv_list.prev);
+	}
+
 	zone->free_pages -= mask;
 
 	while (mask + (1 << (MAX_ORDER-1))) {
@@ -190,12 +229,13 @@
 }
 
-static FASTCALL(struct page * rmqueue(zone_t *zone, unsigned long order));
-static struct page * rmqueue(zone_t *zone, unsigned long order)
+static FASTCALL(struct page * rmqueue(zone_t *zone, unsigned long order, struct page_reservation *rsv));
+static struct page * rmqueue(zone_t *zone, unsigned long order, struct page_reservation *rsv)
 {
 	per_cpu_t *per_cpu = zone->cpu_pages + smp_processor_id();
 	free_area_t * area = zone->free_area + order;
 	unsigned long curr_order = order;
+	unsigned long free_pages;
 	struct list_head *head, *curr;
 	unsigned long flags;
 	struct page *page;
@@ -216,7 +256,15 @@
 	}
 
 	spin_lock(&zone->lock);
-	do {
+
+	free_pages = zone->free_pages;
+	free_pages -= zone->reserved_pages;
+
+	/* Using a reservation?
+	 */
+	if (rsv)
+		free_pages += rsv->avail;
+
+	if (free_pages >= (1 << order)) do {
 		head = &area->free_list;
 		curr = memlist_next(head);
 
@@ -232,6 +280,16 @@
 			zone->free_pages -= 1 << order;
 
 			page = expand(zone, page, index, order, curr_order, area);
+			if (rsv && (rsv->avail >= (1 << order))) {
+static int foo; if (foo++ < 5) printk("alloc from reserv: %p %u %u\n", rsv, rsv->avail, rsv->used);
+				if (!rsv->used && (rsv->flags & RSV_MULTISHOT)) {
+static int foo; if (foo++ < 5) printk("multishot reserv: %p\n", rsv);
+					list_add(&rsv->list, &zone->depleted_rsv_list);
+					rsv->avail -= 1 << order;
+					rsv->used += 1 << order;
+}
+				zone->reserved_pages--;
+			}
 			spin_unlock_irqrestore(&zone->lock, flags);
 
 			set_page_count(page, 1);
@@ -265,6 +323,7 @@
 	for (;;) {
 		zone_t *z = *(zone++);
 		unsigned long water_mark;
+		unsigned long free_pages;
 
 		if (!z)
 			break;
@@ -275,6 +334,8 @@
 		 * We allocate if the number of free + inactive_clean
 		 * pages is above the watermark.
 		 */
+		free_pages = z->free_pages - z->reserved_pages;
+
 		switch (limit) {
 			default:
 			case PAGES_MIN:
@@ -287,14 +348,14 @@
 				water_mark = z->pages_high;
 		}
 
-		if (z->free_pages + z->inactive_clean_pages > water_mark) {
+		if (free_pages + z->inactive_clean_pages > water_mark) {
 			struct page *page = NULL;
 			/* If possible, reclaim a page directly. */
-			if (direct_reclaim && z->free_pages < z->pages_min + 8)
+			if (direct_reclaim && free_pages < z->pages_min + 8)
 				page = reclaim_page(z);
 			/* If that fails, fall back to rmqueue. */
 			if (!page)
-				page = rmqueue(z, order);
+				page = rmqueue(z, order, NULL);
 			if (page)
 				return page;
 		}
@@ -304,6 +365,8 @@
 	return NULL;
 }
 
+extern struct page *get_reserved_page (void);
+
 /*
 * This is the 'heart' of the zoned buddy allocator:
@@ -320,7 +383,7 @@
 	 * Allocations put pressure on the VM subsystem.
 	 */
 	memory_pressure++;
-	
+
 	/*
 	 * (If anyone calls gfp from interrupts nonatomically then it
 	 * will sooner or later tripped up by a schedule().)
@@ -351,11 +414,11 @@
 		if (!z->size)
 			BUG();
 
-		if (z->free_pages >= z->pages_low) {
-			page = rmqueue(z, order);
+		if (z->free_pages - z->reserved_pages >= z->pages_low) {
+			page = rmqueue(z, order, NULL);
 			if (page)
-				return page;
-		} else if (z->free_pages < z->pages_min &&
+				goto out_success;
+		} else if (z->free_pages - z->reserved_pages < z->pages_min &&
 					waitqueue_active(&kreclaimd_wait)) {
 				wake_up_interruptible(&kreclaimd_wait);
 		}
@@ -371,7 +434,7 @@
 	 */
 	page = __alloc_pages_limit(zonelist, order, PAGES_HIGH, direct_reclaim);
 	if (page)
-		return page;
+		goto out_success;
 
 	/*
 	 * Then try to allocate a page from a zone with more
@@ -383,7 +446,7 @@
 	 */
 	page = __alloc_pages_limit(zonelist, order, PAGES_LOW, direct_reclaim);
 	if (page)
-		return page;
+		goto out_success;
 
 	/*
 	 * OK, none of the zones on our zonelist has lots
@@ -418,8 +481,42 @@
 	 */
 	page = __alloc_pages_limit(zonelist, order, PAGES_MIN, direct_reclaim);
 	if (page)
-		return page;
+		goto out_success;
 
+	/* Memory reservation hook.  Note: memory reservations are
+	 * attempted after all other normal means of allocations have
+	 * failed.  Give it a try with the memory reservation and see
+	 * what happens.
+	 * TODO: with memory reservations in place, much of the code
+	 * below is completely bogus.  Clean this up! -ben
+	 */
+	if (current->page_reservations || (gfp_mask & __GFP_HIGH)) {
+		struct page_reservation *rsv;
+
+		if (gfp_mask & __GFP_HIGH)
+			rsv = &atomic_rsv;
+		else
+			rsv = current->page_reservations;
+
+		do {
+			zone = zonelist->zones;
+			for (;;) {
+				zone_t *z = *(zone++);
+				if (!z)
+					break;
+
+				if (z == rsv->zone) {
+static int foo; if (foo++ < 5) printk("trying reservation: %p\n", current->page_reservations);
+					page = rmqueue(z, order, rsv);
+					if (page)
+						goto out_success;
+					break;
+				}
+			}
+		} while (rsv == &atomic_rsv &&
+			 (rsv = current->page_reservations));
+	}
+
 	/*
 	 * If we dont want to try too hard then we can give up
 	 * now
@@ -465,9 +562,9 @@
 					break;
 				__free_page(page);
 				/* Try if the allocation succeeds.
 				 */
-				page = rmqueue(z, order);
+				page = rmqueue(z, order, NULL);
 				if (page)
-					return page;
+					goto out_success;
 			}
 		}
 	}
@@ -485,6 +582,11 @@
 			memory_pressure++;
 			try_to_free_pages(gfp_mask);
 			wakeup_bdflush(0);
+			if ((gfp_mask & __GFP_WAIT) && !(current->flags & PF_ATOMICALLOC)) {
+				__set_current_state(TASK_RUNNING);
+				current->policy |= SCHED_YIELD;
+				schedule();
+			}
 			goto try_again;
 		}
 	}
@@ -511,31 +613,25 @@
 		if (direct_reclaim) {
 			page = reclaim_page(z);
 			if (page)
-				return page;
+				goto out_success;
 		}
 
 		/* XXX: is pages_min/4 a good amount to reserve for this? */
-		if (z->free_pages < z->pages_min / 4 &&
+		if (z->free_pages - z->reserved_pages < z->pages_min / 4 &&
 		    !((current->flags & PF_MEMALLOC) &&
 		      (gfp_mask & __GFP_WAIT)))
 			continue;
-		page = rmqueue(z, order);
+		page = rmqueue(z, order, NULL);
 		if (page)
-			return page;
+			goto out_success;
 	}
 
-	// okay - we are in trouble, lets go to the DMA pool directly:
-
-	{
-		zone_t *z = pgdat_list->node_zones;
-
-		page = rmqueue(z, order);
-		if (page)
-			return page;
-	}
 	/* No luck..
	 */
 	printk(KERN_INFO "__alloc_pages: %lu-order allocation failed.\n", order);
 	return NULL;
+
+out_success:
+	return page;
 }
 
 /*
@@ -588,7 +684,7 @@
 	sum = 0;
 	while (pgdat) {
 		for (zone = pgdat->node_zones; zone < pgdat->node_zones + MAX_NR_ZONES; zone++)
-			sum += zone->free_pages;
+			sum += zone->free_pages - zone->reserved_pages;
 		pgdat = pgdat->node_next;
 	}
 	return sum;
@@ -605,7 +701,8 @@
 	sum = 0;
 	pgdat = pgdat_list;
 	while (pgdat) {
-		sum += (pgdat->node_zones+zone_type)->free_pages;
+		zone_t *z = pgdat->node_zones+zone_type;
+		sum += z->free_pages - z->reserved_pages;
 		pgdat = pgdat->node_next;
 	}
 	return sum;
@@ -694,6 +791,7 @@
 	while (pgdat) {
 		pages += pgdat->node_zones[ZONE_HIGHMEM].free_pages;
+		pages -= pgdat->node_zones[ZONE_HIGHMEM].reserved_pages;
 		pgdat = pgdat->node_next;
 	}
 	return pages;
@@ -723,13 +821,37 @@
 		freepages.low,
 		freepages.high);
 
+	printk( "%8u %8u %8u %8u %8u %8u\n"
+		"%8u %8u %8u %8u %8u %8u\n"
+		"%8u %8u %8u %8u %8u %8u\n",
+	page_launder_calls,
+	page_launder_scans,
+	page_launder_scanned_pages,
+	page_launder_scanned_not_dirty,
+	page_launder_scanned_active,
+	page_launder_scanned_skipped,
+	page_launder_scanned_dirty,
+	page_launder_scanned_dirty_skip,
+	page_launder_scanned_dirty_nolaunder,
+	page_launder_scanned_dirty_swapcache,
+	page_launder_scanned_buffers,
+	page_launder_scanned_buffers_refiled,
+	page_launder_scanned_buffers_freed_buffer,
+	page_launder_scanned_buffers_active,
+	page_launder_scanned_buffers_cleaned,
+	page_launder_scanned_buffers_released,
+	page_launder_scanned_mapped_clean,
+	page_launder_scanned_still_active
+);
+
 	for (type = 0; type < MAX_NR_ZONES; type++) {
 		struct list_head *head, *curr;
 		zone_t *zone = pgdat->node_zones + type;
-		unsigned long nr, total, flags;
+		unsigned long nr, total, flags, reserved;
 
-		total = 0;
+		reserved = total = 0;
 		if (zone->size) {
+			printk("Zone %s: ", zone->name);
 			spin_lock_irqsave(&zone->lock, flags);
 			for (order = 0; order < MAX_ORDER; order++) {
 				head = &(zone->free_area + order)->free_list;
@@ -745,9 +867,10 @@
 				printk("%lu*%lukB ", nr,
 						(PAGE_SIZE>>10) << order);
 			}
+			reserved = zone->reserved_pages;
 			spin_unlock_irqrestore(&zone->lock, flags);
 		}
-		printk("= %lukB)\n", total * (PAGE_SIZE>>10));
+		printk("= %lukB) Reserved: %lukB\n", total * (PAGE_SIZE>>10), reserved * (PAGE_SIZE>>10));
 	}
 
 #ifdef SWAP_CACHE_INFO
@@ -901,8 +1024,11 @@
 		zone->lock = SPIN_LOCK_UNLOCKED;
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->reserved_pages = 0;
 		zone->inactive_clean_pages = 0;
 		zone->inactive_dirty_pages = 0;
+		zone->reserved_pages = 0;
+		INIT_LIST_HEAD(&zone->depleted_rsv_list);
 		memlist_init(&zone->inactive_clean_list);
 		if (!size)
 			continue;
@@ -914,7 +1040,7 @@
 			mask = zone_balance_min[j];
 		else if (mask > zone_balance_max[j])
 			mask = zone_balance_max[j];
-		zone->pages_min = mask;
+		zone->pages_min = 0;
 		zone->pages_low = mask*2;
 		zone->pages_high = mask*3;
 		/*
@@ -928,7 +1054,7 @@
 		 * for people who require it to catch load spikes in eg.
 		 * gigabit ethernet routing...
 		 */
-		freepages.min += mask;
+		freepages.min = 0;
 		freepages.low += mask*2;
 		freepages.high += mask*3;
 		zone->zone_mem_map = mem_map + offset;
@@ -955,12 +1081,16 @@
 			bitmap_size = size >> i;
 			bitmap_size = (bitmap_size + 7) >> 3;
 			bitmap_size = LONG_ALIGN(bitmap_size);
-			bitmap_size *= 2;
+			bitmap_size *= 1;
 			zone->free_area[i].map =
 			  (unsigned int *) alloc_bootmem_node(pgdat, bitmap_size);
 		}
 	}
 	build_zonelists(pgdat);
+
+	init_page_reservation(&irq_rsv, RSV_MULTISHOT, ZONE_NORMAL);
+	init_page_reservation(&swap_rsv, RSV_MULTISHOT, ZONE_NORMAL);
+	init_page_reservation(&atomic_rsv, RSV_MULTISHOT, ZONE_NORMAL);
 }
 
 void __init free_area_init(unsigned long *zones_size)
@@ -977,6 +1107,101 @@
 	for (j = 0; j < MAX_NR_ZONES; j++) printk("%d ", zone_balance_ratio[j]);
 	printk("\n");
 	return 1;
+}
+
+void init_page_reservation(struct page_reservation *rsv, int flags, int zone)
+{
+static int foo; if (foo++ < 5) printk("init_page_reservation(%p, %d, %d)\n", rsv, flags, zone);
+	INIT_LIST_HEAD(&rsv->list);
+	rsv->avail = 0;
+	rsv->used = 0;
+	rsv->flags = flags;
+
+	/* FIXME: This doesn't work properly on NUMA or multizoned setups.
+	 */
+	rsv->zone = pgdat_list->node_zones + zone;
+}
+
+void destroy_page_reservation(struct page_reservation *rsv)
+{
+	unsigned long flags;
+	zone_t *zone = rsv->zone;
+static int foo; if (foo++ < 5) printk("destroy_page_reservation(%p)\n", rsv);
+
+	spin_lock_irqsave(&zone->lock, flags);
+	zone->reserved_pages -= rsv->avail;
+	list_del_init(&rsv->list);	/* This relies on list_del_init being used */
+	spin_unlock_irqrestore(&zone->lock, flags);
+	memset(rsv, 0x57, sizeof(*rsv));
+}
+
+int reserve_pages(struct page_reservation *rsv, int gfp_mask, unsigned count)
+{
+	unsigned long flags, free_pages;
+	zone_t *zone = rsv->zone;
+	unsigned orig = count;
+	int tries = 5;
+static int foo; if (foo++ < 5) printk("reserve_pages(%p, %d, %u)\n", rsv, gfp_mask, count);
+
+	spin_lock_irqsave(&zone->lock, flags);
+	free_pages = zone->free_pages - zone->reserved_pages;
+	if (free_pages > count)
+		free_pages = count;
+	count -= free_pages;
+	zone->reserved_pages += free_pages;
+
+	rsv->used += count;
+	if (count)
+		zone->pages_min++;
+	list_del_init(&rsv->list);
+	if (rsv->used)
+		list_add(&rsv->list, zone->depleted_rsv_list.prev);
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	while (--tries && rsv->used) {
+		try_to_free_pages(gfp_mask);
+		if ((gfp_mask & __GFP_WAIT) && !(current->flags & PF_ATOMICALLOC)) {
+			__set_current_state(TASK_RUNNING);
+			current->policy |= SCHED_YIELD;
+			schedule();
+		}
+	}
+
+	if (count) {
+		spin_lock_irqsave(&zone->lock, flags);
+		zone->pages_min--;
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+
+	if (!rsv->used)
+		return 0;
+
+	put_reserved_pages(rsv, orig);
+	return -ENOMEM;
+}
+
+void put_reserved_pages(struct page_reservation *rsv, unsigned count)
+{
+	unsigned long flags;
+	zone_t *zone = rsv->zone;
+static int foo; if (foo++ < 5) printk("put_reserved_pages(%p, %u)\n", rsv, count);
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (rsv->used <= count) {
+		count -= rsv->used;
+		rsv->used = 0;
+	} else {
+		rsv->used -= count;
+		count = 0;
+	}
+
+	if (count > rsv->avail)
+		BUG();
+
+	rsv->avail -= count;
+	zone->reserved_pages -= count;
+	spin_unlock_irqrestore(&zone->lock, flags);
}
 
 __setup("memfrac=", setup_mem_frac);
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/mm/vmscan.c linux-toolbox-current/mm/vmscan.c
--- kernel-2.4.3-works/linux.orig/mm/vmscan.c	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/mm/vmscan.c	Mon May 14 13:30:26 2001
@@ -21,9 +21,32 @@
 #include <linux/init.h>
 #include <linux/highmem.h>
 #include <linux/file.h>
+#include <linux/mm/reservation.h>
 
 #include <asm/pgalloc.h>
 
+extern struct page_reservation swap_rsv;
+
+int page_launder_calls,
+	page_launder_scans,
+	page_launder_scanned_pages,
+	page_launder_scanned_not_dirty,
+	page_launder_scanned_active,
+	page_launder_scanned_skipped,
+	page_launder_scanned_dirty,
+	page_launder_scanned_dirty_skip,
+	page_launder_scanned_dirty_nolaunder,
+	page_launder_scanned_dirty_swapcache,
+	page_launder_scanned_buffers,
+	page_launder_scanned_buffers_refiled,
+	page_launder_scanned_buffers_freed_buffer,
+	page_launder_scanned_buffers_active,
+	page_launder_scanned_buffers_cleaned,
+	page_launder_scanned_buffers_released,
+	page_launder_scanned_mapped_clean,
+	page_launder_scanned_still_active
+;
+
 /*
 * The swap-out function returns 1 if it successfully
 * scanned all the pages it was asked to (`count').
@@ -230,6 +253,14 @@
 {
 	unsigned long address;
 	struct vm_area_struct* vma;
+	unsigned long min_rss = atomic_read(&mm->in_fault_count);
+
+	min_rss *= 64;
+	if (min_rss > 256)
+		min_rss = 256;
+
+	if (mm->rss <= min_rss)
+		return;
 
 	/*
 	 * Go through process' page directory.
@@ -466,11 +497,18 @@
 	flushed_pages = 0;
 	freed_pages = 0;
+	spin_lock(&pagemap_lru_lock);
+	page_launder_calls++;
+	spin_unlock(&pagemap_lru_lock);
+
 dirty_page_rescan:
 	spin_lock(&pagemap_lru_lock);
+	page_launder_scans++;
+
 	maxscan = nr_inactive_dirty_pages;
 	while ((page_lru = inactive_dirty_list.prev) != &inactive_dirty_list &&
 			maxscan-- > 0) {
+		page_launder_scanned_pages++;
 		page = list_entry(page_lru, struct page, lru);
 		zone = page->zone;
@@ -481,6 +519,7 @@
 			list_del(page_lru);
 			nr_inactive_dirty_pages--;
 			zone->inactive_dirty_pages--;
+			page_launder_scanned_not_dirty++;
 			continue;
 		}
@@ -490,6 +529,7 @@
 				page_ramdisk(page)) {
 			del_page_from_inactive_dirty_list(page);
 			add_page_to_active_list(page);
+			page_launder_scanned_active++;
 			continue;
 		}
@@ -505,7 +545,7 @@
 		if (launder_loop && !maxlaunder)
 			break;
 		if (launder_loop && zone->inactive_clean_pages +
-				zone->free_pages > zone->pages_high)
+				zone->free_pages - zone->reserved_pages > zone->pages_high)
 			goto skip_page;

 		/*
@@ -514,6 +554,7 @@
 		 */
 		if (TryLockPage(page)) {
 skip_page:
+			page_launder_scanned_skipped++;
 			list_del(page_lru);
 			list_add(page_lru, &inactive_dirty_list);
 			continue;
 		}
@@ -524,13 +565,19 @@
 		 * last copy..
 		 */
 		if (PageDirty(page)) {
+			struct page_reservation *saved_rsv;
 			int (*writepage)(struct page *) = page->mapping->a_ops->writepage;

-			if (!writepage || !can_get_io_locks)
+			page_launder_scanned_dirty++;
+
+			if (!writepage || !can_get_io_locks) {
+				page_launder_scanned_dirty_skip++;
 				goto page_active;
+			}

 			/* First time through? Move it to the back of the list */
 			if (!launder_loop) {
+				page_launder_scanned_dirty_nolaunder++;
 				list_del(page_lru);
 				list_add(page_lru, &inactive_dirty_list);
 				UnlockPage(page);
@@ -542,10 +589,16 @@
 			page_cache_get(page);
 			spin_unlock(&pagemap_lru_lock);

+			saved_rsv = current->page_reservations;
+			current->page_reservations = &swap_rsv;
 			writepage(page);
+			current->page_reservations = saved_rsv;
+			/* XXX: all ->writepage()s should use nr_async_pages */
 			if (!PageSwapCache(page))
 				flushed_pages++;
+			else
+				page_launder_scanned_dirty_swapcache++;
 			maxlaunder--;

 			page_cache_release(page);
@@ -565,6 +618,7 @@
 		 */
 		if (page->buffers) {
 			int wait, clearedbuf;
+			page_launder_scanned_buffers++;
 			/*
 			 * Since we might be doing disk IO, we have to
 			 * drop the spinlock and take an extra reference
@@ -594,21 +648,25 @@

 			/* The buffers were not freed. */
 			if (!clearedbuf) {
+				page_launder_scanned_buffers_refiled++;
 				add_page_to_inactive_dirty_list(page);
 				if (wait)
 					flushed_pages++;

 			/* The page was only in the buffer cache. */
 			} else if (!page->mapping) {
+				page_launder_scanned_buffers_freed_buffer++;
 				atomic_dec(&buffermem_pages);
 				freed_pages++;

 			/* The page has more users besides the cache and us. */
 			} else if (page_count(page) > 2) {
+				page_launder_scanned_buffers_active++;
 				add_page_to_active_list(page);

 			/* OK, we "created" a freeable page. */
 			} else /* page->mapping && page_count(page) == 2 */ {
+				page_launder_scanned_buffers_cleaned++;
 				add_page_to_inactive_clean_list(page);
 				freed_pages++;
 			}
@@ -618,10 +676,12 @@
 			 * We can only do it here because we are accessing
 			 * the page struct above.
 			 */
+			page_launder_scanned_buffers_released++;
 			UnlockPage(page);
 			page_cache_release(page);

 		} else if (page->mapping && !PageDirty(page)) {
+			page_launder_scanned_mapped_clean++;
 			/*
 			 * If a page had an extra reference in
 			 * deactivate_page(), we will find it here.
@@ -634,6 +694,7 @@
 			freed_pages++;

 		} else {
+			page_launder_scanned_still_active++;
 page_active:
 			/*
 			 * OK, we don't know what to do with the page.
@@ -660,6 +721,13 @@
 	 */
 	shortage = free_shortage();
 	if (can_get_io_locks && !launder_loop && shortage) {
+		if (gfp_mask & __GFP_WAIT) {
+			__set_current_state(TASK_RUNNING);
+			current->policy |= SCHED_YIELD;
+			schedule();
+		}
+
+		shortage = free_shortage();
 		launder_loop = 1;

 		/*
@@ -835,10 +903,11 @@
 	for(i = 0; i < MAX_NR_ZONES; i++) {
 		zone_t *zone = pgdat->node_zones + i;
 		if (zone->size && (zone->inactive_clean_pages +
-				zone->free_pages < zone->pages_min+1)) {
+				zone->free_pages - zone->reserved_pages < zone->pages_min+1)) {
 			/* + 1 to have overlap with alloc_pages() !! */
 			sum += zone->pages_min + 1;
 			sum -= zone->free_pages;
+			sum += zone->reserved_pages;
 			sum -= zone->inactive_clean_pages;
 		}
 	}
@@ -881,6 +950,7 @@
 			zone_shortage -= zone->inactive_dirty_pages;
 			zone_shortage -= zone->inactive_clean_pages;
 			zone_shortage -= zone->free_pages;
+			zone_shortage += zone->reserved_pages;
 			if (zone_shortage > 0)
 				shortage += zone_shortage;
 		}
@@ -1009,6 +1079,7 @@
 int kswapd(void *unused)
 {
+	static struct page_reservation kswapd_rsv;
 	struct task_struct *tsk = current;

 	tsk->session = 1;
@@ -1016,6 +1087,11 @@
 	strcpy(tsk->comm, "kswapd");
 	sigfillset(&tsk->blocked);
 	kswapd_task = tsk;
+
+	init_page_reservation(&kswapd_rsv, RSV_MULTISHOT, ZONE_NORMAL);
+	if (reserve_pages(&kswapd_rsv, GFP_KERNEL, 32))
+		panic("kswapd unable to reserve emergency pages!\n");
+	tsk->page_reservations = &kswapd_rsv;

 	/*
 	 * Tell the memory management that we're a "memory allocator",
@@ -1086,6 +1162,8 @@
 				oom_kill();
 		}
 	}
+
+	destroy_page_reservation(&kswapd_rsv);
 }

 void wakeup_kswapd(void)
@@ -1102,6 +1180,10 @@
 int try_to_free_pages(unsigned int gfp_mask)
 {
 	int ret = 1;
+	struct page_reservation *rsv = current->page_reservations;
+
+	if (!rsv)
+		current->page_reservations = &swap_rsv;

 	if (gfp_mask & __GFP_WAIT) {
 		unsigned long caller_memalloc = current->flags & PF_MEMALLOC;
@@ -1111,6 +1193,8 @@
 		current->flags |= caller_memalloc;
 	}

+	current->page_reservations = rsv;
+
 	return ret;
 }
@@ -1151,7 +1235,7 @@
 		if (!zone->size)
 			continue;

-		while (zone->free_pages < zone->pages_low) {
+		while (zone->free_pages - zone->reserved_pages < zone->pages_low) {
 			struct page * page;

 			page = reclaim_page(zone);
 			if (!page)

^ permalink raw reply	[flat|nested] 64+ messages in thread
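The recurring one-line change in the patch — comparing `free_pages - reserved_pages` against the watermarks instead of `free_pages` alone — is easy to model. The sketch below uses an invented `struct zone_sketch` standing in for the kernel's `zone_t`, and only illustrates why a zone whose remaining "free" pages are all reserved still counts as short:

```c
#include <assert.h>

/* Toy model of the patch's watermark change: reserved pages no longer
 * count as free.  zone_sketch is an invented miniature of zone_t. */
struct zone_sketch {
        long free_pages;
        long reserved_pages;
        long pages_low;
};

/* Before the patch: free_pages < pages_low.
 * After the patch:  free_pages - reserved_pages < pages_low, so memory
 * held back for emergency allocations can no longer satisfy the
 * watermark on behalf of ordinary allocators. */
static int zone_is_short(const struct zone_sketch *z)
{
        return z->free_pages - z->reserved_pages < z->pages_low;
}
```

With a 32-page reservation (the amount kswapd reserves in the patch), a zone holding 40 free pages and a `pages_low` of 16 is short under the new test but would have looked healthy under the old one.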
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  0:52           ` Ben LaHaise
@ 2001-05-26  1:27             ` Andrea Arcangeli
  2001-05-26  1:38               ` Ben LaHaise
  0 siblings, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26  1:27 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 08:52:50PM -0400, Ben LaHaise wrote:
> On Sat, 26 May 2001, Andrea Arcangeli wrote:
>
> > Please merge this one in 2.4 for now (originally from Ingo, I only
> > improved it); this is a real, definitive fix, and there's no nicer way
> > to handle it unless you want to generalize an API for people to create
> > private anti-deadlock ("make sure to always make progress") memory
> > pools:
>
> Alternatively, the following might be more interesting...

Side note: it won't compile against pre6. It is against the RH kernel,
which has the tux stuff (PF_ATOMICALLOC in this case) in it; that is
also included in 2.4.5pre6aa1 if you apply the optional patches in the
30_tux directory.

I only had a short look, but unless I'm missing something it cannot fix
anything in the highmem area. You are not reserving anything for highmem
bounces in particular; you're reserving for swap/irq/atomic allocations
in general, or for some task in particular. That is just broken design,
I think, because you will fail anyway eventually: the layers stacked
under those allocations give you no guarantee of making progress. The
pool must live in the very last layer of allocations if you want any
guarantee of eventually making progress. create_bounces() is in fact
called as the very last thing before the atomic stuff. The loop device
can deadlock as well, since it would need its own private pool after
create_bounces() in the case of a non-noop transfer function (but loop
is not a showstopper, so we haven't addressed that yet; create_bounces()
itself is a showstopper).

The only case where a reserved pool makes sense is when you know that by
waiting (i.e. running a task queue, scheduling, and trying the
allocation again later) you will eventually succeed for sure (possibly
with a FIFO policy to make starvation impossible as well, not only
deadlocks). If you don't have that guarantee, those pools automatically
become nothing but a source-code and runtime waste, and they may even
hide core bugs in the allocator or the like.

Andrea
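Andrea's criterion — a reserved pool only helps when waiting is guaranteed to succeed eventually — can be sketched as a fixed-size pool whose elements always come back on free, so a blocked allocator is guaranteed forward progress once any current user completes (this is essentially the idea that later shipped in Linux as mempool_t). The names below are invented for the sketch, not taken from either patch:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

/* A toy fixed-size pool: every object handed out is guaranteed to be
 * returned, so a waiter that retries after any free always makes
 * progress -- the property Andrea demands of a reserved pool. */
struct toy_pool {
        void *elem[POOL_SIZE];
        int nr;                    /* elements currently available */
};

static void *toy_pool_alloc(struct toy_pool *p)
{
        /* The kernel would sleep and retry here; the toy just fails. */
        return p->nr > 0 ? p->elem[--p->nr] : NULL;
}

static void toy_pool_free(struct toy_pool *p, void *obj)
{
        assert(p->nr < POOL_SIZE); /* pool objects always fit back */
        p->elem[p->nr++] = obj;    /* this refill is the progress guarantee */
}
```

The guarantee holds only because pool objects cannot leak to users who never free them — exactly the stacking problem Andrea is pointing at: if a lower layer can sit on a pool object forever, the eventual-progress argument collapses.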
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  1:27             ` Andrea Arcangeli
@ 2001-05-26  1:38               ` Ben LaHaise
  2001-05-26  1:49                 ` Andrea Arcangeli
  0 siblings, 1 reply; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26  1:38 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> Side note: it won't compile against pre6. It is against the RH kernel,
> which has the tux stuff (PF_ATOMICALLOC in this case) in it; that is
> also included in 2.4.5pre6aa1 if you apply the optional patches in the
> 30_tux directory.

Yeah, I know. The point is I wanted to know if people were interested in
the approach. Man, nobody's capable of being a visionary anymore.

> I only had a short look, but unless I'm missing something it cannot fix
> anything in the highmem area. You are not reserving anything for highmem
> bounces in particular; you're reserving for swap/irq/atomic allocations
> in general, or for some task in particular. That is just broken design,
> I think, because you will fail anyway eventually: the layers stacked
> under those allocations give you no guarantee of making progress. The
> pool must live in the very last layer of allocations if you want any
> guarantee of eventually making progress. create_bounces() is in fact
> called as the very last thing before the atomic stuff. The loop device
> can deadlock as well, since it would need its own private pool after
> create_bounces() in the case of a non-noop transfer function (but loop
> is not a showstopper, so we haven't addressed that yet; create_bounces()
> itself is a showstopper).

You're missing a few subtle points:

	1. reservations are against a specific zone
	2. try_to_free_pages uses the swap reservation
	3. irqs can no longer eat memory allocations that are needed
	   for swap

Note that with this patch the current garbage in the zone structure with
pages_min (which doesn't work reliably) becomes obsolete.

> The only case where a reserved pool makes sense is when you know that by
> waiting (i.e. running a task queue, scheduling, and trying the
> allocation again later) you will eventually succeed for sure (possibly
> with a FIFO policy to make starvation impossible as well, not only
> deadlocks). If you don't have that guarantee, those pools automatically
> become nothing but a source-code and runtime waste, and they may even
> hide core bugs in the allocator or the like.

You're completely wrong here.

		-ben
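Point 2 refers to the pattern visible in the posted patch: try_to_free_pages() temporarily points current->page_reservations at the global swap reservation, so allocations made while reclaiming can fall back on reserved pages, then restores the caller's own pointer on the way out. A toy model of that save/restore discipline follows; the types and the `current_task` stand-in are invented for the sketch:

```c
#include <assert.h>
#include <stddef.h>

struct toy_rsv { int pages; };
struct toy_task { struct toy_rsv *page_reservations; };

static struct toy_rsv swap_rsv = { 32 };
static struct toy_task current_task;     /* stand-in for "current" */

/* Mirrors the patch: borrow swap_rsv only when the task brought no
 * reservation of its own, and always restore the original pointer on
 * the way out, whatever it was. */
static void toy_try_to_free_pages(void)
{
        struct toy_rsv *rsv = current_task.page_reservations;

        if (!rsv)
                current_task.page_reservations = &swap_rsv;

        /* ... reclaim work that may allocate runs here, able to use
         * current_task.page_reservations as a fallback ... */

        current_task.page_reservations = rsv;
}
```

The restore on exit is what keeps the scheme composable: a task that entered reclaim with its own reservation (kswapd, in the patch) leaves with it intact.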
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  1:38               ` Ben LaHaise
@ 2001-05-26  1:49                 ` Andrea Arcangeli
  2001-05-26  2:01                   ` Ben LaHaise
  0 siblings, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26  1:49 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 09:38:36PM -0400, Ben LaHaise wrote:
> You're missing a few subtle points:
>
> 	1. reservations are against a specific zone

A single zone is not used for only one thing, period. In my previous
email I laid out the only conditions under which a reserved pool can
avoid a deadlock.

> 	2. try_to_free_pages uses the swap reservation

try_to_free_pages has a huge stack of layers under it, bounce
buffers/loop/nbd/whatever being just some of them.

> 	3. irqs can no longer eat memory allocations that are needed
> 	   for swap

You don't even need an irq to still deadlock.

> Note that with this patch the current garbage in the zone structure with
> pages_min (which doesn't work reliably) becomes obsolete.

The "garbage" is just a heuristic to let atomic allocations work
dynamically in the common case. Anything deadlock-related cannot rely on
pages_min.

I am talking about fixing the thing. Of course I know perfectly well
that you can hide it, but I definitely hate that kind of hiding patch.

> > The only case where a reserved pool makes sense is when you know that
> > by waiting (i.e. running a task queue, scheduling, and trying the
> > allocation again later) you will eventually succeed for sure (possibly
> > with a FIFO policy to make starvation impossible as well, not only
> > deadlocks). If you don't have that guarantee, those pools
> > automatically become nothing but a source-code and runtime waste, and
> > they may even hide core bugs in the allocator or the like.
>
> You're completely wrong here.

I don't think so.

Andrea
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  1:49                 ` Andrea Arcangeli
@ 2001-05-26  2:01                   ` Ben LaHaise
  2001-05-26  2:26                     ` Andrea Arcangeli
  0 siblings, 1 reply; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26  2:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> On Fri, May 25, 2001 at 09:38:36PM -0400, Ben LaHaise wrote:
> > You're missing a few subtle points:
> >
> > 	1. reservations are against a specific zone
>
> A single zone is not used for only one thing, period. In my previous
> email I laid out the only conditions under which a reserved pool can
> avoid a deadlock.

Well, until we come up with a better design for a zone allocator that
doesn't involve walking lists and polluting the cache all over the
place, it'll be against a single zone.

> > 	2. try_to_free_pages uses the swap reservation
>
> try_to_free_pages has a huge stack of layers under it, bounce
> buffers/loop/nbd/whatever being just some of them.

Fine, then add one to the bounce buffer allocation code; it's all of
about 3 lines added.

> > 	3. irqs can no longer eat memory allocations that are needed
> > 	   for swap
>
> You don't even need an irq to still deadlock.

I never said you didn't. But Ingo's patch DOES NOT PROTECT AGAINST
DEADLOCKS CAUSED BY INTERRUPT ALLOCATIONS. Heck, it doesn't even fix the
deadlock that started this discussion. Think about what happens when
your network interfaces start rapidly eating all of the memory kswapd is
freeing.

Also, I haven't posted this patch previously exactly because it is
against a specific kind of deadlock that nobody cares about right now,
plus it touches the core of the page allocator, which isn't necessarily
a good idea during the 2.4 timeframe. That said, the reservation concept
is generic code, which the bounce buffer patch most certainly isn't. It
can even be improved to overlap with the page cache pages in the zone,
so it isn't even really "free" ram as currently implemented.

> > Note that with this patch the current garbage in the zone structure
> > with pages_min (which doesn't work reliably) becomes obsolete.
>
> The "garbage" is just a heuristic to let atomic allocations work
> dynamically in the common case. Anything deadlock-related cannot rely
> on pages_min.
> I am talking about fixing the thing. Of course I know perfectly well
> that you can hide it, but I definitely hate that kind of hiding patch.

Re-read the above and reconsider. The reservation doesn't need to be
permitted until after page_alloc has blocked. Heck, do a schedule before
attempting to use the reservation.

> > > The only case where a reserved pool makes sense is when you know
> > > that by waiting (i.e. running a task queue, scheduling, and trying
> > > the allocation again later) you will eventually succeed for sure
> > > (possibly with a FIFO policy to make starvation impossible as well,
> > > not only deadlocks). If you don't have that guarantee, those pools
> > > automatically become nothing but a source-code and runtime waste,
> > > and they may even hide core bugs in the allocator or the like.
> >
> > You're completely wrong here.
>
> I don't think so.

Atomicity isn't what I care about. It's about being able to keep memory
around so that certain allocations can proceed, and those pools cannot
be eaten into by other tasks. Extend the concept a little further, and
you can even reclaim higher order pages by having a reservation
outstanding on that order. But I'm just dreaming.

		-ben
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  2:01                   ` Ben LaHaise
@ 2001-05-26  2:26                     ` Andrea Arcangeli
  2001-05-26  2:40                       ` Ben LaHaise
  0 siblings, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26  2:26 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 10:01:37PM -0400, Ben LaHaise wrote:
> On Sat, 26 May 2001, Andrea Arcangeli wrote:
>
> > On Fri, May 25, 2001 at 09:38:36PM -0400, Ben LaHaise wrote:
> > > You're missing a few subtle points:
> > >
> > > 	1. reservations are against a specific zone
> >
> > A single zone is not used for only one thing, period. In my previous
> > email I laid out the only conditions under which a reserved pool can
> > avoid a deadlock.
>
> Well, until we come up with a better design for a zone allocator that
> doesn't involve walking lists and polluting the cache all over the
> place, it'll be against a single zone.

I meant that each zone is used by more than one user; that definitely
won't change.

> > > 	2. try_to_free_pages uses the swap reservation
> >
> > try_to_free_pages has a huge stack of layers under it, bounce
> > buffers/loop/nbd/whatever being just some of them.
>
> Fine, then add one to the bounce buffer allocation code; it's all of
> about 3 lines added.

Yes, you should add it to the bounce buffers, to loop, to nbd, to
whatever, and then remove it from all the other places you put it right
now. That's why I'm saying your patch, as it was in its first revision,
won't fix anything (only hide it).

> I never said you didn't. But Ingo's patch DOES NOT PROTECT AGAINST
> DEADLOCKS CAUSED BY INTERRUPT ALLOCATIONS. Heck, it doesn't even fix
> the [deadlock that started this discussion.]

It does, but only for create_bounces(). As said, if you want to "fix",
not "hide", you need to address every single case; a generic reserved
pool is just useless. Now try to get a deadlock in 2.4.5 with tasks
locked up in create_bounces(), if you say it does not protect against
irqs. See?

> That said, the reservation concept is generic code, which the bounce
> buffer patch most certainly isn't. It can even be improved to overlap

What I said is that instead of handling the pool by hand in every single
place, one could write an API to generalize it. But very often a simple
API hooked into the page allocator may not be flexible enough to
guarantee the kind of allocations we need. Highmem is just one example
where we need more flexibility, not just for the pages but also for the
bh (though OK, that's mostly an implementation issue: if you do the math
right it's harder, but you can still use a generic API).

> with the page cache pages in the zone, so it isn't even really "free"
> ram as currently implemented.

Yes, that would be a very nice property. Again, I'm not against a
generic interface for creating reserved pools of memory (I mentioned
that possibility before reading your patch, after all). It's just the
implementation (mainly the per-task hook being overwritten) that didn't
convince me, and the usage, which was at least apparently wrong to my
eyes.

> Re-read the above and reconsider. The reservation doesn't need to be
> permitted until after page_alloc has blocked. Heck, do a schedule
> before [attempting to use the reservation.]

I don't see what you mean here, could you elaborate?

> Atomicity isn't what I care about. It's about being able to keep memory
> around so that certain allocations can proceed, and those pools cannot
> be eaten into by other tasks. [..]

Those pools need to be global, unless you want to allocate them at
fork() for every single task as you did for some of the kernel threads;
and if you make them global, per-zone or per-whatever, not every single
case is covered and it will deadlock. Or rather, it will work by luck:
it proceeds until you hit a case where you needed not 32 but 33 reserved
pages to avoid the deadlock. It's along the same lines as the pci_map_*
brokenness, in some sense.

Allocating those pools per task is just a total waste; those are
"safety" pools. In 99% of cases you won't need them and will allocate
the memory dynamically; they just need to be there for correctness, to
avoid the deadlock in the rare case.

Andrea
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  2:26                     ` Andrea Arcangeli
@ 2001-05-26  2:40                       ` Ben LaHaise
  0 siblings, 0 replies; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26  2:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> It does, but only for create_bounces(). As said, if you want to "fix",
> not "hide", you need to address every single case; a generic reserved
> pool is just useless. Now try to get a deadlock in 2.4.5 with tasks
> locked up in create_bounces(), if you say it does not protect against
> irqs. See?

So? It just deadlocks in create_buffers/alloc_pages. We finished
debugging that weeks ago, and I'm not interested in repeating that.

> What I said is that instead of handling the pool by hand in every single
> place, one could write an API to generalize it. But very often a simple
> API hooked into the page allocator may not be flexible enough to
> guarantee the kind of allocations we need. Highmem is just one example
> where we need more flexibility, not just for the pages but also for the
> bh (though OK, that's mostly an implementation issue: if you do the math
> right it's harder, but you can still use a generic API).

Well, this is the infrastructure for guaranteeing atomic allocations.
The only beautifying it really needs is the addition of an alloc_pages
variant that takes the reservation pool as a parameter, instead of the
current mucking with current->page_reservations.

> > Re-read the above and reconsider. The reservation doesn't need to be
> > permitted until after page_alloc has blocked. Heck, do a schedule
> > before [attempting to use the reservation.]
>
> I don't see what you mean here, could you elaborate?

I simply meant that the reservation code wasn't bent on providing atomic
allocations for non-atomic allocators. IIRC, I hooked into __alloc_pages
after the normal mechanism of allocating pages had failed, but where it
may have already slept; I think that was part of the other patch I
posted in the last email. Our tree contains a lot of patches, and some
of the effects like this are built on top of each other, so I'm not
surprised that a critical piece like that was missing.

		-ben
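The variant Ben describes — passing the reservation explicitly rather than swapping current->page_reservations around the call — might look like the toy below: try the normal free lists first and touch the reservation only when they fail, matching his remark that the reservation need not be permitted until after the normal path has given up. Every name and the allocator itself are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

struct toy_rsv { int pages; };

static int toy_free_pages;    /* stand-in for the zone free lists */

/* Hypothetical alloc_pages-style helper taking the reservation as a
 * parameter: the reservation is a last resort, never the fast path.
 * Returns 1 on success, 0 where the kernel would sleep or fail. */
static int toy_alloc_page(struct toy_rsv *rsv)
{
        if (toy_free_pages > 0) {     /* normal path: the free lists */
                toy_free_pages--;
                return 1;
        }
        if (rsv && rsv->pages > 0) {  /* only now dip into the reservation */
                rsv->pages--;
                return 1;
        }
        return 0;
}
```

Keeping the reservation as an explicit parameter also makes the ownership question Andrea raises visible at every call site, instead of hiding it in task state.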
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  0:42         ` [with-PATCH-really] highmem deadlock removal, balancing & cleanup Andrea Arcangeli
  2001-05-26  0:52           ` Ben LaHaise
@ 2001-05-26  1:43           ` Rik van Riel
  1 sibling, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-26  1:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> Please merge this one in 2.4 for now (originally from Ingo, I only
> improved it); this is a real, definitive fix

With the only minor detail being that it DOESN'T WORK.

You're not solving the problem of GFP_BUFFER allocators looping forever
in __alloc_pages(); the deadlock can still happen. You've only solved
the one specific case of highmem.c getting a page for bounce buffers,
but you'll happily let the thing deadlock while trying to get buffer
heads for a normal low memory page!

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-25 21:12 ` Linus Torvalds
  2001-05-25 22:20   ` Rik van Riel
@ 2001-05-25 22:35   ` Rik van Riel
  2001-05-25 23:07     ` Linus Torvalds
  1 sibling, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2001-05-25 22:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:
> On Fri, 25 May 2001, Rik van Riel wrote:
> >
> > OK, shoot me. Here it is again, this time _with_ patch...
>
> I'm not going to apply this as long as it plays experimental
> games with "shrink_icache()" and friends. I haven't seen anybody
> comment on the performance on this, and I can well imagine that
> it would potentially shrink the dcache too aggressively when
> there are lots of inactive-dirty pages around, where
> page_launder is the right thing to do, and shrinking
> icache/dcache might not be.

Your analysis exactly describes what happens in your current code,
though I have to admit that my patch doesn't change it.

Without the patch my workstation (with ~180MB RAM) usually has between
50 and 80MB of inode/dentry cache, and real usable stuff gets swapped
out.

Either you can go make Linux 2.4 usable again for normal people, or you
can go buy us all a gig of ram.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-25 22:35   ` Rik van Riel
@ 2001-05-25 23:07     ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-25 23:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

On Fri, 25 May 2001, Rik van Riel wrote:
>
> Without the patch my workstation (with ~180MB RAM) usually has
> between 50 and 80MB of inode/dentry cache and real usable stuff
> gets swapped out.

All I want is more people giving feedback. It's clear that neither my
nor your machine is a good thing to base things on.

		Linus
end of thread, other threads:[~2001-06-05  3:58 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-05-25 20:00 [with-PATCH-really] highmem deadlock removal, balancing & cleanup Rik van Riel
2001-05-25 21:12 ` Linus Torvalds
2001-05-25 22:20 ` Rik van Riel
2001-05-25 23:05 ` Linus Torvalds
2001-05-25 23:13 ` Alan Cox
2001-05-25 23:19 ` Rik van Riel
2001-05-26  0:02 ` Linus Torvalds
2001-05-26  0:07 ` Rik van Riel
2001-05-26  0:16 ` Linus Torvalds
2001-05-26  0:23 ` Linus Torvalds
2001-05-26  0:26 ` Rik van Riel
2001-05-26  0:30 ` Linus Torvalds
2001-05-26  0:29 ` Ben LaHaise
2001-05-26  0:34 ` Linus Torvalds
2001-05-26  0:38 ` Rik van Riel
2001-05-26  1:28 ` Linux-2.4.5 Linus Torvalds
2001-05-26  1:35 ` Linux-2.4.5 Rik van Riel
2001-05-26  1:39 ` Linux-2.4.5 Ben LaHaise
2001-05-26  1:59 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26  2:11 ` Linux-2.4.5 Ben LaHaise
2001-05-26  2:38 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26  2:49 ` Linux-2.4.5 Ben LaHaise
2001-05-26  3:11 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26  4:22 ` Linux-2.4.5 Linus Torvalds
2001-05-26  4:31 ` Linux-2.4.5 Rik van Riel
2001-05-26  8:10 ` Linux-2.4.5 Linus Torvalds
2001-05-26  9:01 ` Linux-2.4.5 Linus Torvalds
2001-05-26  9:18 ` Linux-2.4.5 arjan
2001-05-26 14:18 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 14:21 ` Linux-2.4.5 Rik van Riel
2001-05-26 14:38 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 14:40 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:17 ` Linux-2.4.5 Linus Torvalds
2001-05-26 15:28 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:59 ` Linux-2.4.5 Linus Torvalds
2001-05-26 22:12 ` Linux-2.4.5 Marcelo Tosatti
2001-05-27  6:53 ` Linux-2.4.5 Marcelo Tosatti
2001-06-03 23:32 ` Linux-2.4.5 Linus Torvalds
2001-06-05  2:21 ` Linux-2.4.5 Marcelo Tosatti
2001-05-26 15:09 ` Linux-2.4.5 Linus Torvalds
2001-05-26 15:18 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:24 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 15:26 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:40 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26  4:45 ` Linux-2.4.5 Rik van Riel
2001-05-26  4:47 ` Linux-2.4.5 Rik van Riel
2001-05-26  6:07 ` Linux-2.4.5 Ben LaHaise
2001-05-26 14:32 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 14:36 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:03 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 15:08 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:20 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 15:41 ` Linux-2.4.5 Rik van Riel
2001-05-26  0:42 ` [with-PATCH-really] highmem deadlock removal, balancing & cleanup Andrea Arcangeli
2001-05-26  0:52 ` Ben LaHaise
2001-05-26  1:27 ` Andrea Arcangeli
2001-05-26  1:38 ` Ben LaHaise
2001-05-26  1:49 ` Andrea Arcangeli
2001-05-26  2:01 ` Ben LaHaise
2001-05-26  2:26 ` Andrea Arcangeli
2001-05-26  2:40 ` Ben LaHaise
2001-05-26  1:43 ` Rik van Riel
2001-05-25 22:35 ` Rik van Riel
2001-05-25 23:07 ` Linus Torvalds