* [with-PATCH-really] highmem deadlock removal, balancing & cleanup
@ 2001-05-25 20:00 Rik van Riel
2001-05-25 21:12 ` Linus Torvalds
0 siblings, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2001-05-25 20:00 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
OK, shoot me. Here it is again, this time _with_ patch...
---------- Forwarded message ----------
Date: Fri, 25 May 2001 16:53:38 -0300 (BRST)
From: Rik van Riel <riel@conectiva.com.br>
Hi Linus,
the following patch does:
1) Remove GFP_BUFFER and HIGHMEM related deadlocks, by letting
these allocations fail instead of looping forever in
__alloc_pages() when they cannot make any progress there.
Now Linux no longer hangs on highmem machines with heavy
write loads.
2) Clean up the __alloc_pages() / __alloc_pages_limit() code
a bit, moving the direct reclaim condition from the latter
function into the former so we run it less often ;)
3) Remove the superfluous wakeups from __alloc_pages(): not
only are the tests a real CPU eater, they also have the
potential of waking up bdflush in a situation where it
shouldn't run in the first place. The kswapd wakeup didn't
seem to have any effect either.
4) Do make sure GFP_BUFFER allocations NEVER eat into the
very last pages of the system. It is important to preserve
the following ordering:
- normal allocations
- GFP_BUFFER
- atomic allocations
- other recursive allocations
Using this ordering, we can be pretty sure that eg. a
GFP_BUFFER allocation to swap something out to an
encrypted device won't eat the memory the device driver
will need to perform its functions. It also means that
a gigabit network flood won't eat those pages...
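The ordering in point 4 amounts to a set of nested watermarks: each class of allocation may dig deeper into the zone's reserve than the one before it, so GFP_BUFFER fails before the pages that atomic and recursive (PF_MEMALLOC) allocations depend on are touched. A minimal userspace sketch of the idea (the class names are illustrative; only the GFP_BUFFER 3/4 cutoff mirrors the patch below):

```c
#include <assert.h>

/* Illustrative allocation classes, ordered by how deep into the
 * zone's reserve each one may dig. Userspace model, not kernel code. */
enum alloc_class { ALLOC_NORMAL, ALLOC_BUFFER, ALLOC_ATOMIC, ALLOC_MEMALLOC };

/* pages_min plays the role of the zone's reserve watermark. */
static int may_allocate(enum alloc_class class, int free_pages, int pages_min)
{
	switch (class) {
	case ALLOC_NORMAL:     /* ordinary allocations stop first */
		return free_pages >= pages_min;
	case ALLOC_BUFFER:     /* GFP_BUFFER may not eat the last pages */
		return free_pages >= pages_min * 3 / 4;
	case ALLOC_ATOMIC:     /* atomic allocations may dig deeper */
		return free_pages >= pages_min / 4;
	case ALLOC_MEMALLOC:   /* recursive allocations get the very last pages */
		return free_pages > 0;
	}
	return 0;
}
```

With pages_min at 32 and only 16 pages free, normal and GFP_BUFFER allocations are refused while atomic and PF_MEMALLOC allocations still succeed, which is exactly the ordering that keeps an encrypted-device driver or an interrupt path from being starved by buffer flushing.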
5) Change nr_free_buffer_pages() a bit to not return pages
which cannot be used as buffer pages, this makes a BIG
difference on highmem machines (which now DO have a working
write throttling again).
6) Simplify the refill_inactive() loop enough that it actually
works again. Calling page_launder() and shrink_i/d_memory()
under the same if condition means that the different caches
get balanced against each other again.
The "illogical, but true" argument for not shrinking the slab
cache while we're under a free shortage turned out to be just
plain illogical. All needed buffer heads will have been
allocated in page_launder() and shrink_i/d_memory() before
we get here, and we can be pretty sure that these functions
will keep re-using those same buffer heads as soon as the
IO finishes.
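Point 1 changes the contract for GFP_BUFFER callers: an allocation that previously looped forever inside __alloc_pages() can now return NULL, so the caller has to cope with failure itself. A hedged userspace sketch of that caller-side pattern, with a toy fixed pool standing in for the page allocator (all names here are illustrative, not kernel functions):

```c
#include <assert.h>
#include <stddef.h>

/* Toy fallible allocator backed by a tiny static pool, standing in
 * for a GFP_BUFFER allocation that may now return NULL. */
#define POOL_SIZE 2
static char pool[POOL_SIZE][64];
static int pool_used;

static void *alloc_buffer_page(void)
{
	if (pool_used >= POOL_SIZE)
		return NULL;            /* fail instead of looping forever */
	return pool[pool_used++];
}

/* Caller-side pattern: on failure, fall back to reusing a page whose
 * IO has completed (modeled here as the first pool slot) instead of
 * recursing into more page flushing. */
static void *get_buffer_page(void)
{
	void *page = alloc_buffer_page();
	if (page)
		return page;
	return pool[0];                 /* recycle after IO completion */
}
```

In the toy model the third request recycles slot 0 once the pool is exhausted; in the kernel the analogous fallback is re-using buffer heads freed by completed IO, as the cover letter argues.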
regards,
Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
--- linux-2.4.5-pre6/mm/page_alloc.c.orig Fri May 25 16:13:39 2001
+++ linux-2.4.5-pre6/mm/page_alloc.c Fri May 25 16:35:50 2001
@@ -251,10 +251,10 @@
water_mark = z->pages_high;
}
- if (z->free_pages + z->inactive_clean_pages > water_mark) {
+ if (z->free_pages + z->inactive_clean_pages >= water_mark) {
struct page *page = NULL;
/* If possible, reclaim a page directly. */
- if (direct_reclaim && z->free_pages < z->pages_min + 8)
+ if (direct_reclaim)
page = reclaim_page(z);
/* If that fails, fall back to rmqueue. */
if (!page)
@@ -299,21 +299,6 @@
if (order == 0 && (gfp_mask & __GFP_WAIT))
direct_reclaim = 1;
- /*
- * If we are about to get low on free pages and we also have
- * an inactive page shortage, wake up kswapd.
- */
- if (inactive_shortage() > inactive_target / 2 && free_shortage())
- wakeup_kswapd();
- /*
- * If we are about to get low on free pages and cleaning
- * the inactive_dirty pages would fix the situation,
- * wake up bdflush.
- */
- else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
- && nr_inactive_dirty_pages >= freepages.high)
- wakeup_bdflush(0);
-
try_again:
/*
* First, see if we have any zones with lots of free memory.
@@ -329,7 +314,7 @@
if (!z->size)
BUG();
- if (z->free_pages >= z->pages_low) {
+ if (z->free_pages >= z->pages_min + 8) {
page = rmqueue(z, order);
if (page)
return page;
@@ -443,18 +428,26 @@
}
/*
* When we arrive here, we are really tight on memory.
+ * Since kswapd didn't succeed in freeing pages for us,
+ * we try to help it.
+ *
+ * Single page allocs loop until the allocation succeeds.
+ * Multi-page allocs can fail due to memory fragmentation;
+ * in that case we bail out to prevent infinite loops and
+ * hanging device drivers ...
*
- * We try to free pages ourselves by:
- * - shrinking the i/d caches.
- * - reclaiming unused memory from the slab caches.
- * - swapping/syncing pages to disk (done by page_launder)
- * - moving clean pages from the inactive dirty list to
- * the inactive clean list. (done by page_launder)
+ * Another issue are GFP_BUFFER allocations; because they
+ * do not have __GFP_IO set it's possible we cannot make
+ * any progress freeing pages, in that case it's better
+ * to give up than to deadlock the kernel looping here.
*/
if (gfp_mask & __GFP_WAIT) {
memory_pressure++;
- try_to_free_pages(gfp_mask);
- goto try_again;
+ if (!order || free_shortage()) {
+ int progress = try_to_free_pages(gfp_mask);
+ if (progress || gfp_mask & __GFP_IO)
+ goto try_again;
+ }
}
}
@@ -489,6 +482,10 @@
return page;
}
+ /* Don't let GFP_BUFFER allocations eat all the memory. */
+ if (gfp_mask==GFP_BUFFER && z->free_pages < z->pages_min * 3/4)
+ continue;
+
/* XXX: is pages_min/4 a good amount to reserve for this? */
if (z->free_pages < z->pages_min / 4 &&
!(current->flags & PF_MEMALLOC))
@@ -499,7 +496,7 @@
}
/* No luck.. */
- printk(KERN_ERR "__alloc_pages: %lu-order allocation failed.\n", order);
+// printk(KERN_ERR "__alloc_pages: %lu-order allocation failed.\n", order);
return NULL;
}
@@ -578,34 +575,66 @@
}
/*
+ * Total amount of inactive_clean (allocatable) RAM in a given zone.
+ */
+#ifdef CONFIG_HIGHMEM
+unsigned int nr_free_buffer_pages_zone (int zone_type)
+{
+ pg_data_t *pgdat;
+ unsigned int sum;
+
+ sum = 0;
+ pgdat = pgdat_list;
+ while (pgdat) {
+ sum += (pgdat->node_zones+zone_type)->free_pages;
+ sum += (pgdat->node_zones+zone_type)->inactive_clean_pages;
+ sum += (pgdat->node_zones+zone_type)->inactive_dirty_pages;
+ pgdat = pgdat->node_next;
+ }
+ return sum;
+}
+#endif
+
+/*
* Amount of free RAM allocatable as buffer memory:
+ *
+ * For HIGHMEM systems don't count HIGHMEM pages.
+ * This function is still far from perfect for HIGHMEM systems, but
+ * it is close enough for the time being.
*/
unsigned int nr_free_buffer_pages (void)
{
unsigned int sum;
- sum = nr_free_pages();
- sum += nr_inactive_clean_pages();
+#ifdef CONFIG_HIGHMEM
+ sum = nr_free_buffer_pages_zone(ZONE_NORMAL) +
+ nr_free_buffer_pages_zone(ZONE_DMA);
+#else
+ sum = nr_free_pages() +
+ nr_inactive_clean_pages();
sum += nr_inactive_dirty_pages;
+#endif
/*
* Keep our write behind queue filled, even if
- * kswapd lags a bit right now.
+ * kswapd lags a bit right now. Make sure not
+ * to clog up the whole inactive_dirty list with
+ * dirty pages, though.
*/
- if (sum < freepages.high + inactive_target)
- sum = freepages.high + inactive_target;
+ if (sum < freepages.high + inactive_target / 2)
+ sum = freepages.high + inactive_target / 2;
/*
* We don't want dirty page writebehind to put too
* much pressure on the working set, but we want it
* to be possible to have some dirty pages in the
* working set without upsetting the writebehind logic.
*/
- sum += nr_active_pages >> 4;
+ sum += nr_active_pages >> 5;
return sum;
}
-#if CONFIG_HIGHMEM
+#ifdef CONFIG_HIGHMEM
unsigned int nr_free_highpages (void)
{
pg_data_t *pgdat = pgdat_list;
--- linux-2.4.5-pre6/mm/vmscan.c.orig Fri May 25 16:13:40 2001
+++ linux-2.4.5-pre6/mm/vmscan.c Fri May 25 16:13:52 2001
@@ -865,14 +865,18 @@
/*
* If we're low on free pages, move pages from the
- * inactive_dirty list to the inactive_clean list.
+ * inactive_dirty list to the inactive_clean list
+ * and shrink the inode and dentry caches.
*
* Usually bdflush will have pre-cleaned the pages
* before we get around to moving them to the other
* list, so this is a relatively cheap operation.
*/
- if (free_shortage())
+ if (free_shortage()) {
ret += page_launder(gfp_mask, user);
+ shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
+ shrink_icache_memory(DEF_PRIORITY, gfp_mask);
+ }
/*
* If needed, we move pages from the active list
@@ -882,21 +886,10 @@
ret += refill_inactive(gfp_mask, user);
/*
- * Delete pages from the inode and dentry caches and
- * reclaim unused slab cache if memory is low.
+ * If we're still short on free pages, reclaim unused
+ * slab cache memory.
*/
if (free_shortage()) {
- shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
- shrink_icache_memory(DEF_PRIORITY, gfp_mask);
- } else {
- /*
- * Illogical, but true. At least for now.
- *
- * If we're _not_ under shortage any more, we
- * reap the caches. Why? Because a noticeable
- * part of the caches are the buffer-heads,
- * which we'll want to keep if under shortage.
- */
kmem_cache_reap(gfp_mask);
}
^ permalink raw reply [flat|nested] 64+ messages in thread

* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 20:00 [with-PATCH-really] highmem deadlock removal, balancing & cleanup Rik van Riel
@ 2001-05-25 21:12 ` Linus Torvalds
2001-05-25 22:20 ` Rik van Riel
2001-05-25 22:35 ` Rik van Riel
0 siblings, 2 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-25 21:12 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel

On Fri, 25 May 2001, Rik van Riel wrote:
>
> OK, shoot me. Here it is again, this time _with_ patch...

I'm not going to apply this as long as it plays experimental games with
"shrink_icache()" and friends. I haven't seen anybody comment on the
performance on this, and I can well imagine that it would potentially
shrink the dcache too aggressively when there are lots of inactive-dirty
pages around, where page_launder is the right thing to do, and shrinking
icache/dcache might not be.

I'd really like to avoid having the MM stuff fluctuate too much. Have
people tested this under different loads etc?

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 21:12 ` Linus Torvalds
@ 2001-05-25 22:20 ` Rik van Riel
2001-05-25 23:05 ` Linus Torvalds
2001-05-25 23:13 ` Alan Cox
2001-05-25 22:35 ` Rik van Riel
1 sibling, 2 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-25 22:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:
> On Fri, 25 May 2001, Rik van Riel wrote:
> >
> > OK, shoot me. Here it is again, this time _with_ patch...
>
> I'm not going to apply this as long as it plays experimental games with
> "shrink_icache()" and friends. I haven't seen anybody comment on the
> performance on this,

Yeah, I guess the way Linux 2.2 balances things is way too
experimental ;)

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 22:20 ` Rik van Riel
@ 2001-05-25 23:05 ` Linus Torvalds
2001-05-25 23:13 ` Alan Cox
1 sibling, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-25 23:05 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel

On Fri, 25 May 2001, Rik van Riel wrote:
>
> Yeah, I guess the way Linux 2.2 balances things is way too
> experimental ;)

Ehh.. Take a look at the other differences between the VM's. Which may
make a 2.2.x approach completely bogus.

And take a look at how long the 2.2.x VM took to stabilize, and how
INCREDIBLY BAD some of those kernels were.

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 22:20 ` Rik van Riel
2001-05-25 23:05 ` Linus Torvalds
@ 2001-05-25 23:13 ` Alan Cox
2001-05-25 23:19 ` Rik van Riel
2001-05-26 0:02 ` Linus Torvalds
1 sibling, 2 replies; 64+ messages in thread
From: Alan Cox @ 2001-05-25 23:13 UTC (permalink / raw)
To: Rik van Riel; +Cc: Linus Torvalds, linux-kernel

> On Fri, 25 May 2001, Linus Torvalds wrote:
> > On Fri, 25 May 2001, Rik van Riel wrote:
> > >
> > > OK, shoot me. Here it is again, this time _with_ patch...
> >
> > I'm not going to apply this as long as it plays experimental games with
> > "shrink_icache()" and friends. I haven't seen anybody comment on the
> > performance on this,
>
> Yeah, I guess the way Linux 2.2 balances things is way too
> experimental ;)

Compared to the 2.0 performance - yes. 2.0 is faster than 2.2 is twice the
speed of 2.4 starting X and the session apps on my MediaGX box with 64Mb

But Linus is right I think - VM changes often prove 'interesting'. Test it
in -ac , gets some figures for real world use then plan further

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 23:13 ` Alan Cox
@ 2001-05-25 23:19 ` Rik van Riel
2001-05-26 0:02 ` Linus Torvalds
1 sibling, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-25 23:19 UTC (permalink / raw)
To: Alan Cox; +Cc: Linus Torvalds, linux-kernel

On Sat, 26 May 2001, Alan Cox wrote:

> But Linus is right I think - VM changes often prove
> 'interesting'. Test it in -ac , gets some figures for real world
> use then plan further

Oh well. As long as he takes the patch to page_alloc.c, otherwise
everybody _will_ have to "experiment" with the -ac kernels just to
have a system with highmem which doesn't deadlock ;)

cheers,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-25 23:13 ` Alan Cox
2001-05-25 23:19 ` Rik van Riel
@ 2001-05-26 0:02 ` Linus Torvalds
2001-05-26 0:07 ` Rik van Riel
2001-05-26 0:29 ` Ben LaHaise
1 sibling, 2 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 0:02 UTC (permalink / raw)
To: Alan Cox; +Cc: Rik van Riel, linux-kernel

On Sat, 26 May 2001, Alan Cox wrote:
>
> But Linus is right I think - VM changes often prove 'interesting'. Test it in
> -ac , gets some figures for real world use then plan further

.. on the other hand, thinking more about this, I'd rather be called
"stupid" than "stodgy". So I think I'll buy some experimentation.

That HIGHMEM change is too ugly to live, though, I'd really like to hear
more about why something that ugly is necessary.

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:02 ` Linus Torvalds
@ 2001-05-26 0:07 ` Rik van Riel
2001-05-26 0:16 ` Linus Torvalds
2001-05-26 0:23 ` Linus Torvalds
1 sibling, 2 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-26 0:07 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:

> So I think I'll buy some experimentation. That HIGHMEM change is
> too ugly to live, though, I'd really like to hear more about why
> something that ugly is necessary.

If you mean the "GFP_BUFFER allocations should fail instead
of looping forever" thing, it is because:

1) GFP_BUFFER allocations are made in order to try to flush
(and free) pages and allocate highmem pages.

2) This is the page flushing equivalent of PF_MEMALLOC, in the
sense that we should not go and try to "recursively" flush more
random pages until we find one that succeeds without an
allocation; instead, we should just break the loop and let the
caller deal with it.

If you mean the change to nr_free_buffer_pages(), that one is
needed because that function should return what the name implies
... returning "2 GB of memory available for dirty buffers" makes
for a completely filled up ZONE_NORMAL and processes which never
get throttled for writing (and instead, end up looping for more
memory and killing performance for the rest of the system).

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 64+ messages in thread
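Rik's second point can be illustrated with a small userspace model: if nr_free_buffer_pages() counts highmem pages, the "memory available for dirty buffers" figure is inflated far beyond what ZONE_NORMAL can actually hold, so write throttling never triggers. A hedged sketch with purely illustrative numbers (the struct and ratio below are not kernel code):

```c
#include <assert.h>

/* Userspace model of per-zone page counts; numbers are illustrative. */
struct zone_model {
	int free_pages;
	int inactive_clean_pages;
	int inactive_dirty_pages;
	int buffer_capable;       /* 0 for highmem-style zones */
};

/* Count only pages that can actually hold buffer memory, as the
 * patched nr_free_buffer_pages() does on CONFIG_HIGHMEM systems. */
static int nr_free_buffer_pages_model(const struct zone_model *zones, int n)
{
	int sum = 0;
	for (int i = 0; i < n; i++) {
		if (!zones[i].buffer_capable)
			continue;
		sum += zones[i].free_pages;
		sum += zones[i].inactive_clean_pages;
		sum += zones[i].inactive_dirty_pages;
	}
	return sum;
}

/* Throttle writers once dirty buffers approach what buffer-capable
 * memory can hold; the 1/2 ratio is an illustrative stand-in for
 * bdflush's threshold logic. */
static int should_throttle(int dirty_pages, int buffer_pages)
{
	return dirty_pages * 2 > buffer_pages;
}
```

With a 200-page buffer-capable zone and a 4000-page highmem-style zone, counting only the former throttles a writer at 150 dirty pages, while counting both would let dirty data fill ZONE_NORMAL completely before throttling ever started.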
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:07 ` Rik van Riel
@ 2001-05-26 0:16 ` Linus Torvalds
2001-05-26 0:23 ` Linus Torvalds
1 sibling, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 0:16 UTC (permalink / raw)
To: Rik van Riel; +Cc: Alan Cox, linux-kernel

On Fri, 25 May 2001, Rik van Riel wrote:
> On Fri, 25 May 2001, Linus Torvalds wrote:
>
> > So I think I'll buy some experimentation. That HIGHMEM change is
> > too ugly to live, though, I'd really like to hear more about why
> > something that ugly is necessary.
>
> If you mean the "GFP_BUFFER allocations should fail instead
> of looping forever" thing, it is because:

No, I was thinking more of the dirty buffer balancing thing. It seems to
have this hardcoded notion of "DMA + NORMAL", which is just wrong. There
could be more zones that are acceptable to buffers, so what it _should_
do is to just walk the zone list that GFP_BUFFER points to.

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:07 ` Rik van Riel
2001-05-26 0:16 ` Linus Torvalds
@ 2001-05-26 0:23 ` Linus Torvalds
2001-05-26 0:26 ` Rik van Riel
1 sibling, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 0:23 UTC (permalink / raw)
To: Rik van Riel; +Cc: Alan Cox, linux-kernel

Oh, also: the logic behind the change of the kmem_cache_reap() - instead
of making it conditional on the _reverse_ test of what it has historically
been, why isn't it just completely unconditional? You've basically
dismissed the only valid reason for it to have been (illogically)
conditional, so I'd have expected that just _removing_ the test is better
than reversing it like your patch does..

No?

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:23 ` Linus Torvalds
@ 2001-05-26 0:26 ` Rik van Riel
2001-05-26 0:30 ` Linus Torvalds
0 siblings, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2001-05-26 0:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:

> Oh, also: the logic behind the change of the kmem_cache_reap() - instead
> of making it conditional on the _reverse_ test of what it has historically
> been, why isn't it just completely unconditional? You've basically
> dismissed the only valid reason for it to have been (illogically)
> conditional, so I'd have expected that just _removing_ the test is better
> than reversing it like your patch does..
>
> No?

The function do_try_to_free_pages() also gets called when we're
only short on inactive pages, but we still have TONS of free
memory. In that case, I don't think we'd actually want to steal
free memory from anyone.

Moving it into the same if() conditional the other memory freeers
are in would make sense, though ...

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:26 ` Rik van Riel
@ 2001-05-26 0:30 ` Linus Torvalds
0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 0:30 UTC (permalink / raw)
To: Rik van Riel; +Cc: Alan Cox, linux-kernel

On Fri, 25 May 2001, Rik van Riel wrote:
>
> The function do_try_to_free_pages() also gets called when we're
> only short on inactive pages, but we still have TONS of free
> memory. In that case, I don't think we'd actually want to steal
> free memory from anyone.

Well, kmem_cache_reap() doesn't even steal memory from anybody: it just
makes this "tagged for xxx" memory be available to "non-xxx" users too.

And the fact that we're calling do_try_to_free_pages() at all obviously
implies that even if we have memory free, it isn't in the right hands..

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:02 ` Linus Torvalds
2001-05-26 0:07 ` Rik van Riel
@ 2001-05-26 0:29 ` Ben LaHaise
2001-05-26 0:34 ` Linus Torvalds
2001-05-26 0:42 ` [with-PATCH-really] highmem deadlock removal, balancing & cleanup Andrea Arcangeli
1 sibling, 2 replies; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26 0:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, Rik van Riel, linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:

> So I think I'll buy some experimentation. That HIGHMEM change is too ugly
> to live, though, I'd really like to hear more about why something that
> ugly is necessary.

Highmem systems currently manage to hang themselves quite completely upon
running out of memory in the normal zone. One of the failure modes is
looping in __alloc_pages from get_unused_buffer_head to map a dirty page.
Another results in looping on allocation of a bounce page for writing a
dirty highmem page back to disk.

Also, note that the current highmem design for bounce buffers is
inherently unstable. Because all highmem pages that are currently in
flight must have bounce buffers allocated for them, we require a huge
amount of bounce buffers to guarentee progress while submitting io. The
-ac kernels have a patch from Ingo that provides private pools for bounce
buffers and buffer_heads. I went a step further and have a memory
reservation patch that provides for memory pools being reserved against a
particular zone. This is needed to prevent the starvation that irq
allocations can cause.

Some of these cleanups are 2.5 fodder, but we really need something in
2.4 right now, so...

-ben

^ permalink raw reply [flat|nested] 64+ messages in thread
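The private-pool approach Ben describes (which resembles what later became mempool_t) can be sketched in userspace: allocations first try the ordinary allocator and dip into a pre-reserved set of objects only when it fails, so bounce-buffer IO can always make forward progress. Everything below is an illustrative model, not the -ac implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal reservation-pool sketch: a fixed number of objects set
 * aside up front so allocation can always make progress. */
#define RESERVE 4

struct reserve_pool {
	void *obj[RESERVE];
	int navail;
};

static int pool_init(struct reserve_pool *p, size_t objsize)
{
	p->navail = 0;
	for (int i = 0; i < RESERVE; i++) {
		p->obj[i] = malloc(objsize);
		if (!p->obj[i])
			return -1;
		p->navail++;
	}
	return 0;
}

/* Try the ordinary allocator first; dip into the reserve only when
 * it fails, mirroring how a bounce-buffer pool backstops the page
 * allocator. try_alloc may be NULL to model total allocator failure. */
static void *pool_alloc(struct reserve_pool *p,
			void *(*try_alloc)(size_t), size_t objsize)
{
	void *obj = try_alloc ? try_alloc(objsize) : NULL;
	if (obj)
		return obj;
	if (p->navail > 0)
		return p->obj[--p->navail];
	return NULL;            /* caller must wait for a free */
}

static void pool_free(struct reserve_pool *p, void *obj)
{
	/* Refill the reserve first; only then give memory back. */
	if (p->navail < RESERVE)
		p->obj[p->navail++] = obj;
	else
		free(obj);
}
```

The key property is that frees refill the reserve before returning memory to the general allocator, so as long as in-flight IO eventually completes, the pool never stays empty.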
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:29 ` Ben LaHaise
@ 2001-05-26 0:34 ` Linus Torvalds
2001-05-26 0:38 ` Rik van Riel
2001-05-26 1:28 ` Linux-2.4.5 Linus Torvalds
1 sibling, 2 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 0:34 UTC (permalink / raw)
To: Ben LaHaise; +Cc: Alan Cox, Rik van Riel, linux-kernel

On Fri, 25 May 2001, Ben LaHaise wrote:
>
> Highmem systems currently manage to hang themselves quite completely upon
> running out of memory in the normal zone. One of the failure modes is
> looping in __alloc_pages from get_unused_buffer_head to map a dirty page.
> Another results in looping on allocation of a bounce page for writing a
> dirty highmem page back to disk.

That's not the part of the patch I object to - fixing that is fine.

What I object to it that it special-cases the zone names, even though that
doesn't necessarily make any sense at all. What about architectures that
have other zones? THAT is the kind of fundmanental design mistake that
special-casing DMA and NORMAL is horrible for.

alloc_pages() doesn't have that kind of problem. To alloc_pages(),
GFP_BUFFER is not "oh, DMA or NORMAL". There, it is simply "oh, use the
zonelist pointed to by GFP_BUFFER". No special casing, no stupid #ifdef
CONFIG_HIGHMEM.

THAT is what I object to.

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
2001-05-26 0:34 ` Linus Torvalds
@ 2001-05-26 0:38 ` Rik van Riel
0 siblings, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-26 0:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Ben LaHaise, Alan Cox, linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:

> That's not the part of the patch I object to - fixing that is fine.
>
> What I object to it that it special-cases the zone names, even
> though that doesn't necessarily make any sense at all.

OK, I'll fix that part. Maybe even this weekend, before I
go on holidays ;)

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 64+ messages in thread
* Linux-2.4.5
2001-05-26 0:34 ` Linus Torvalds
2001-05-26 0:38 ` Rik van Riel
@ 2001-05-26 1:28 ` Linus Torvalds
2001-05-26 1:35 ` Linux-2.4.5 Rik van Riel
2001-05-26 1:39 ` Linux-2.4.5 Ben LaHaise
1 sibling, 2 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 1:28 UTC (permalink / raw)
To: Alan Cox, Andrea Arcangeli, Ben LaHaise; +Cc: Rik van Riel, linux-kernel

Ok, I applied Andrea's (nee Ingo's) version, as that one most clearly
attacked the real deadlock cause. It's there as 2.4.5 now.

I'm going to be gone in Japan for the next week (leaving tomorrow
morning), so please don't send me patches - I won't be able to react to
them anyway. Consider the -ac series and the kernel mailing list the
regular communications channels..

Thanks,

Linus

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5
2001-05-26 1:28 ` Linux-2.4.5 Linus Torvalds
@ 2001-05-26 1:35 ` Rik van Riel
2001-05-26 1:39 ` Linux-2.4.5 Ben LaHaise
0 siblings, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-26 1:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, Andrea Arcangeli, Ben LaHaise, linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:

> Ok, I applied Andrea's (nee Ingo's) version, as that one most clearly
> attacked the real deadlock cause. It's there as 2.4.5 now.

But only for highmem bounce buffers. Normal GFP_BUFFER
allocations can still deadlock.

> I'm going to be gone in Japan for the next week

Oh well, I guess people can always run the -ac kernel ;)

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5
2001-05-26 1:28 ` Linux-2.4.5 Linus Torvalds
2001-05-26 1:35 ` Linux-2.4.5 Rik van Riel
@ 2001-05-26 1:39 ` Ben LaHaise
2001-05-26 1:59 ` Linux-2.4.5 Andrea Arcangeli
1 sibling, 1 reply; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26 1:39 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, Andrea Arcangeli, Rik van Riel, linux-kernel

Sorry, this doesn't fix the problem. It still hangs on highmem machines.
Try running cerberus on a PAE kernel sometime.

-ben

On Fri, 25 May 2001, Linus Torvalds wrote:

> Ok, I applied Andrea's (nee Ingo's) version, as that one most clearly
> attacked the real deadlock cause. It's there as 2.4.5 now.
>
> I'm going to be gone in Japan for the next week (leaving tomorrow
> morning), so please don't send me patches - I won't be able to react to
> them anyway. Consider the -ac series and the kernel mailing list the
> regular communications channels..
>
> Thanks,
>
> Linus
>

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5
2001-05-26 1:39 ` Linux-2.4.5 Ben LaHaise
@ 2001-05-26 1:59 ` Andrea Arcangeli
2001-05-26 2:11 ` Linux-2.4.5 Ben LaHaise
0 siblings, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26 1:59 UTC (permalink / raw)
To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 09:39:36PM -0400, Ben LaHaise wrote:
> Sorry, this doesn't fix the problem. It still hangs on highmem machines.
> Try running cerberus on a PAE kernel sometime.

There can be more bugs of course, two patches I posted are only meant to
fix deadlocks in the allocation fail path of alloc_bounces() and the
second patch in getblk() allocation fail path, nothing more. Those are
strictly necessary fixes as far I can tell, and their implementation was
quite obviously right to my eyes.

Now if you send some debugging info with deadlocks you gets with 2.4.5
vanilla I will be certainly interested to found their source. Also Rik
just said to have a fix for other bugs in that area, I didn't checked
that part.

What I can say is that with my tree I didn't reproduced deadlocks highmem
related with cerberus but I'm using highmem emulation not real highmem.

Andrea

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5
2001-05-26 1:59 ` Linux-2.4.5 Andrea Arcangeli
@ 2001-05-26 2:11 ` Ben LaHaise
2001-05-26 2:38 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 15:41 ` Linux-2.4.5 Rik van Riel
0 siblings, 2 replies; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26 2:11 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> Now if you send some debugging info with deadlocks you gets with 2.4.5
> vanilla I will be certainly interested to found their source. Also Rik
> just said to have a fix for other bugs in that area, I didn't checked
> that part.

In the red hat tree, we fixed the problem by adding __GFP_FAIL to
GFP_BUFFER, as well as adding a yield to alloc_pages. Think of what
happens when you try to allocate a buffer_head to swap out a page when
you're out of memory. See below.

-ben

diff -ur v2.4.4-ac9/include/linux/mm.h work/include/linux/mm.h
--- v2.4.4-ac9/include/linux/mm.h Mon May 14 15:22:17 2001
+++ work/include/linux/mm.h Mon May 14 18:33:21 2001
@@ -528,7 +528,7 @@

 #define GFP_BOUNCE (__GFP_HIGH | __GFP_FAIL)
-#define GFP_BUFFER (__GFP_HIGH | __GFP_WAIT)
+#define GFP_BUFFER (__GFP_HIGH | __GFP_FAIL | __GFP_WAIT)
 #define GFP_ATOMIC (__GFP_HIGH)
 #define GFP_USER ( __GFP_WAIT | __GFP_IO)
 #define GFP_HIGHUSER ( __GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
diff -ur v2.4.4-ac9/mm/vmscan.c work/mm/vmscan.c
--- v2.4.4-ac9/mm/vmscan.c Mon May 14 14:57:00 2001
+++ work/mm/vmscan.c Mon May 14 16:43:05 2001
@@ -636,6 +636,12 @@
 	 */
 	shortage = free_shortage();
 	if (can_get_io_locks && !launder_loop && shortage) {
+		if (gfp_mask & __GFP_WAIT) {
+			__set_current_state(TASK_RUNNING);
+			current->policy |= SCHED_YIELD;
+			schedule();
+		}
+
 		launder_loop = 1;

 		/*

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 2:11 ` Linux-2.4.5 Ben LaHaise @ 2001-05-26 2:38 ` Andrea Arcangeli 2001-05-26 2:49 ` Linux-2.4.5 Ben LaHaise 2001-05-26 15:41 ` Linux-2.4.5 Rik van Riel 1 sibling, 1 reply; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 2:38 UTC (permalink / raw) To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel On Fri, May 25, 2001 at 10:11:40PM -0400, Ben LaHaise wrote: > On Sat, 26 May 2001, Andrea Arcangeli wrote: > > > Now if you send some debugging info with deadlocks you gets with 2.4.5 > > vanilla I will be certainly interested to found their source. Also Rik > > just said to have a fix for other bugs in that area, I didn't checked > > that part. > > In the red hat tree, we fixed the problem by adding __GFP_FAIL to > GFP_BUFFER, as well as adding a yield to alloc_pages. Think of what > happens when you try to allocate a buffer_head to swap out a page when > you're out of memory. See below. Allocating a buffer head during out of memory is always been deadlock prone, 2.2 always had the same problem too. I'm not sure why you are telling me this, I didn't changed anything that happens to be in the swapout path (besides removing the deadlock in create_bounces with evolution of first Ingo's patch but that is not specific to the swapout). I only changed the getblk path (which is not used by the swapout, at least unless you swapout on a file not on a blkdev, but even in that case the change is fine). About yelding interally to alloc_pages it would make sense if we had a private pool internally to the allocator, otherwise with current code you definitely want to return a faliure fast to the caller so the caller will try the private pool before the yield. btw in the below patch __GFP_FAIL is a noop. 
>
> 		-ben
>
> diff -ur v2.4.4-ac9/include/linux/mm.h work/include/linux/mm.h
> --- v2.4.4-ac9/include/linux/mm.h	Mon May 14 15:22:17 2001
> +++ work/include/linux/mm.h	Mon May 14 18:33:21 2001
> @@ -528,7 +528,7 @@
>
>  #define GFP_BOUNCE	(__GFP_HIGH | __GFP_FAIL)
> -#define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT)
> +#define GFP_BUFFER	(__GFP_HIGH | __GFP_FAIL | __GFP_WAIT)
>  #define GFP_ATOMIC	(__GFP_HIGH)
>  #define GFP_USER	( __GFP_WAIT | __GFP_IO)
>  #define GFP_HIGHUSER	( __GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
> diff -ur v2.4.4-ac9/mm/vmscan.c work/mm/vmscan.c
> --- v2.4.4-ac9/mm/vmscan.c	Mon May 14 14:57:00 2001
> +++ work/mm/vmscan.c	Mon May 14 16:43:05 2001
> @@ -636,6 +636,12 @@
>  	 */
>  	shortage = free_shortage();
>  	if (can_get_io_locks && !launder_loop && shortage) {
> +		if (gfp_mask & __GFP_WAIT) {
> +			__set_current_state(TASK_RUNNING);
> +			current->policy |= SCHED_YIELD;
> +			schedule();
> +		}
> +
>  		launder_loop = 1;
>
>  		/*

Andrea

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 2:38 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 2:49 ` Ben LaHaise 2001-05-26 3:11 ` Linux-2.4.5 Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26 2:49 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> Allocating a buffer head while out of memory has always been deadlock
> prone; 2.2 always had the same problem too.  I'm not sure why you are
> telling me this: I didn't change anything that happens to be in the
> swapout path (besides removing the deadlock in create_bounces with an
> evolution of Ingo's first patch, but that is not specific to the
> swapout).  I only changed the getblk path (which is not used by the
> swapout, at least unless you swap out to a file rather than a blkdev,
> but even in that case the change is fine).

Highmem.  0 free pages in ZONE_NORMAL.  Now try to allocate a
buffer_head.  Running under heavy load runs into this even after there
is a highmem bounce buffer pool.

> btw in the below patch __GFP_FAIL is a noop.

<shrug> Merge that patch from -ac then.

		-ben

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 2:49 ` Linux-2.4.5 Ben LaHaise @ 2001-05-26 3:11 ` Andrea Arcangeli 2001-05-26 4:22 ` Linux-2.4.5 Linus Torvalds 2001-05-26 4:45 ` Linux-2.4.5 Rik van Riel 0 siblings, 2 replies; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26 3:11 UTC (permalink / raw)
To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 10:49:38PM -0400, Ben LaHaise wrote:
> Highmem.  0 free pages in ZONE_NORMAL.  Now try to allocate a buffer_head.

That's a longstanding deadlock; it was there the first time I read
fs/buffer.c.  It has nothing to do with highmem: we have it in 2.2 too.
getblk is deadlock prone in a similar manner as well.

Can you try to simply change NR_RESERVED to, say, 200*MAX_BUF_PER_PAGE
and see if it makes a difference?  The unused_list logic doesn't give a
guarantee either, and it's one of the "hiding" logics, but it usually
worked pretty well.  Maybe something changed that needs more than
2 pages (16 bh) reserved?

Andrea

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 3:11 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 4:22 ` Linus Torvalds 2001-05-26 4:31 ` Linux-2.4.5 Rik van Riel ` (2 more replies) 2001-05-26 4:45 ` Linux-2.4.5 Rik van Riel 1 sibling, 3 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 4:22 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Ben LaHaise, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:
>
> On Fri, May 25, 2001 at 10:49:38PM -0400, Ben LaHaise wrote:
> > Highmem.  0 free pages in ZONE_NORMAL.  Now try to allocate a buffer_head.
>
> That's a longstanding deadlock, it was there the first time I read
> fs/buffer.c, nothing related to highmem, we have it in 2.2 too.  Also
> getblk is deadlock prone in a similar manner.

Indeed.

This is why we always leave a few pages free: exactly to allow nested
page allocators to steal the reserved pages that we keep around.  If
that deadlocks, then that's a separate issue altogether.

If people are able to trigger the "we run out of reserved pages"
behaviour under any load, that indicates that we either have too few
reserved pages per zone, or that we have a real thinko somewhere that
allows eating up the reserves we're supposed to have.

And yes, things like spraying the box really hard with network packets
can temporarily eat up the reserves, but that's why kswapd and friends
exist: to get them back.  But I could easily imagine that there are
more schedule points missing that could cause user mode to not get to
run much.  Andrea fixed one, I think.

If there are more situations like this, please get a stack trace on the
deadlock (fairly easy these days with ctrl-scrolllock), and let's think
about what makes for the nasty pattern in the first place instead of
trying to add more "reserved" pages.

For example, maybe we can use HIGHMEM pages more aggressively in some
places, to avoid the case where we're only freeing HIGHMEM pages and
making the non-HIGHMEM case just worse.
For example, we used to have logic in swapout_process to _not_ swap out zones that don't need it. We changed swapout to happen in "page_launder()", but that logic got lost. It's entirely possible that we should just say "don't bother writing out dirty pages that are in zones that have no memory pressure", so that we don't use up pages from the _precious_ zones to free pages in zones that don't need freeing. So don't try to paper over this by making up new rules. We should think about WHY the problem happens in the first place, not about trying to fix it once it has happened (and see the above as an example of why it might be happening). But sometimes the right solution is just to have more reserves. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:22 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 4:31 ` Rik van Riel 2001-05-26 8:10 ` Linux-2.4.5 Linus Torvalds 2001-05-26 9:18 ` Linux-2.4.5 arjan 2001-05-26 14:18 ` Linux-2.4.5 Andrea Arcangeli 2 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 4:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel On Fri, 25 May 2001, Linus Torvalds wrote: > This is why we always leave a few pages free, exactly to allow nested > page allocators to steal the reserved pages that we keep around. If > that deadlocks, then that's a separate issue altogether. > > If people are able to trigger the "we run out of reserved pages" > behaviour under any load, that indicates that we either have too few > reserved pages per zone, or that we have a real thinko somewhere that > allows eating up the reserves we're supposed to have. This is exactly what gets fixed in the patch I sent you. Feel free to reimplement it in a complex way if you want ;) > But sometimes the right solution is just to have more reserves. Won't work if the ethernet card is allocating memory at gigabit speed. And no, my patch won't protect against this thing either, only memory reservations can. All my patch does is give us a 2.4 kernel now which doesn't hang immediately as soon as you run on highmem machines with a heavy swapping load. regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:31 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 8:10 ` Linus Torvalds 2001-05-26 9:01 ` Linux-2.4.5 Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 8:10 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Rik van Riel wrote:
>
> This is exactly what gets fixed in the patch I sent you.

This is not AT ALL what your patch does.  99% of your patch does other
things, including playing games with the "balance dirty" threshold
etc., in ways that make it look very much like the standard dbench kind
of load cannot use HIGHMEM for dirty buffers very effectively.

There's _one_ line of your patch that special-cases GFP_BUFFER in
page_alloc and protects the reserves, but the point is that they
shouldn't need protecting: there's something else wrong that makes them
get depleted in the first place.

Did you read my email?

		Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 8:10 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 9:01 ` Linus Torvalds 0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 9:01 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Linus Torvalds wrote:
>
> There's _one_ line of your patch that special-cases GFP_BUFFER in
> page_alloc and protects the reserves, but the point is that they
> shouldn't need protecting: there's something else wrong that makes them
> get depleted in the first place.

In fact, there seems to be a major confusion about the use of
GFP_BUFFER around the kernel.  The kernel uses GFP_BUFFER in a few
places:

 - grow_buffers(), as called from bread() and friends.  Here it is
   correct: we use GFP_BUFFER because we must not cause a deadlock on
   various filesystem data structures, and when a filesystem does a
   bread() we'd better _never_ cause a writeout that could cause some
   nasty FS lock self-deadlock.

 - creating buffer heads on pages for the page cache uses SLAB_BUFFER.
   This is bogus.  It should use SLAB_KERNEL here, because this is not
   called by low-level filesystems, it's called by the VM layer, and we
   don't hold any magic locks or anything like that.

The code actually has some comments to this effect, but it confuses the
issue of "async" and "sync", and that confusion makes the code (a) not
dare do the right thing (there's an attempt that is #ifdef'ed out), and
(b) even the right thing is kind of confused.

I'm leaving for Japan, so this is the last I'll write on this, but as a
request-for-discussion, what about this patch?  Does it help?  It
should decrease the GFP_BUFFER pressure noticeably (yeah, I removed too
much of the error handling code, but it shouldn't be all that
noticeable when using SLAB_KERNEL, as the allocations should succeed
until the machine is _so_ low on memory that it is truly dead).
Now, this is obviously untested and does a bit too much surgery, but I
really think that the reason for the deadlock is that buffer allocation
itself does things wrong, not so much that we should try to keep
infinite reserves.

Anybody want to play with these kinds of approaches?  Fixing the
underlying problems instead of trying to hide them with special reserve
cases...

		Linus

-----
--- v2.4.5/linux/fs/buffer.c	Fri May 25 18:28:55 2001
+++ linux/fs/buffer.c	Sat May 26 01:52:43 2001
@@ -1206,7 +1206,7 @@
  * no-buffer-head deadlock.  Return NULL on failure; waiting for
  * buffer heads is now handled in create_buffers().
  */
-static struct buffer_head * get_unused_buffer_head(int async)
+static struct buffer_head * get_unused_buffer_head(int can_do_io)
 {
 	struct buffer_head * bh;

@@ -1220,46 +1220,7 @@
 	}
 	spin_unlock(&unused_list_lock);

-	/* This is critical.  We can't swap out pages to get
-	 * more buffer heads, because the swap-out may need
-	 * more buffer-heads itself.  Thus SLAB_BUFFER.
-	 */
-	if((bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER)) != NULL) {
-		bh->b_blocknr = -1;
-		bh->b_this_page = NULL;
-		return bh;
-	}
-
-	/*
-	 * If we need an async buffer, use the reserved buffer heads.
-	 */
-	if (async) {
-		spin_lock(&unused_list_lock);
-		if (unused_list) {
-			bh = unused_list;
-			unused_list = bh->b_next_free;
-			nr_unused_buffer_heads--;
-			spin_unlock(&unused_list_lock);
-			return bh;
-		}
-		spin_unlock(&unused_list_lock);
-	}
-#if 0
-	/*
-	 * (Pending further analysis ...)
-	 * Ordinary (non-async) requests can use a different memory priority
-	 * to free up pages. Any swapping thus generated will use async
-	 * buffer heads.
-	 */
-	if(!async &&
-	   (bh = kmem_cache_alloc(bh_cachep, SLAB_KERNEL)) != NULL) {
-		memset(bh, 0, sizeof(*bh));
-		init_waitqueue_head(&bh->b_wait);
-		return bh;
-	}
-#endif
-
-	return NULL;
+	return kmem_cache_alloc(bh_cachep, can_do_io ? SLAB_KERNEL : SLAB_BUFFER);
 }

 void set_bh_page (struct buffer_head *bh, struct page *page, unsigned long offset)
@@ -1285,16 +1246,16 @@
  * from ordinary buffer allocations, and only async requests are allowed
  * to sleep waiting for buffer heads.
  */
-static struct buffer_head * create_buffers(struct page * page, unsigned long size, int async)
+static struct buffer_head * create_buffers(struct page * page, unsigned long size, int can_do_io)
 {
 	struct buffer_head *bh, *head;
 	long offset;

-try_again:
 	head = NULL;
 	offset = PAGE_SIZE;
 	while ((offset -= size) >= 0) {
-		bh = get_unused_buffer_head(async);
+		bh = get_unused_buffer_head(can_do_io);
+		/* Can return failure in the "!can_do_io" case */
 		if (!bh)
 			goto no_grow;

@@ -1331,29 +1292,7 @@
 		wake_up(&buffer_wait);
 	}

-	/*
-	 * Return failure for non-async IO requests.  Async IO requests
-	 * are not allowed to fail, so we have to wait until buffer heads
-	 * become available.  But we don't want tasks sleeping with
-	 * partially complete buffers, so all were released above.
-	 */
-	if (!async)
-		return NULL;
-
-	/* We're _really_ low on memory. Now we just
-	 * wait for old buffer heads to become free due to
-	 * finishing IO.  Since this is an async request and
-	 * the reserve list is empty, we're sure there are
-	 * async buffer heads in use.
-	 */
-	run_task_queue(&tq_disk);
-
-	/*
-	 * Set our state for sleeping, then check again for buffer heads.
-	 * This ensures we won't miss a wake_up from an interrupt.
-	 */
-	wait_event(buffer_wait, nr_unused_buffer_heads >= MAX_BUF_PER_PAGE);
-	goto try_again;
+	return NULL;
 }

 static void unmap_buffer(struct buffer_head * bh)
@@ -1425,7 +1364,6 @@
 {
 	struct buffer_head *bh, *head, *tail;

-	/* FIXME: create_buffers should fail if there's no enough memory */
 	head = create_buffers(page, blocksize, 1);
 	if (page->buffers)
 		BUG();

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:22 ` Linux-2.4.5 Linus Torvalds 2001-05-26 4:31 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 9:18 ` arjan 2001-05-26 14:18 ` Linux-2.4.5 Andrea Arcangeli 2 siblings, 0 replies; 64+ messages in thread From: arjan @ 2001-05-26 9:18 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel In article <Pine.LNX.4.21.0105252107010.1520-100000@penguin.transmeta.com> you wrote: > If there are more situations like this, please get a stack trace on the > deadlock (fairly easy these days with ctrl-scrolllock), and let's think > about what make for the nasty pattern in the first place instead of trying > to add more "reserved" pages. What I've been seeing so far is that, say, 16 processes all need memory, and start to write out stuff. All only half succeed, eg they allocate memory for the writeout but don't actually get enough and sleep for more. Now that isn't too much of a problem UNTIL kswapd is one of those..... ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:22 ` Linux-2.4.5 Linus Torvalds 2001-05-26 4:31 ` Linux-2.4.5 Rik van Riel 2001-05-26 9:18 ` Linux-2.4.5 arjan @ 2001-05-26 14:18 ` Andrea Arcangeli 2001-05-26 14:21 ` Linux-2.4.5 Rik van Riel 2 siblings, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26 14:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Ben LaHaise, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 09:22:05PM -0700, Linus Torvalds wrote:
> per zone, or that we have a real thinko somewhere that allows eating up

I think I found the real thinko behind the other create_buffers
deadlock experienced by Ben.  I suspect this fix is enough to address
the deadlock; this was a real bug, not just a hiding factor (reserved
bh's aren't released by irqs, the bounces free memory, memory balancing
stops, no bh is released, and we deadlock in wait_event).  I also
increased the number of reserved bh's a bit, just in case:

--- 2.4.5pre6aa1/fs/buffer.c.~1~	Fri May 25 04:57:46 2001
+++ 2.4.5pre6aa1/fs/buffer.c	Sat May 26 16:15:03 2001
@@ -61,7 +61,7 @@
 #define BUFSIZE_INDEX(X) ((int) buffersize_index[(X)>>9])
 #define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)

-#define NR_RESERVED (2*MAX_BUF_PER_PAGE)
+#define NR_RESERVED (10*MAX_BUF_PER_PAGE)
 #define MAX_UNUSED_BUFFERS NR_RESERVED+20 /* don't ever have more than this
					     number of unused buffer heads */
@@ -1416,11 +1416,9 @@
 	 */
 	run_task_queue(&tq_disk);

-	/*
-	 * Set our state for sleeping, then check again for buffer heads.
-	 * This ensures we won't miss a wake_up from an interrupt.
-	 */
-	wait_event(buffer_wait, nr_unused_buffer_heads >= MAX_BUF_PER_PAGE);
+	current->policy |= SCHED_YIELD;
+	__set_current_state(TASK_RUNNING);
+	schedule();
 	goto try_again;
 }

Please, people with >1G machines: try to reproduce the deadlock with
cerberus on top of 2.4.5 plus the above patch.
I didn't check the other alloc_pages() thing mentioned by Ben; if
alloc_pages() deadlocks internally, that's yet another completely
orthogonal bug, and it will be addressed by another patch if it
persists.

Andrea

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:18 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 14:21 ` Rik van Riel 2001-05-26 14:38 ` Linux-2.4.5 Andrea Arcangeli 2001-05-26 15:09 ` Linux-2.4.5 Linus Torvalds 0 siblings, 2 replies; 64+ messages in thread From: Rik van Riel @ 2001-05-26 14:21 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > I didn't checked the alloc_pages() other thing mentioned by Ben, if > alloc_pages() deadlocks internally that's yet another completly > orthogonal bug and that will be addressed by another patch if it > persists. O, that part is fixed by the patch that Linus threw away yesterday ;) Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:21 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 14:38 ` Andrea Arcangeli 2001-05-26 14:40 ` Linux-2.4.5 Rik van Riel 2001-05-26 15:09 ` Linux-2.4.5 Linus Torvalds 1 sibling, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26 14:38 UTC (permalink / raw)
To: Rik van Riel; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel

On Sat, May 26, 2001 at 11:21:18AM -0300, Rik van Riel wrote:
> On Sat, 26 May 2001, Andrea Arcangeli wrote:
>
> > I didn't check the other alloc_pages() thing mentioned by Ben; if
> > alloc_pages() deadlocks internally, that's yet another completely
> > orthogonal bug, and it will be addressed by another patch if it
> > persists.
>
> O, that part is fixed by the patch that Linus threw away
> yesterday ;)

What are you smoking?  I read your patch and there's nothing related to
such a fix in your patch.

I won't have much more time to follow up on this thread.  From now on,
on the topic of the highmem deadlocks, I will only listen to real world
bugreports of 2.4.5 plus the bugfix I posted a few minutes ago.  If
anybody is able to reproduce the deadlocks, press SYSRQ+T and send me
the output along with the System.map.  Thanks in advance for your
cooperation.

Andrea

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:38 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 14:40 ` Rik van Riel 2001-05-26 15:17 ` Linux-2.4.5 Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 14:40 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > On Sat, May 26, 2001 at 11:21:18AM -0300, Rik van Riel wrote: > > On Sat, 26 May 2001, Andrea Arcangeli wrote: > > > > > I didn't checked the alloc_pages() other thing mentioned by Ben, if > > > alloc_pages() deadlocks internally that's yet another completly > > > orthogonal bug and that will be addressed by another patch if it > > > persists. > > > > O, that part is fixed by the patch that Linus threw away > > yesterday ;) > > what are you smoking? I am smoking the "tested the patch and wasn't able to reproduce a deadlock" stuff. > I read your patch and there's nothing related to > such fix in your patch. I explained the thing to you about 5 times now. If you don't have the time to understand the deadlock, please don't try to confuse the issue by sending out non-fixes. Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:40 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:17 ` Linus Torvalds 2001-05-26 15:28 ` Linux-2.4.5 Rik van Riel 0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 15:17 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Rik van Riel wrote:
>
> I am smoking the "tested the patch and wasn't able to reproduce
> a deadlock" stuff.

I'd be happier if _anybody_ was smoking the "thought about the problem
some more" stuff as well.

I can easily imagine that this part of your patch

	if (gfp_mask & __GFP_WAIT) {
		memory_pressure++;
-		try_to_free_pages(gfp_mask);
-		goto try_again;
+		if (!order || free_shortage()) {
+			int progress = try_to_free_pages(gfp_mask);
+			if (progress || gfp_mask & __GFP_IO)
+				goto try_again;
+		}
	}

is fine.  The fact that other parts of your patch were NOT fine should
make you go "Hmm, maybe Linus dismissed it for a reason" instead.

Testing is good.  But I want to understand how we get into the
situation in the first place, and whether there are ways to alleviate
those problems too.  Testing and finding that the bug went away under
your workload is good.  But thinking that that should stop discussion
about the _proper_ fix is stupid, Rik.

		Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:17 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 15:28 ` Rik van Riel 2001-05-26 15:59 ` Linux-2.4.5 Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2001-05-26 15:28 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Linus Torvalds wrote:

> 	if (gfp_mask & __GFP_WAIT) {
> 		memory_pressure++;
> -		try_to_free_pages(gfp_mask);
> -		goto try_again;
> +		if (!order || free_shortage()) {
> +			int progress = try_to_free_pages(gfp_mask);
> +			if (progress || gfp_mask & __GFP_IO)
> +				goto try_again;
> +		}
> 	}

Yes, this is it.

> Testing is good. But I want to understand how we get into the
> situation in the first place, and whether there are ways to alleviate
> those problems too.

As I said: create_buffers() -> get_unused_buffer_head()
-> __alloc_pages() -> loop infinitely.

Your simplification of get_unused_buffer_head() fits in nicely with
this.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:28 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:59 ` Linus Torvalds 2001-05-26 22:12 ` Linux-2.4.5 Marcelo Tosatti 0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2001-05-26 15:59 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Rik van Riel wrote:
>
> > Testing is good. But I want to understand how we get into the
> > situation in the first place, and whether there are ways to alleviate
> > those problems too.
>
> As I said create_buffers() -> get_unused_buffer_head()
> -> __alloc_pages() -> loop infinitely.

No no no.

Get outside the small world of where it's looping.  Think more about
the problem of "how did we get into such a state that we're so low on
memory for swapouts in the FIRST PLACE?"

That's the problem I want to fix.  And I suspect part of the equation
is exactly the fact that we use SLAB_BUFFER for the buffer heads.

Example scenario:
 - we're low on memory, but not disastrously so.  We have lots of
   highmem pages, but not that much NORMAL.  But we're not
   uncomfortable yet.
 - somebody starts writing out to a file.  He nicely allocates HIGHMEM
   pages for the actual data (no memory pressure on HIGHMEM, so no
   "try_to_free_pages()" called at all).
 - the writer _also_ needs the buffer heads for those written pages (ie
   "generic_block_prepare_write()"), and here he allocates them with
   SLAB_BUFFER.  The NORMAL zone starts getting low, and we start
   calling "try_to_free_pages()".
 - we deplete "try_to_free_pages()" and start swapping.  Except
   everybody is calling "try_to_free_pages()" with GFP_BUFFER, so
   nobody can actually do any IO, and if we're unlucky there's not much
   to be free'd in the NORMAL zone.

Now do you start to see a pattern here?

You're trying to fix the symptoms, by attacking the final end.  And
what I've been trying to say is that this problem likely has a
higher-level _cause_, and I want that _cause_ fixed.  Not the symptoms.
Now, I suspect that part of the cause for this is the fact that we're using GFP_BUFFER where we shouldn't, which is why we cannot free stuff up properly in the NORMAL zone. Now, there could be other things going on too. I'm not saying that it is necessarily _just_ the experimental silly test-patch I sent out yesterday. But I _am_ saying that we should try to fix the root cause first, and then apply the page_alloc.c thing as a last-ditch effort (but I'd like that "last effort" to be something that nobody can trigger under any reasonable load). Do you NOW see my argument? The argument of "we shouldn't have gotten us into this situation in the first place". See? Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:59 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 22:12 ` Marcelo Tosatti 2001-05-27 6:53 ` Linux-2.4.5 Marcelo Tosatti 0 siblings, 1 reply; 64+ messages in thread
From: Marcelo Tosatti @ 2001-05-26 22:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rik van Riel, Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel

On Sat, 26 May 2001, Linus Torvalds wrote:
>
> On Sat, 26 May 2001, Rik van Riel wrote:
> >
> > > Testing is good. But I want to understand how we get into the
> > > situation in the first place, and whether there are ways to alleviate
> > > those problems too.
> >
> > As I said create_buffers() -> get_unused_buffer_head()
> > -> __alloc_pages() -> loop infinitely.
>
> No no no.
>
> Get outside the small world of where it's looping.

There are two problems here.

It is senseless to make GFP_BUFFER allocations loop inside
__alloc_pages() (calling try_to_free_pages()) because these allocations
_can't_ block.  If there is no free memory, they are going to loop.

This is "independent" from the highmem deadlock.  See?

Now let's talk about the highmem deadlock.

> Think more on the problem of "how did we get into such a state that we're
> so low on memory for swapouts in the FIRST PLACE?"
>
> That's the problem I want to fix. And I suspect part of the equation is
> exactly the fact that we use SLAB_BUFFER for the buffer heads.
>
> Example scenario:
> - we're low on memory, but not disastrously so. We have lots of highmem
>   pages, but not that much NORMAL. But we're not uncomfortable yet.
> - somebody starts writing out to a file. He nicely allocates HIGHMEM
>   pages for the actual data (no memory pressure on HIGHMEM, so no
>   "try_to_free_pages()" called at all)
> - the writer _also_ needs the buffer heads for those written pages (ie
>   "generic_block_prepare_write()"), and here he allocates them with
>   SLAB_BUFFER. The NORMAL zone starts getting low, and we start calling
>   "try_to_free_pages()".
> - we deplete "try_to_free_pages()" and start swapping. Except everybody
>   is calling "try_to_free_pages()" with GFP_BUFFER, so nobody can
>   actually do any IO, and if we're unlucky there's not much to be free'd
>   in the NORMAL zone.
>
> Now do you start to see a pattern here?
>
> You're trying to fix the symptoms, by attacking the final end. And what
> I've been trying to say is that this problem likely has a higher-level
> _cause_, and I want that _cause_ fixed. Not the symptoms.

You are not going to fix the problem by _only_ doing this bh allocation
change.

Even if some bh allocators _can_ block on IO, there is no guarantee
that they are going to free low memory --- they may start more IO on
highmem pages.

We cannot treat highmem as "yet another zone".  All highmem data goes
through lowmem before reaching the disk, so it's clear to me that we
should not try to write out highmem pages in case we have a lowmem
shortage.

Well, IMO.

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 22:12 ` Linux-2.4.5 Marcelo Tosatti @ 2001-05-27 6:53 ` Marcelo Tosatti 2001-06-03 23:32 ` Linux-2.4.5 Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread From: Marcelo Tosatti @ 2001-05-27 6:53 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, Ben LaHaise, Alan Cox, lkml On Sat, 26 May 2001, Marcelo Tosatti wrote: > > You're trying to fix the symptoms, by attacking the final end. And what > > I've been trying to say is that this problem likely has a higher-level > > _cause_, and I want that _cause_ fixed. Not the symptoms. > > You are not going to fix the problem by _only_ doing this bh allocation > change. > > Even if some bh allocators _can_ block on IO, there is no guarantee that > they are going to free low memory --- they may start more IO on highmem > pages. > > We cannot treat highmem as "yet another zone" zone. All highmem data goes > through the lowmem before reaching the disk, so its clear for me that we > should not try to write out highmem pages in case we have a lowmem > shortage. > > Well, IMO. I've just tried something similar to the attached patch, which is a "more aggressive" version of your suggestion to use SLAB_KERNEL for bh allocations when possible. This one makes all bh allocators block on IO. I've tested multiple "dd if=/dev/zero of=bigfile..." calls. (8GB machine, different amounts of data being written) The patch makes the kernel handle heavier loads much better than .5 and .5aa, but it does not fix the problem for this specific test. 
(this one is not against 2.4.5, but what I tested under .5 is basically
the same patch)

diff -Nur linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c	Mon Apr  2 23:41:08 2001
+++ linux/mm/vmscan.c	Tue Apr  3 01:53:13 2001
@@ -422,7 +422,7 @@
 int page_launder(int gfp_mask, int sync)
 {
 	int launder_loop, maxscan, cleaned_pages, maxlaunder;
-	int can_get_io_locks;
+	int can_get_io_locks, can_queue_buffers;
 	struct list_head * page_lru;
 	struct page * page;

@@ -431,6 +431,7 @@
 	 * buffers to disk) if __GFP_IO is set.
 	 */
 	can_get_io_locks = gfp_mask & __GFP_IO;
+	can_queue_buffers = gfp_mask & __GFP_PAGE_IO;

 	launder_loop = 0;
 	maxlaunder = 0;
@@ -482,7 +483,7 @@
 			goto page_active;

 		/* First time through? Move it to the back of the list */
-		if (!launder_loop) {
+		if (!launder_loop || can_queue_buffers) {
 			list_del(page_lru);
 			list_add(page_lru, &inactive_dirty_list);
 			UnlockPage(page);
@@ -612,7 +613,8 @@
 	 * loads, flush out the dirty pages before we have to wait on
 	 * IO.
 	 */
-	if (can_get_io_locks && !launder_loop && free_shortage()) {
+	if ((can_queue_buffers || can_get_io_locks) && !launder_loop
+			&& free_shortage()) {
 		launder_loop = 1;
 		/* If we cleaned pages, never do synchronous IO. */
 		if (cleaned_pages)
diff -Nur linux.orig/fs/buffer.c linux/fs/buffer.c
--- linux.orig/fs/buffer.c	Mon Apr  2 23:40:59 2001
+++ linux/fs/buffer.c	Tue Apr  3 01:54:26 2001
@@ -1231,7 +1231,7 @@
 	 * more buffer heads, because the swap-out may need
 	 * more buffer-heads itself.  Thus SLAB_BUFFER.
 	 */
-	if((bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER)) != NULL) {
+	if((bh = kmem_cache_alloc(bh_cachep, SLAB_PAGE_IO)) != NULL) {
 		memset(bh, 0, sizeof(*bh));
 		init_waitqueue_head(&bh->b_wait);
 		return bh;
@@ -2261,7 +2261,7 @@
 		return 0;
 	}

-	page = alloc_page(GFP_BUFFER);
+	page = alloc_page(GFP_PAGE_IO);
 	if (!page)
 		goto out;
 	LockPage(page);
diff -Nur linux.orig/include/linux/mm.h linux/include/linux/mm.h
--- linux.orig/include/linux/mm.h	Mon Apr  2 23:41:09 2001
+++ linux/include/linux/mm.h	Tue Apr  3 01:49:29 2001
@@ -480,8 +480,9 @@
 #define __GFP_HIGHMEM	0x0 /* noop */
 #endif
 #define __GFP_VM	0x20
+#define __GFP_PAGE_IO	0x40

-
+#define GFP_PAGE_IO	(__GFP_HIGH | __GFP_WAIT | __GFP_PAGE_IO)
 #define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT)
 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_USER	( __GFP_WAIT | __GFP_IO)
diff -Nur linux.orig/include/linux/slab.h linux/include/linux/slab.h
--- linux.orig/include/linux/slab.h	Mon Apr  2 23:41:11 2001
+++ linux/include/linux/slab.h	Tue Apr  3 01:50:01 2001
@@ -15,6 +15,7 @@
 #include <linux/cache.h>

 /* flags for kmem_cache_alloc() */
+#define SLAB_PAGE_IO	GFP_PAGE_IO
 #define SLAB_BUFFER	GFP_BUFFER
 #define SLAB_ATOMIC	GFP_ATOMIC
 #define SLAB_USER	GFP_USER
@@ -22,7 +23,7 @@
 #define SLAB_NFS	GFP_NFS
 #define SLAB_DMA	GFP_DMA

-#define SLAB_LEVEL_MASK	(__GFP_WAIT|__GFP_HIGH|__GFP_IO)
+#define SLAB_LEVEL_MASK	(__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_PAGE_IO)
 #define SLAB_NO_GROW	0x00001000UL	/* don't grow a cache */

 /* flags to pass to kmem_cache_create(). */

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-27 6:53 ` Linux-2.4.5 Marcelo Tosatti @ 2001-06-03 23:32 ` Linus Torvalds 2001-06-05 2:21 ` Linux-2.4.5 Marcelo Tosatti 0 siblings, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2001-06-03 23:32 UTC (permalink / raw) To: Marcelo Tosatti Cc: Rik van Riel, Andrea Arcangeli, Ben LaHaise, Alan Cox, lkml [ Back from Japan - don't start sending me tons of emails yet, as you can see I'm only picking up last week's discussion where it ended right now.. ] On Sat-Sun, 26-27 May 2001, Marcelo Tosatti wrote: > > You are not going to fix the problem by _only_ doing this bh allocation > change. I would obviously not disagree with that statement. There are multiple users of the low-memory zone, and they are all likely to have some of the same problems. > Even if some bh allocators _can_ block on IO, there is no guarantee that > they are going to free low memory --- they may start more IO on highmem > pages. Now, this was actually something I already referred to earlier in this same thread, see one of my first mails about this: Fri, 25 May 2001 21:22:05 Linus Torvalds wrote: >> >> For example, we used to have logic in swapout_process to _not_ swap >> out zones that don't need it. We changed swapout to happen in >> "page_launder()", but that logic got lost. It's entirely possible that >> we should just say "don't bother writing out dirty pages that are in >> zones that have no memory pressure", so that we don't use up pages >> from the _precious_ zones to free pages in zones that don't need >> freeing. So note how there are multiple facets to this problem. Marcelo goes on to write: > > I've just tried something similar to the attached patch, which is a "more > aggressive" version of your suggestion to use SLAB_KERNEL for bh > allocations when possible. This one makes all bh allocators block on IO. The patch looks fine. Has anybody else tried it? 
Along with, for example, something like this [warning: whitespace damage, I just cut-and-pasted this as a test-patch], we might actually _fix_ the problem: --- v2.4.5/linux/mm/vmscan.c Fri May 25 18:28:55 2001 +++ linux/mm/vmscan.c Sun Jun 3 16:26:20 2001 @@ -463,6 +463,7 @@ /* Page is or was in use? Move it to the active list. */ if (PageReferenced(page) || page->age > 0 || + page->zone->free_pages > page->zone->pages_max || (!page->buffers && page_count(page) > 1) || page_ramdisk(page)) { del_page_from_inactive_dirty_list(page); What the above does is fairly obvious: it considers all pages in zones that don't need to be free'd to be "young", and doesn't even try to write them out. Because we have absolutely no reason to do so. This is similar to what we used to do in "try_to_swap_out()" before, and it still makes sense. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-06-03 23:32 ` Linux-2.4.5 Linus Torvalds @ 2001-06-05 2:21 ` Marcelo Tosatti 0 siblings, 0 replies; 64+ messages in thread From: Marcelo Tosatti @ 2001-06-05 2:21 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, Ben LaHaise, Alan Cox, lkml (ugh, just found this mail lost around here) On Sun, 3 Jun 2001, Linus Torvalds wrote: > > [ Back from Japan - don't start sending me tons of emails yet, as you can > see I'm only picking up last week's discussion where it ended right > now.. ] > > On Sat-Sun, 26-27 May 2001, Marcelo Tosatti wrote: > > > > You are not going to fix the problem by _only_ doing this bh allocation > > change. > > I would obviously not disagree with that statement. There are multiple > users of the low-memory zone, and they are all likely to have some of the > same problems. > > > Even if some bh allocators _can_ block on IO, there is no guarantee that > > they are going to free low memory --- they may start more IO on highmem > > pages. > > Now, this was actually something I already referred to earlier in this > same thread, see one of my first mails about this: > > Fri, 25 May 2001 21:22:05 Linus Torvalds wrote: > >> > >> For example, we used to have logic in swapout_process to _not_ swap > >> out zones that don't need it. We changed swapout to happen in > >> "page_launder()", but that logic got lost. It's entirely possible that > >> we should just say "don't bother writing out dirty pages that are in > >> zones that have no memory pressure", so that we don't use up pages > >> from the _precious_ zones to free pages in zones that don't need > >> freeing. > > So note how there are multiple facets to this problem. > > > Marcelo goes on to write: > > > > I've just tried something similar to the attached patch, which is a "more > > aggressive" version of your suggestion to use SLAB_KERNEL for bh > > allocations when possible. This one makes all bh allocators block on IO. > > The patch looks fine. 
Has anybody else tried it? The XFS people have been using GFP_PAGE_IO for some time in their CVS. (getblk is not using GFP_PAGE_IO there, though). > Along with, for example, something like this [warning: whitespace damage, > I just cut-and-pasted this as a test-patch], we might actually _fix_ the > problem: > > --- v2.4.5/linux/mm/vmscan.c Fri May 25 18:28:55 2001 > +++ linux/mm/vmscan.c Sun Jun 3 16:26:20 2001 > @@ -463,6 +463,7 @@ > > /* Page is or was in use? Move it to the active list. */ > if (PageReferenced(page) || page->age > 0 || > + page->zone->free_pages > page->zone->pages_max || > (!page->buffers && page_count(page) > 1) || > page_ramdisk(page)) { > del_page_from_inactive_dirty_list(page); > > What the above does is fairly obvious: it considers all pages in zones > that don't need to be free'd to be "young", and doesn't even try to write > them out. Because we have absolutely no reason to do so. This patch makes perfect sense, but it does not stop us from writing out highmem pages (_even_ if the highmem zone has a shortage) when lowmem is under shortage. We _need_ low memory to write out high memory, that's my point. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:21 ` Linux-2.4.5 Rik van Riel 2001-05-26 14:38 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 15:09 ` Linus Torvalds 2001-05-26 15:18 ` Linux-2.4.5 Rik van Riel 1 sibling, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2001-05-26 15:09 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel On Sat, 26 May 2001, Rik van Riel wrote: > > O, that part is fixed by the patch that Linus threw away > yesterday ;) Rik, I threw away the parts of the patch that were bad and obvious band-aids, and it was hard to tell whether any of your patch was a "real" fix as opposed to just making more reservations. And you have not replied to any of the real arguments for fixing the _real_ bugs. You just repeat "my patch, my patch, my patch", without even acknowledging that others are finding the real _underlying_ problems. Please take a moment to realize that you're not exactly being helpful. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:09 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 15:18 ` Rik van Riel 2001-05-26 15:24 ` Linux-2.4.5 Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 15:18 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrea Arcangeli, Ben LaHaise, Alan Cox, linux-kernel On Sat, 26 May 2001, Linus Torvalds wrote: > On Sat, 26 May 2001, Rik van Riel wrote: > > > > O, that part is fixed by the patch that Linus threw away > > yesterday ;) > > Rik, I threw away the parts of the patch that were bad and obvious > band-aids, and it was hard to tell whether any of your patch was a > "real" fix as opposed to just making more reservations. 1) Remove GFP_BUFFER and HIGHMEM related deadlocks, by letting these allocations fail instead of looping forever in __alloc_pages() when they cannot make any progress there. It's the changes to __alloc_pages(), where we don't loop forever but fail the allocation. Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:18 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:24 ` Andrea Arcangeli 2001-05-26 15:26 ` Linux-2.4.5 Rik van Riel 0 siblings, 1 reply; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 15:24 UTC (permalink / raw) To: Rik van Riel; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel On Sat, May 26, 2001 at 12:18:07PM -0300, Rik van Riel wrote: > It's the changes to __alloc_pages(), where we don't loop forever > but fail the allocation. __alloc_pages() should definitely not loop forever but it should fail the allocation instead. If it doesn't right now, that is yet another bug, at least with the 2.4 memory management design where we don't know if there is memory available or not when we do the allocation. I'd really appreciate it if you could re-post this strictly necessary localized fix; I will definitely agree with that one. Andrea ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:24 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 15:26 ` Rik van Riel 2001-05-26 15:40 ` Linux-2.4.5 Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 15:26 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > On Sat, May 26, 2001 at 12:18:07PM -0300, Rik van Riel wrote: > > It's the changes to __alloc_pages(), where we don't loop forever > > but fail the allocation. > > __alloc_pages() should definitely not loop forever but it should > fail the allocation instead. Guess what's in my patch? Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:26 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:40 ` Andrea Arcangeli 0 siblings, 0 replies; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 15:40 UTC (permalink / raw) To: Rik van Riel; +Cc: Linus Torvalds, Ben LaHaise, Alan Cox, linux-kernel On Sat, May 26, 2001 at 12:26:38PM -0300, Rik van Riel wrote: > Guess what's in my patch? that part is fine indeed; I have been saying for ages that alloc_pages must always be allowed to fail or things will deadlock, go back to the discussion with Ingo of a few months ago. Finally you seem convinced about that. Andrea ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 3:11 ` Linux-2.4.5 Andrea Arcangeli 2001-05-26 4:22 ` Linux-2.4.5 Linus Torvalds @ 2001-05-26 4:45 ` Rik van Riel 2001-05-26 4:47 ` Linux-2.4.5 Rik van Riel 2001-05-26 14:32 ` Linux-2.4.5 Andrea Arcangeli 1 sibling, 2 replies; 64+ messages in thread From: Rik van Riel @ 2001-05-26 4:45 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > On Fri, May 25, 2001 at 10:49:38PM -0400, Ben LaHaise wrote: > > Highmem. 0 free pages in ZONE_NORMAL. Now try to allocate a buffer_head. > > That's a longstanding deadlock, it was there the first time I read > fs/buffer.c, nothing related to highmem, we have it in 2.2 too. Also > getblk is deadlock prone in a similar manner. Not only this, but the fix is _surprisingly_ simple... All we need to make sure of is that the following order of allocations is possible and that we back out instead of deadlocking when we don't succeed at any step. 1) normal user allocation 2) buffer allocation (bounce buffer + bufferhead) 3) allocation from interrupt (for device driver) This is fixed by the patch I sent because: 1) user allocations stop when we reach zone->pages_min and keep looping until we freed some memory ... well, these don't just loop because we can guarantee that freeing memory with GFP_USER or GFP_KERNEL is possible 2) GFP_BUFFER allocations can allocate down to the point where free pages go to zone->pages_min * 3 / 4, so we can continue to swapout highmem pages when userland allocations have stopped ... 
this is needed because otherwise we cannot always make progress on highmem pages and we could have the effective amount of RAM in the system reduced to less than 1GB, in really bad cases not having this could even cause a deadlock 3) If the device driver needs to allocate something, it has from zone->pages_min*3/4 down to zone->pages_min/4 space to allocate stuff, this should be very useful for swap or mmap() over the network, or to encrypted block devices, etc... > Can you try to simply change NR_RESERVED to say 200*MAX_BUF_PER_PAGE > and see if it makes a difference? No Comment(tm) *grin* cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:45 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 4:47 ` Rik van Riel 2001-05-26 6:07 ` Linux-2.4.5 Ben LaHaise 2001-05-26 14:32 ` Linux-2.4.5 Andrea Arcangeli 1 sibling, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 4:47 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, 26 May 2001, Rik van Riel wrote: > 1) normal user allocation > 2) buffer allocation (bounce buffer + bufferhead) > 3) allocation from interrupt (for device driver) Hmmmm, now that I think of it, we always need to be able to guarantee _both_ 2) and 3). For different allocators and interrupts. I guess the only long-term solution is to have real memory reservation strategies, like the one in Ben's patch. That might be a 2.5 thing, though ... Ben? regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:47 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 6:07 ` Ben LaHaise 0 siblings, 0 replies; 64+ messages in thread From: Ben LaHaise @ 2001-05-26 6:07 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-kernel On Sat, 26 May 2001, Rik van Riel wrote: > That might be a 2.5 thing, though ... Ben? It isn't an immediate priority for me at this moment, but in a few weeks I hope to have it back at the top of my concerns. If anyone considers it truly important, I might be persuaded to give it a bit more nudging along, but if people try to explain to me that the concept is completely unnecessary, I'll redirect their notes to /dev/null. -ben ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 4:45 ` Linux-2.4.5 Rik van Riel 2001-05-26 4:47 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 14:32 ` Andrea Arcangeli 2001-05-26 14:36 ` Linux-2.4.5 Rik van Riel 1 sibling, 1 reply; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 14:32 UTC (permalink / raw) To: Rik van Riel; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, May 26, 2001 at 01:45:27AM -0300, Rik van Riel wrote: > 3) If the device driver needs to allocate something, it > has from zone->pages_min*3/4 down to zone->pages_min/4 > space to allocate stuff, this should be very useful > for swap or mmap() over the network, or to encrypted > block devices, etc... Anything supposed to work because there's enough memory between zone->pages_min*3/4 and zone->pages_min/4 is just obviously broken period. > > Can you try to simply change NR_RESERVED to say 200*MAX_BUF_PER_PAGE > > and see if it makes a difference? > > No Comment(tm) *grin* I'm having lots of fun, thanks. Andrea ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:32 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 14:36 ` Rik van Riel 2001-05-26 15:03 ` Linux-2.4.5 Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 14:36 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > On Sat, May 26, 2001 at 01:45:27AM -0300, Rik van Riel wrote: > > 3) If the device driver needs to allocate something, it > > has from zone->pages_min*3/4 down to zone->pages_min/4 > > space to allocate stuff, this should be very useful > > for swap or mmap() over the network, or to encrypted > > block devices, etc... > > Anything supposed to work because there's enough memory between > zone->pages_min*3/4 and zone->pages_min/4 is just obviously broken > period. No. It's not about having enough memory between those levels. It's about _failing_ the allocation when you reach a limit. > > > Can you try to simply change NR_RESERVED to say 200*MAX_BUF_PER_PAGE > > > and see if it makes a difference? > > > > No Comment(tm) *grin* > > I'm having lots of fun, thanks. Now _this_ is tweaking magic limits ;) cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 14:36 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:03 ` Andrea Arcangeli 2001-05-26 15:08 ` Linux-2.4.5 Rik van Riel 0 siblings, 1 reply; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 15:03 UTC (permalink / raw) To: Rik van Riel; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, May 26, 2001 at 11:36:22AM -0300, Rik van Riel wrote: > On Sat, 26 May 2001, Andrea Arcangeli wrote: > > > No Comment(tm) *grin* > > > > I'm having lots of fun, thanks. > > Now _this_ is tweaking magic limits ;) Others agreed that the real source of the create_buffers deadlock could be just too few reserved pages in the unused_list; the unused_list is the only "deadlock avoidance logic" designed to keep create_buffers from deadlocking, so I really don't see why you don't listen to me and just assume my idea is obviously wrong while it obviously isn't. The only thing I can do is laugh while reading your illogical "No Comment(tm) *grin*"; if I took such comments seriously I would just ignore the VM and stay in other subsystems where those funny things never happen, as far as I can tell, which I'm not going to do just because of your new funny "No Comment(tm) *grin*". If for any reason you notice I somehow invite those things to happen, please let me know and I will certainly try to do better on my part. Andrea ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:03 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 15:08 ` Rik van Riel 2001-05-26 15:20 ` Linux-2.4.5 Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread From: Rik van Riel @ 2001-05-26 15:08 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel On Sat, 26 May 2001, Andrea Arcangeli wrote: > Others agreed that the real source of the create_buffers could be just > too few reserved pages in the unused_list To quote you, from the message to which I replied with the "No Comment" comment: ------> Andrea Arcangeli wrote: Anything supposed to work because there's enough memory between zone->pages_min*3/4 and zone->pages_min/4 is just obviously broken period. ------ And not 10 lines lower you decide to raise some magic limit yourself. I guess my irony threshold must be lower than yours, or something. cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 15:08 ` Linux-2.4.5 Rik van Riel @ 2001-05-26 15:20 ` Andrea Arcangeli 0 siblings, 0 replies; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 15:20 UTC (permalink / raw) To: Rik van Riel; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel [-- Attachment #1: Type: text/plain, Size: 637 bytes --] On Sat, May 26, 2001 at 12:08:34PM -0300, Rik van Riel wrote: > On Sat, 26 May 2001, Andrea Arcangeli wrote: > > > Others agreed that the real source of the create_buffers could be just > > too few reserved pages in the unused_list > > To quote you, from the message to which I replied with the > "No Comment" comment: > > ------> Andrea Arcangeli wrote: > Anything supposed to work because there's enough memory between > zone->pages_min*3/4 and zone->pages_min/4 is just obviously broken > period. > ------ What are you smoking again? You said "No Comment(tm) *grin*" to my suggestion to increase NR_RESERVED (attached). Andrea [-- Attachment #2: Type: message/rfc822, Size: 3674 bytes --] From: Rik van Riel <riel@conectiva.com.br> To: Andrea Arcangeli <andrea@suse.de> Cc: Ben LaHaise <bcrl@redhat.com>, Linus Torvalds <torvalds@transmeta.com>, Alan Cox <alan@lxorguk.ukuu.org.uk>, linux-kernel@vger.kernel.org Subject: Re: Linux-2.4.5 Date: Sat, 26 May 2001 01:45:27 -0300 (BRST) Message-ID: <Pine.LNX.4.21.0105260137140.30264-100000@imladris.rielhome.conectiva> On Sat, 26 May 2001, Andrea Arcangeli wrote: > On Fri, May 25, 2001 at 10:49:38PM -0400, Ben LaHaise wrote: > > Highmem. 0 free pages in ZONE_NORMAL. Now try to allocate a buffer_head. > > That's a longstanding deadlock, it was there the first time I read > fs/buffer.c, nothing related to highmem, we have it in 2.2 too. Also > getblk is deadlock prone in a smiliar manner. Not only this, but the fix is _surprisingly_ simple... All we need to make sure of is that the following order of allocations is possible and that we back out instead of deadlock when we don't succeed at any step. 
1) normal user allocation 2) buffer allocation (bounce buffer + bufferhead) 3) allocation from interrupt (for device driver) This is fixed by the patch I sent because: 1) user allocations stop when we reach zone->pages_min and keep looping until we freed some memory ... well, these don't just loop because we can guarantee that freeing memory with GFP_USER or GFP_KERNEL is possible 2) GFP_BUFFER allocations can allocate down to the point where free pages go to zone->pages_min * 3 / 4, so we can continue to swapout highmem pages when userland allocations have stopped ... this is needed because otherwise we cannot always make progress on highmem pages and we could have the effective amount of RAM in the system reduced to less than 1GB, in really bad cases not having this could even cause a deadlock 3) If the device driver needs to allocate something, it has from zone->pages_min*3/4 down to zone->pages_min/4 space to allocate stuff, this should be very useful for swap or mmap() over the network, or to encrypted block devices, etc... > Can you try to simply change NR_RESERVED to say 200*MAX_BUF_PER_PAGE > and see if it makes a difference? No Comment(tm) *grin* cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Linux-2.4.5 2001-05-26 2:11 ` Linux-2.4.5 Ben LaHaise 2001-05-26 2:38 ` Linux-2.4.5 Andrea Arcangeli @ 2001-05-26 15:41 ` Rik van Riel 1 sibling, 0 replies; 64+ messages in thread From: Rik van Riel @ 2001-05-26 15:41 UTC (permalink / raw) To: Ben LaHaise; +Cc: Andrea Arcangeli, Linus Torvalds, Alan Cox, linux-kernel On Fri, 25 May 2001, Ben LaHaise wrote: > +++ work/mm/vmscan.c Mon May 14 16:43:05 2001 > @@ -636,6 +636,12 @@ > + if (gfp_mask & __GFP_WAIT) { Without __GFP_WAIT, we will never call do_try_to_free_pages() Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup 2001-05-26 0:29 ` Ben LaHaise 2001-05-26 0:34 ` Linus Torvalds @ 2001-05-26 0:42 ` Andrea Arcangeli 2001-05-26 0:52 ` Ben LaHaise 2001-05-26 1:43 ` Rik van Riel 1 sibling, 2 replies; 64+ messages in thread From: Andrea Arcangeli @ 2001-05-26 0:42 UTC (permalink / raw) To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel On Fri, May 25, 2001 at 08:29:38PM -0400, Ben LaHaise wrote: > amount of bounce buffers to guarentee progress while submitting io. The > -ac kernels have a patch from Ingo that provides private pools for bounce > buffers and buffer_heads. I went a step further and have a memory > reservation patch that provides for memory pools being reserved against a > particular zone. This is needed to prevent the starvation that irq > allocations can cause. > > Some of these cleanups are 2.5 fodder, but we really need something in 2.4 > right now, so... Please merge this one in 2.4 for now (originally from Ingo, I only improved it), this is a real definitive fix and there's no nicer way to handle that unless you want to generalize an API for people to generate private anti-deadlock ("make sure to always make a progress") memory pools: diff -urN 2.4.4/mm/highmem.c highmem-deadlock/mm/highmem.c --- 2.4.4/mm/highmem.c Sat Apr 28 05:24:48 2001 +++ highmem-deadlock/mm/highmem.c Sat Apr 28 18:21:24 2001 @@ -159,6 +159,19 @@ spin_unlock(&kmap_lock); } +#define POOL_SIZE 32 + +/* + * This lock gets no contention at all, normally. + */ +static spinlock_t emergency_lock = SPIN_LOCK_UNLOCKED; + +int nr_emergency_pages; +static LIST_HEAD(emergency_pages); + +int nr_emergency_bhs; +static LIST_HEAD(emergency_bhs); + /* * Simple bounce buffer support for highmem pages. * This will be moved to the block layer in 2.5. 
@@ -203,17 +216,72 @@ static inline void bounce_end_io (struct buffer_head *bh, int uptodate) { + struct page *page; struct buffer_head *bh_orig = (struct buffer_head *)(bh->b_private); + unsigned long flags; bh_orig->b_end_io(bh_orig, uptodate); - __free_page(bh->b_page); + + page = bh->b_page; + + spin_lock_irqsave(&emergency_lock, flags); + if (nr_emergency_pages >= POOL_SIZE) + __free_page(page); + else { + /* + * We are abusing page->list to manage + * the highmem emergency pool: + */ + list_add(&page->list, &emergency_pages); + nr_emergency_pages++; + } + + if (nr_emergency_bhs >= POOL_SIZE) { #ifdef HIGHMEM_DEBUG - /* Don't clobber the constructed slab cache */ - init_waitqueue_head(&bh->b_wait); + /* Don't clobber the constructed slab cache */ + init_waitqueue_head(&bh->b_wait); #endif - kmem_cache_free(bh_cachep, bh); + kmem_cache_free(bh_cachep, bh); + } else { + /* + * Ditto in the bh case, here we abuse b_inode_buffers: + */ + list_add(&bh->b_inode_buffers, &emergency_bhs); + nr_emergency_bhs++; + } + spin_unlock_irqrestore(&emergency_lock, flags); } +static __init int init_emergency_pool(void) +{ + spin_lock_irq(&emergency_lock); + while (nr_emergency_pages < POOL_SIZE) { + struct page * page = alloc_page(GFP_ATOMIC); + if (!page) { + printk("couldn't refill highmem emergency pages"); + break; + } + list_add(&page->list, &emergency_pages); + nr_emergency_pages++; + } + while (nr_emergency_bhs < POOL_SIZE) { + struct buffer_head * bh = kmem_cache_alloc(bh_cachep, SLAB_ATOMIC); + if (!bh) { + printk("couldn't refill highmem emergency bhs"); + break; + } + list_add(&bh->b_inode_buffers, &emergency_bhs); + nr_emergency_bhs++; + } + spin_unlock_irq(&emergency_lock); + printk("allocated %d pages and %d bhs reserved for the highmem bounces\n", + nr_emergency_pages, nr_emergency_bhs); + + return 0; +} + +__initcall(init_emergency_pool); + static void bounce_end_io_write (struct buffer_head *bh, int uptodate) { bounce_end_io(bh, uptodate); @@ -228,6 +296,82 @@ 
 	bounce_end_io(bh, uptodate);
 }
 
+struct page *alloc_bounce_page (void)
+{
+	struct list_head *tmp;
+	struct page *page;
+
+repeat_alloc:
+	page = alloc_page(GFP_BUFFER);
+	if (page)
+		return page;
+	/*
+	 * No luck. First, kick the VM so it doesnt idle around while
+	 * we are using up our emergency rations.
+	 */
+	wakeup_bdflush(0);
+
+	/*
+	 * Try to allocate from the emergency pool.
+	 */
+	tmp = &emergency_pages;
+	spin_lock_irq(&emergency_lock);
+	if (!list_empty(tmp)) {
+		page = list_entry(tmp->next, struct page, list);
+		list_del(tmp->next);
+		nr_emergency_pages--;
+	}
+	spin_unlock_irq(&emergency_lock);
+	if (page)
+		return page;
+
+	/* we need to wait I/O completion */
+	run_task_queue(&tq_disk);
+
+	current->policy |= SCHED_YIELD;
+	__set_current_state(TASK_RUNNING);
+	schedule();
+	goto repeat_alloc;
+}
+
+struct buffer_head *alloc_bounce_bh (void)
+{
+	struct list_head *tmp;
+	struct buffer_head *bh;
+
+repeat_alloc:
+	bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER);
+	if (bh)
+		return bh;
+	/*
+	 * No luck. First, kick the VM so it doesnt idle around while
+	 * we are using up our emergency rations.
+	 */
+	wakeup_bdflush(0);
+
+	/*
+	 * Try to allocate from the emergency pool.
+	 */
+	tmp = &emergency_bhs;
+	spin_lock_irq(&emergency_lock);
+	if (!list_empty(tmp)) {
+		bh = list_entry(tmp->next, struct buffer_head, b_inode_buffers);
+		list_del(tmp->next);
+		nr_emergency_bhs--;
+	}
+	spin_unlock_irq(&emergency_lock);
+	if (bh)
+		return bh;
+
+	/* we need to wait I/O completion */
+	run_task_queue(&tq_disk);
+
+	current->policy |= SCHED_YIELD;
+	__set_current_state(TASK_RUNNING);
+	schedule();
+	goto repeat_alloc;
+}
+
 struct buffer_head * create_bounce(int rw, struct buffer_head * bh_orig)
 {
 	struct page *page;
@@ -236,24 +380,15 @@
 	if (!PageHighMem(bh_orig->b_page))
 		return bh_orig;
 
-repeat_bh:
-	bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER);
-	if (!bh) {
-		wakeup_bdflush(1); /* Sets task->state to TASK_RUNNING */
-		goto repeat_bh;
-	}
+	bh = alloc_bounce_bh();
 	/*
 	 * This is wasteful for 1k buffers, but this is a stopgap measure
 	 * and we are being ineffective anyway. This approach simplifies
 	 * things immensly. On boxes with more than 4GB RAM this should
 	 * not be an issue anyway.
 	 */
-repeat_page:
-	page = alloc_page(GFP_BUFFER);
-	if (!page) {
-		wakeup_bdflush(1); /* Sets task->state to TASK_RUNNING */
-		goto repeat_page;
-	}
+	page = alloc_bounce_page();
+
 	set_bh_page(bh, page, 0);
 
 	bh->b_next = NULL;

And this one as well to avoid tight loops in getblk without reschedules
in between when normal zone is empty:

diff -urN 2.4.4pre1/fs/buffer.c 2.4.4pre1-blkdev/fs/buffer.c
--- 2.4.4pre1/fs/buffer.c	Sun Apr  1 01:17:30 2001
+++ 2.4.4pre1-blkdev/fs/buffer.c	Mon Apr  9 15:37:20 2001
@@ -628,7 +622,7 @@
    to do in order to release the ramdisk memory is to destroy dirty buffers.
 
    These are two special cases. Normal usage imply the device driver
-   to issue a sync on the device (without waiting I/O completation) and
+   to issue a sync on the device (without waiting I/O completion) and
    then an invalidate_buffers call that doesn't trash dirty buffers.
 */
 void __invalidate_buffers(kdev_t dev, int destroy_dirty_buffers)
 {
@@ -762,7 +756,12 @@
 		balance_dirty(NODEV);
 	if (free_shortage())
 		page_launder(GFP_BUFFER, 0);
-	grow_buffers(size);
+	if (!grow_buffers(size)) {
+		wakeup_bdflush(1);
+		current->policy |= SCHED_YIELD;
+		__set_current_state(TASK_RUNNING);
+		schedule();
+	}
 }
 
 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
@@ -1027,12 +1026,13 @@
 		write_unlock(&hash_table_lock);
 		spin_unlock(&lru_list_lock);
 		refill_freelist(size);
+		/* FIXME: getblk should fail if there's no enough memory */
 		goto repeat;
 	}
 
 /* -1 -> no need to flush
     0 -> async flush
-    1 -> sync flush (wait for I/O completation) */
+    1 -> sync flush (wait for I/O completion) */
 int balance_dirty_state(kdev_t dev)
 {
 	unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;
@@ -1431,6 +1431,7 @@
 {
 	struct buffer_head *bh, *head, *tail;
 
+	/* FIXME: create_buffers should fail if there's no enough memory */
 	head = create_buffers(page, blocksize, 1);
 	if (page->buffers)
 		BUG();
@@ -2367,11 +2368,9 @@
 	spin_lock(&free_list[index].lock);
 	tmp = bh;
 	do {
-		struct buffer_head *p = tmp;
-
-		tmp = tmp->b_this_page;
-		if (buffer_busy(p))
+		if (buffer_busy(tmp))
 			goto busy_buffer_page;
+		tmp = tmp->b_this_page;
 	} while (tmp != bh);
 	spin_lock(&unused_list_lock);

Andrea
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  0:42     ` [with-PATCH-really] highmem deadlock removal, balancing & cleanup Andrea Arcangeli
@ 2001-05-26  0:52       ` Ben LaHaise
  2001-05-26  1:27         ` Andrea Arcangeli
  2001-05-26  1:43       ` Rik van Riel
  1 sibling, 1 reply; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26  0:52 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> Please merge this one in 2.4 for now (originally from Ingo, I only
> improved it), this is a real definitive fix and there's no nicer way to
> handle that unless you want to generalize an API for people to generate
> private anti-deadlock ("make sure to always make a progress") memory
> pools:

Alternatively, the following might be more interesting...

		-ben

diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/arch/i386/kernel/irq.c linux-toolbox-current/arch/i386/kernel/irq.c
--- kernel-2.4.3-works/linux.orig/arch/i386/kernel/irq.c	Thu May 10 16:04:39 2001
+++ linux-toolbox-current/arch/i386/kernel/irq.c	Thu May 10 12:16:21 2001
@@ -32,6 +32,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/irq.h>
 #include <linux/proc_fs.h>
+#include <linux/mm/reservation.h>
 
 #include <asm/atomic.h>
 #include <asm/io.h>
@@ -576,7 +577,10 @@
 	irq_desc_t *desc = irq_desc + irq;
 	struct irqaction * action;
 	unsigned int status;
+	struct page_reservation *saved_irq_rsv;
 
+	saved_irq_rsv = current->page_reservations;
+	current->page_reservations = &irq_rsv;
 	kstat.irqs[cpu][irq]++;
 	spin_lock(&desc->lock);
 	desc->handler->ack(irq);
@@ -638,6 +642,7 @@
 	if (softirq_active(cpu) & softirq_mask(cpu))
 		do_softirq();
 
+	current->page_reservations = saved_irq_rsv;
 	return 1;
 }
 
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/arch/i386/mm/fault.c linux-toolbox-current/arch/i386/mm/fault.c
--- kernel-2.4.3-works/linux.orig/arch/i386/mm/fault.c	Thu May 10 16:04:40 2001
+++ linux-toolbox-current/arch/i386/mm/fault.c	Mon May 14 13:26:57 2001
@@ -196,6 +196,7 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
+	atomic_inc(&mm->in_fault_count);
 	down_read(&mm->mmap_sem);
 
 	vma = find_vma(mm, address);
@@ -269,6 +270,7 @@
 		if (bit < 32)
 			tsk->thread.screen_bitmap |= 1 << bit;
 	}
+	atomic_dec(&mm->in_fault_count);
 	up_read(&mm->mmap_sem);
 	return;
 
@@ -277,6 +279,7 @@
 * Fix it, but check if it's kernel or user first..
 */
 bad_area:
+	atomic_dec(&mm->in_fault_count);
 	up_read(&mm->mmap_sem);
 
 bad_area_nosemaphore:
@@ -339,6 +342,7 @@
 * us unable to handle the page fault gracefully.
 */
 out_of_memory:
+	atomic_dec(&mm->in_fault_count);
 	up_read(&mm->mmap_sem);
 	printk("VM: killing process %s\n", tsk->comm);
 	if (error_code & 4)
@@ -346,6 +350,7 @@
 		goto no_context;
 
 do_sigbus:
+	atomic_dec(&mm->in_fault_count);
 	up_read(&mm->mmap_sem);
 
 	/*
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/fs/buffer.c linux-toolbox-current/fs/buffer.c
--- kernel-2.4.3-works/linux.orig/fs/buffer.c	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/fs/buffer.c	Fri May 11 16:42:19 2001
@@ -45,6 +45,7 @@
 #include <linux/quotaops.h>
 #include <linux/iobuf.h>
 #include <linux/highmem.h>
+#include <linux/mm/reservation.h>
 
 #include <asm/uaccess.h>
 #include <asm/io.h>
@@ -2735,6 +2736,7 @@
 */
 int bdflush(void *sem)
 {
+	static struct page_reservation rsv;
 	struct task_struct *tsk = current;
 	int flushed;
 	/*
@@ -2748,6 +2750,12 @@
 	strcpy(tsk->comm, "bdflush");
 	bdflush_tsk = tsk;
 
+	init_page_reservation(&rsv, RSV_MULTISHOT, ZONE_NORMAL);
+	if (reserve_pages(&rsv, GFP_KERNEL, 64))
+		panic("bdflush unable to reserve emergency pages!\n");
+	tsk->page_reservations = &rsv;
+
+	/* avoid getting signals */
 	spin_lock_irq(&tsk->sigmask_lock);
 	flush_signals(tsk);
@@ -2778,6 +2786,8 @@
 			the next schedule will block.
 		*/
 			__set_current_state(TASK_RUNNING);
 		}
+
+	destroy_page_reservation(&rsv);
 }
 
 /*
@@ -2788,6 +2798,7 @@
 */
 int kupdate(void *sem)
 {
+	static struct page_reservation rsv;
 	struct task_struct * tsk = current;
 	int interval;
 
@@ -2795,6 +2806,11 @@
 	tsk->pgrp = 1;
 	strcpy(tsk->comm, "kupdated");
 
+	init_page_reservation(&rsv, RSV_MULTISHOT, ZONE_NORMAL);
+	if (reserve_pages(&rsv, GFP_KERNEL, 32))
+		panic("bdflush unable to reserve emergency pages!\n");
+	tsk->page_reservations = &rsv;
+
 	/* sigstop and sigcont will stop and wakeup kupdate */
 	spin_lock_irq(&tsk->sigmask_lock);
 	sigfillset(&tsk->blocked);
@@ -2833,6 +2849,7 @@
 #endif
 		sync_old_buffers();
 	}
+	destroy_page_reservation(&rsv);
 }
 
 static int __init bdflush_init(void)
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/include/linux/mm/reservation.h linux-toolbox-current/include/linux/mm/reservation.h
--- kernel-2.4.3-works/linux.orig/include/linux/mm/reservation.h	Wed Dec 31 19:00:00 1969
+++ linux-toolbox-current/include/linux/mm/reservation.h	Thu May 10 12:16:21 2001
@@ -0,0 +1,48 @@
+#ifndef __LINUX__MM__RESERVATION_H
+#define __LINUX__MM__RESERVATION_H
+/* inclinde/linux/mm/reservation.h
+ * written by Benjamin LaHaise
+ *
+ * Copyright 2001 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * based in part on ideas/code from Arjan Van de Ven and Stephen Tweedie.
+ */
+
+#define RSV_ONESHOT	0x00
+#define RSV_MULTISHOT	0x01	/* reservation will replenish itself */
+
+struct page_reservation {
+	struct list_head	list;
+	unsigned		avail, used;
+	int			flags;
+	zone_t			*zone;
+};
+
+extern struct page_reservation irq_rsv;
+
+extern void init_page_reservation(struct page_reservation *rsv, int flags, int zone);
+extern void destroy_page_reservation(struct page_reservation *rsv);
+
+/* Reservation is an all or nothing thing.  A successful reservation
+ * returns 0.  Anything else is a failure.
+ */
+extern int reserve_pages(struct page_reservation *rsv, int gfp_mask, unsigned count);
+
+/* Release a previously reserved amount of memory. */
+extern void put_reserved_pages(struct page_reservation *rsv, unsigned count);
+
+#endif
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/include/linux/mm.h linux-toolbox-current/include/linux/mm.h
--- kernel-2.4.3-works/linux.orig/include/linux/mm.h	Thu May 10 16:04:40 2001
+++ linux-toolbox-current/include/linux/mm.h	Mon May 14 13:33:43 2001
@@ -528,7 +528,7 @@
 
 #define GFP_BOUNCE	(__GFP_HIGH | __GFP_FAIL)
-#define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT)
+#define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT | __GFP_FAIL)
 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_USER	( __GFP_WAIT | __GFP_IO)
 #define GFP_HIGHUSER	( __GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/include/linux/mmzone.h linux-toolbox-current/include/linux/mmzone.h
--- kernel-2.4.3-works/linux.orig/include/linux/mmzone.h	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/include/linux/mmzone.h	Fri May 11 20:46:13 2001
@@ -50,6 +50,10 @@
 	unsigned long		inactive_dirty_pages;
 	unsigned long		pages_min, pages_low, pages_high;
 
+	/* Page reservation */
+	unsigned long		reserved_pages;
+	struct list_head	depleted_rsv_list;
+
 	/*
 	 * free areas of different sizes
 	 */
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/include/linux/sched.h linux-toolbox-current/include/linux/sched.h
--- kernel-2.4.3-works/linux.orig/include/linux/sched.h	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/include/linux/sched.h	Mon May 14 13:23:55 2001
@@ -210,6 +210,7 @@
 	pgd_t * pgd;
 	atomic_t mm_users;	/* How many users with user space? */
 	atomic_t mm_count;	/* How many references to "struct mm_struct" (users count as 1) */
+	atomic_t in_fault_count;	/* number of in progress page faults */
 	int map_count;		/* number of VMAs */
 	struct rw_semaphore mmap_sem;
 	spinlock_t page_table_lock;	/* Protects task page tables and mm->rss */
@@ -241,6 +242,7 @@
 	pgd:		swapper_pg_dir,		\
 	mm_users:	ATOMIC_INIT(2),		\
 	mm_count:	ATOMIC_INIT(1),		\
+	in_fault_count:	ATOMIC_INIT(0),		\
 	map_count:	1,			\
 	mmap_sem:	__RWSEM_INITIALIZER(name.mmap_sem), \
 	page_table_lock: SPIN_LOCK_UNLOCKED,	\
@@ -406,6 +408,8 @@
 	u32 self_exec_id;
 /* Protection of (de-)allocation: mm, files, fs, tty */
 	spinlock_t alloc_lock;
+
+	struct page_reservation	*page_reservations;
 };
 
 /*
@@ -486,7 +490,8 @@
 	sig:		&init_signals,					\
 	pending:	{ NULL, &tsk.pending.head, {{0}}},		\
 	blocked:	{{0}},						\
-	alloc_lock:	SPIN_LOCK_UNLOCKED				\
+	alloc_lock:	SPIN_LOCK_UNLOCKED,				\
+	page_reservations: NULL,					\
 }
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/init/main.c linux-toolbox-current/init/main.c
--- kernel-2.4.3-works/linux.orig/init/main.c	Thu May 10 16:04:39 2001
+++ linux-toolbox-current/init/main.c	Thu May 10 21:00:20 2001
@@ -28,6 +28,7 @@
 #include <linux/iobuf.h>
 #include <linux/bootmem.h>
 #include <linux/tty.h>
+#include <linux/mm/reservation.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -88,6 +89,9 @@
 
 static int init(void *);
 
+extern struct page_reservation atomic_rsv;
+extern struct page_reservation swap_rsv;
+
 extern void init_IRQ(void);
 extern void init_modules(void);
 extern void sock_init(void);
@@ -654,6 +658,13 @@
 	proc_root_init();
 #endif
 	mempages = num_physpages;
+
+	if (reserve_pages(&irq_rsv, GFP_KERNEL, mempages >> 9))
+		panic("unable to reserve irq memory.\n");
+	if (reserve_pages(&swap_rsv, GFP_KERNEL, mempages >> 9))
+		panic("unable to reserve swap memory.\n");
+	if (reserve_pages(&atomic_rsv, GFP_KERNEL, mempages >> 10))
+		panic("unable to reserve atomic memory.\n");
 
 	fork_init(mempages);
 	proc_caches_init();
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/kernel/exit.c linux-toolbox-current/kernel/exit.c
--- kernel-2.4.3-works/linux.orig/kernel/exit.c	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/kernel/exit.c	Thu May 10 12:15:34 2001
@@ -10,6 +10,7 @@
 #include <linux/smp_lock.h>
 #include <linux/module.h>
 #include <linux/tty.h>
+#include <linux/mm/reservation.h>
 #ifdef CONFIG_BSD_PROCESS_ACCT
 #include <linux/acct.h>
 #endif
@@ -422,6 +423,11 @@
 NORET_TYPE void do_exit(long code)
 {
 	struct task_struct *tsk = current;
+
+	if (tsk->page_reservations) {
+		destroy_page_reservation(tsk->page_reservations);
+		tsk->page_reservations = NULL;
+	}
 
 	if (in_interrupt())
 		panic("Aiee, killing interrupt handler!");
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/kernel/fork.c linux-toolbox-current/kernel/fork.c
--- kernel-2.4.3-works/linux.orig/kernel/fork.c	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/kernel/fork.c	Mon May 14 13:24:45 2001
@@ -203,6 +203,7 @@
 {
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
+	atomic_set(&mm->in_fault_count, 0);
 	init_rwsem(&mm->mmap_sem);
 	mm->page_table_lock = SPIN_LOCK_UNLOCKED;
 	mm->pgd = pgd_alloc();
@@ -630,6 +631,7 @@
 	p->tty_old_pgrp = 0;
 	p->times.tms_utime = p->times.tms_stime = 0;
 	p->times.tms_cutime = p->times.tms_cstime = 0;
+	p->page_reservations = 0;
 #ifdef CONFIG_SMP
 	{
 		int i;
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/mm/page_alloc.c linux-toolbox-current/mm/page_alloc.c
--- kernel-2.4.3-works/linux.orig/mm/page_alloc.c	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/mm/page_alloc.c	Fri May 11 15:52:54 2001
@@ -18,7 +18,31 @@
 #include <linux/pagemap.h>
 #include <linux/bootmem.h>
 #include <linux/slab.h>
+#include <linux/mm/reservation.h>
 
+extern int
+	page_launder_calls,
+	page_launder_scans,
+	page_launder_scanned_pages,
+	page_launder_scanned_not_dirty,
+	page_launder_scanned_active,
+	page_launder_scanned_skipped,
+	page_launder_scanned_dirty,
+	page_launder_scanned_dirty_skip,
+	page_launder_scanned_dirty_nolaunder,
+	page_launder_scanned_dirty_swapcache,
+	page_launder_scanned_buffers,
+	page_launder_scanned_buffers_refiled,
+	page_launder_scanned_buffers_freed_buffer,
+	page_launder_scanned_buffers_active,
+	page_launder_scanned_buffers_cleaned,
+	page_launder_scanned_buffers_released,
+	page_launder_scanned_mapped_clean,
+	page_launder_scanned_still_active
+;
+struct page_reservation atomic_rsv;
+struct page_reservation swap_rsv;
+struct page_reservation irq_rsv;
 int nr_swap_pages;
 int nr_active_pages;
 int nr_inactive_dirty_pages;
@@ -99,7 +123,7 @@
 	page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty));
 	page->age = PAGE_AGE_START;
-	
+
 	zone = page->zone;
 
 	mask = (~0UL) << order;
@@ -115,7 +139,8 @@
 	__save_flags(flags);
 	__cli();
-	if (!order && (per_cpu->nr_pages < per_cpu->max_nr_pages)) {
+	if (!order && (per_cpu->nr_pages < per_cpu->max_nr_pages) && list_empty(&zone->depleted_rsv_list)) {
+static int foo; if (foo++ < 5) printk("freeing per-cpu page\n");
 		list_add(&page->list, &per_cpu->head);
 		per_cpu->nr_pages++;
 		__restore_flags(flags);
@@ -124,6 +149,20 @@
 
 	spin_lock(&zone->lock);
 
+	/* Check if we need to replenish any of this zone's reservations.
+	 */
+	if (!list_empty(&zone->depleted_rsv_list)) {
+		struct page_reservation *rsv = list_entry(zone->depleted_rsv_list.next, struct page_reservation, list);
+static int foo; if (foo++ < 5) printk("updating reserve: %p %u %u\n", rsv, rsv->avail, rsv->used);
+		if (!rsv->used)
+			BUG();
+		rsv->avail++;
+		rsv->used--;
+
+		list_del_init(&rsv->list);
+		if (rsv->used)
+			list_add(&rsv->list, zone->depleted_rsv_list.prev);
+	}
+
 	zone->free_pages -= mask;
 
 	while (mask + (1 << (MAX_ORDER-1))) {
@@ -190,12 +229,13 @@
 }
 
-static FASTCALL(struct page * rmqueue(zone_t *zone, unsigned long order));
-static struct page * rmqueue(zone_t *zone, unsigned long order)
+static FASTCALL(struct page * rmqueue(zone_t *zone, unsigned long order, struct page_reservation *rsv));
+static struct page * rmqueue(zone_t *zone, unsigned long order, struct page_reservation *rsv)
 {
 	per_cpu_t *per_cpu = zone->cpu_pages + smp_processor_id();
 	free_area_t * area = zone->free_area + order;
 	unsigned long curr_order = order;
+	unsigned long free_pages;
 	struct list_head *head, *curr;
 	unsigned long flags;
 	struct page *page;
@@ -216,7 +256,15 @@
 	}
 
 	spin_lock(&zone->lock);
-	do {
+
+	free_pages = zone->free_pages;
+	free_pages -= zone->reserved_pages;
+
+	/* Using a reservation?
+	 */
+	if (rsv)
+		free_pages += rsv->avail;
+
+	if (free_pages >= (1 << order)) do {
 		head = &area->free_list;
 		curr = memlist_next(head);
 
@@ -232,6 +280,16 @@
 			zone->free_pages -= 1 << order;
 
 			page = expand(zone, page, index, order, curr_order, area);
+			if (rsv && (rsv->avail >= (1 << order))) {
+static int foo; if (foo++ < 5) printk("alloc from reserv: %p %u %u\n", rsv, rsv->avail, rsv->used);
+				if (!rsv->used && (rsv->flags & RSV_MULTISHOT)) {
+static int foo; if (foo++ < 5) printk("multishot reserv: %p\n", rsv);
+					list_add(&rsv->list, &zone->depleted_rsv_list);
+					rsv->avail -= 1 << order;
+					rsv->used += 1 << order;
+}
+				zone->reserved_pages--;
+			}
 			spin_unlock_irqrestore(&zone->lock, flags);
 
 			set_page_count(page, 1);
@@ -265,6 +323,7 @@
 	for (;;) {
 		zone_t *z = *(zone++);
 		unsigned long water_mark;
+		unsigned long free_pages;
 
 		if (!z)
 			break;
@@ -275,6 +334,8 @@
 		 * We allocate if the number of free + inactive_clean
 		 * pages is above the watermark.
 		 */
+		free_pages = z->free_pages - z->reserved_pages;
+
 		switch (limit) {
 			default:
 			case PAGES_MIN:
@@ -287,14 +348,14 @@
 				water_mark = z->pages_high;
 		}
 
-		if (z->free_pages + z->inactive_clean_pages > water_mark) {
+		if (free_pages + z->inactive_clean_pages > water_mark) {
 			struct page *page = NULL;
 			/* If possible, reclaim a page directly. */
-			if (direct_reclaim && z->free_pages < z->pages_min + 8)
+			if (direct_reclaim && free_pages < z->pages_min + 8)
 				page = reclaim_page(z);
 			/* If that fails, fall back to rmqueue. */
 			if (!page)
-				page = rmqueue(z, order);
+				page = rmqueue(z, order, NULL);
 			if (page)
 				return page;
 		}
@@ -304,6 +365,8 @@
 	return NULL;
 }
 
+extern struct page *get_reserved_page (void);
+
 /*
 * This is the 'heart' of the zoned buddy allocator:
@@ -320,7 +383,7 @@
 	 * Allocations put pressure on the VM subsystem.
 	 */
 	memory_pressure++;
-	
+
 	/*
 	 * (If anyone calls gfp from interrupts nonatomically then it
 	 * will sooner or later tripped up by a schedule().)
@@ -351,11 +414,11 @@
 		if (!z->size)
 			BUG();
 
-		if (z->free_pages >= z->pages_low) {
-			page = rmqueue(z, order);
+		if (z->free_pages - z->reserved_pages >= z->pages_low) {
+			page = rmqueue(z, order, NULL);
 			if (page)
-				return page;
-		} else if (z->free_pages < z->pages_min &&
+				goto out_success;
+		} else if (z->free_pages - z->reserved_pages < z->pages_min &&
 					waitqueue_active(&kreclaimd_wait)) {
 				wake_up_interruptible(&kreclaimd_wait);
 		}
@@ -371,7 +434,7 @@
 	 */
 	page = __alloc_pages_limit(zonelist, order, PAGES_HIGH, direct_reclaim);
 	if (page)
-		return page;
+		goto out_success;
 
 	/*
 	 * Then try to allocate a page from a zone with more
@@ -383,7 +446,7 @@
 	 */
 	page = __alloc_pages_limit(zonelist, order, PAGES_LOW, direct_reclaim);
 	if (page)
-		return page;
+		goto out_success;
 
 	/*
 	 * OK, none of the zones on our zonelist has lots
@@ -418,8 +481,42 @@
 	 */
 	page = __alloc_pages_limit(zonelist, order, PAGES_MIN, direct_reclaim);
 	if (page)
-		return page;
+		goto out_success;
 
+	/* Memory reservation hook.  Note: memory reservations are
+	 * attempted after all other normal means of allocations have
+	 * failed.  Give it a try with the memory reservation and see
+	 * what happens.
+	 * TODO: with memory reservations in place, much of the code
+	 * below is completely bogus.  Clean this up! -ben
+	 */
+	if (current->page_reservations || (gfp_mask & __GFP_HIGH)) {
+		struct page_reservation *rsv;
+
+		if (gfp_mask & __GFP_HIGH)
+			rsv = &atomic_rsv;
+		else
+			rsv = current->page_reservations;
+
+		do {
+			zone = zonelist->zones;
+			for (;;) {
+				zone_t *z = *(zone++);
+				if (!z)
+					break;
+
+				if (z == rsv->zone) {
+static int foo; if (foo++ < 5) printk("trying reservation: %p\n", current->page_reservations);
+					page = rmqueue(z, order, rsv);
+					if (page)
+						goto out_success;
+					break;
+				}
+			}
+		} while (rsv == &atomic_rsv &&
+			 (rsv = current->page_reservations));
+	}
+
 	/*
 	 * If we dont want to try too hard then we can give up
 	 * now
@@ -465,9 +562,9 @@
 					break;
 				__free_page(page);
 				/* Try if the allocation succeeds.
 				 */
-				page = rmqueue(z, order);
+				page = rmqueue(z, order, NULL);
 				if (page)
-					return page;
+					goto out_success;
 			}
 		}
 	}
@@ -485,6 +582,11 @@
 			memory_pressure++;
 			try_to_free_pages(gfp_mask);
 			wakeup_bdflush(0);
+			if ((gfp_mask & __GFP_WAIT) && !(current->flags & PF_ATOMICALLOC)) {
+				__set_current_state(TASK_RUNNING);
+				current->policy |= SCHED_YIELD;
+				schedule();
+			}
 			goto try_again;
 		}
 	}
@@ -511,31 +613,25 @@
 		if (direct_reclaim) {
 			page = reclaim_page(z);
 			if (page)
-				return page;
+				goto out_success;
 		}
 
 		/* XXX: is pages_min/4 a good amount to reserve for this? */
-		if (z->free_pages < z->pages_min / 4 &&
+		if (z->free_pages - z->reserved_pages < z->pages_min / 4 &&
 		    !((current->flags & PF_MEMALLOC) &&
 		      (gfp_mask & __GFP_WAIT)))
 			continue;
-		page = rmqueue(z, order);
+		page = rmqueue(z, order, NULL);
 		if (page)
-			return page;
+			goto out_success;
 	}
 
-	// okay - we are in trouble, lets go to the DMA pool directly:
-
-	{
-		zone_t *z = pgdat_list->node_zones;
-
-		page = rmqueue(z, order);
-		if (page)
-			return page;
-	}
 	/* No luck..
	 */
 	printk(KERN_INFO "__alloc_pages: %lu-order allocation failed.\n", order);
 	return NULL;
+
+out_success:
+	return page;
 }
 
 /*
@@ -588,7 +684,7 @@
 	sum = 0;
 	while (pgdat) {
 		for (zone = pgdat->node_zones; zone < pgdat->node_zones + MAX_NR_ZONES; zone++)
-			sum += zone->free_pages;
+			sum += zone->free_pages - zone->reserved_pages;
 		pgdat = pgdat->node_next;
 	}
 	return sum;
@@ -605,7 +701,8 @@
 	sum = 0;
 	pgdat = pgdat_list;
 	while (pgdat) {
-		sum += (pgdat->node_zones+zone_type)->free_pages;
+		zone_t *z = pgdat->node_zones+zone_type;
+		sum += z->free_pages - z->reserved_pages;
 		pgdat = pgdat->node_next;
 	}
 	return sum;
@@ -694,6 +791,7 @@
 	while (pgdat) {
 		pages += pgdat->node_zones[ZONE_HIGHMEM].free_pages;
+		pages -= pgdat->node_zones[ZONE_HIGHMEM].reserved_pages;
 		pgdat = pgdat->node_next;
 	}
 	return pages;
@@ -723,13 +821,37 @@
 		freepages.low,
 		freepages.high);
 
+	printk( "%8u %8u %8u %8u %8u %8u\n"
+		"%8u %8u %8u %8u %8u %8u\n"
+		"%8u %8u %8u %8u %8u %8u\n",
+	page_launder_calls,
+	page_launder_scans,
+	page_launder_scanned_pages,
+	page_launder_scanned_not_dirty,
+	page_launder_scanned_active,
+	page_launder_scanned_skipped,
+	page_launder_scanned_dirty,
+	page_launder_scanned_dirty_skip,
+	page_launder_scanned_dirty_nolaunder,
+	page_launder_scanned_dirty_swapcache,
+	page_launder_scanned_buffers,
+	page_launder_scanned_buffers_refiled,
+	page_launder_scanned_buffers_freed_buffer,
+	page_launder_scanned_buffers_active,
+	page_launder_scanned_buffers_cleaned,
+	page_launder_scanned_buffers_released,
+	page_launder_scanned_mapped_clean,
+	page_launder_scanned_still_active
+);
+
 	for (type = 0; type < MAX_NR_ZONES; type++) {
 		struct list_head *head, *curr;
 		zone_t *zone = pgdat->node_zones + type;
-		unsigned long nr, total, flags;
+		unsigned long nr, total, flags, reserved;
 
-		total = 0;
+		reserved = total = 0;
 		if (zone->size) {
+			printk("Zone %s: ", zone->name);
 			spin_lock_irqsave(&zone->lock, flags);
 			for (order = 0; order < MAX_ORDER; order++) {
 				head = &(zone->free_area + order)->free_list;
@@ -745,9 +867,10 @@
 				printk("%lu*%lukB ", nr,
 						(PAGE_SIZE>>10) << order);
 			}
+			reserved = zone->reserved_pages;
 			spin_unlock_irqrestore(&zone->lock, flags);
 		}
-		printk("= %lukB)\n", total * (PAGE_SIZE>>10));
+		printk("= %lukB) Reserved: %lukB\n", total * (PAGE_SIZE>>10), reserved * (PAGE_SIZE>>10));
 	}
 
 #ifdef SWAP_CACHE_INFO
@@ -901,8 +1024,11 @@
 		zone->lock = SPIN_LOCK_UNLOCKED;
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->reserved_pages = 0;
 		zone->inactive_clean_pages = 0;
 		zone->inactive_dirty_pages = 0;
+		zone->reserved_pages = 0;
+		INIT_LIST_HEAD(&zone->depleted_rsv_list);
 		memlist_init(&zone->inactive_clean_list);
 		if (!size)
 			continue;
@@ -914,7 +1040,7 @@
 			mask = zone_balance_min[j];
 		else if (mask > zone_balance_max[j])
 			mask = zone_balance_max[j];
-		zone->pages_min = mask;
+		zone->pages_min = 0;
 		zone->pages_low = mask*2;
 		zone->pages_high = mask*3;
 		/*
@@ -928,7 +1054,7 @@
 		 * for people who require it to catch load spikes in eg.
 		 * gigabit ethernet routing...
 		 */
-		freepages.min += mask;
+		freepages.min = 0;
 		freepages.low += mask*2;
 		freepages.high += mask*3;
 		zone->zone_mem_map = mem_map + offset;
@@ -955,12 +1081,16 @@
 			bitmap_size = size >> i;
 			bitmap_size = (bitmap_size + 7) >> 3;
 			bitmap_size = LONG_ALIGN(bitmap_size);
-			bitmap_size *= 2;
+			bitmap_size *= 1;
 			zone->free_area[i].map =
 			  (unsigned int *) alloc_bootmem_node(pgdat, bitmap_size);
 		}
 	}
 	build_zonelists(pgdat);
+
+	init_page_reservation(&irq_rsv, RSV_MULTISHOT, ZONE_NORMAL);
+	init_page_reservation(&swap_rsv, RSV_MULTISHOT, ZONE_NORMAL);
+	init_page_reservation(&atomic_rsv, RSV_MULTISHOT, ZONE_NORMAL);
 }
 
 void __init free_area_init(unsigned long *zones_size)
@@ -977,6 +1107,101 @@
 	for (j = 0; j < MAX_NR_ZONES; j++) printk("%d ", zone_balance_ratio[j]);
 	printk("\n");
 	return 1;
+}
+
+void init_page_reservation(struct page_reservation *rsv, int flags, int zone)
+{
+static int foo; if (foo++ < 5) printk("init_page_reservation(%p, %d, %d)\n", rsv, flags, zone);
+	INIT_LIST_HEAD(&rsv->list);
+	rsv->avail = 0;
+	rsv->used = 0;
+	rsv->flags = flags;
+
+	/* FIXME: This doesn't work properly on NUMA or multizoned setups.
+	 */
+	rsv->zone = pgdat_list->node_zones + zone;
+}
+
+void destroy_page_reservation(struct page_reservation *rsv)
+{
+	unsigned long flags;
+	zone_t *zone = rsv->zone;
+static int foo; if (foo++ < 5) printk("destroy_page_reservation(%p)\n", rsv);
+
+	spin_lock_irqsave(&zone->lock, flags);
+	zone->reserved_pages -= rsv->avail;
+	list_del_init(&rsv->list);	/* This relies on list_del_init being used */
+	spin_unlock_irqrestore(&zone->lock, flags);
+	memset(rsv, 0x57, sizeof(*rsv));
+}
+
+int reserve_pages(struct page_reservation *rsv, int gfp_mask, unsigned count)
+{
+	unsigned long flags, free_pages;
+	zone_t *zone = rsv->zone;
+	unsigned orig = count;
+	int tries = 5;
+static int foo; if (foo++ < 5) printk("reserve_pages(%p, %d, %u)\n", rsv, gfp_mask, count);
+
+	spin_lock_irqsave(&zone->lock, flags);
+	free_pages = zone->free_pages - zone->reserved_pages;
+	if (free_pages > count)
+		free_pages = count;
+	count -= free_pages;
+	zone->reserved_pages += free_pages;
+
+	rsv->used += count;
+	if (count)
+		zone->pages_min++;
+	list_del_init(&rsv->list);
+	if (rsv->used)
+		list_add(&rsv->list, zone->depleted_rsv_list.prev);
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	while (--tries && rsv->used) {
+		try_to_free_pages(gfp_mask);
+		if ((gfp_mask & __GFP_WAIT) && !(current->flags & PF_ATOMICALLOC)) {
+			__set_current_state(TASK_RUNNING);
+			current->policy |= SCHED_YIELD;
+			schedule();
+		}
+	}
+
+	if (count) {
+		spin_lock_irqsave(&zone->lock, flags);
+		zone->pages_min--;
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+
+	if (!rsv->used)
+		return 0;
+
+	put_reserved_pages(rsv, orig);
+	return -ENOMEM;
+}
+
+void put_reserved_pages(struct page_reservation *rsv, unsigned count)
+{
+	unsigned long flags;
+	zone_t *zone = rsv->zone;
+static int foo; if (foo++ < 5) printk("put_reserved_pages(%p, %u)\n", rsv, count);
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (rsv->used <= count) {
+		count -= rsv->used;
+		rsv->used = 0;
+	} else {
+		rsv->used -= count;
+		count = 0;
+	}
+
+	if (count > rsv->avail)
+		BUG();
+
+	rsv->avail -= count;
+	zone->reserved_pages -= count;
+	spin_unlock_irqrestore(&zone->lock, flags);
}
 
 __setup("memfrac=", setup_mem_frac);
diff --exclude=.* --exclude=*.[^ch]* -urN kernel-2.4.3-works/linux.orig/mm/vmscan.c linux-toolbox-current/mm/vmscan.c
--- kernel-2.4.3-works/linux.orig/mm/vmscan.c	Thu May 10 16:07:27 2001
+++ linux-toolbox-current/mm/vmscan.c	Mon May 14 13:30:26 2001
@@ -21,9 +21,32 @@
 #include <linux/init.h>
 #include <linux/highmem.h>
 #include <linux/file.h>
+#include <linux/mm/reservation.h>
 
 #include <asm/pgalloc.h>
 
+extern struct page_reservation swap_rsv;
+
+int page_launder_calls,
+	page_launder_scans,
+	page_launder_scanned_pages,
+	page_launder_scanned_not_dirty,
+	page_launder_scanned_active,
+	page_launder_scanned_skipped,
+	page_launder_scanned_dirty,
+	page_launder_scanned_dirty_skip,
+	page_launder_scanned_dirty_nolaunder,
+	page_launder_scanned_dirty_swapcache,
+	page_launder_scanned_buffers,
+	page_launder_scanned_buffers_refiled,
+	page_launder_scanned_buffers_freed_buffer,
+	page_launder_scanned_buffers_active,
+	page_launder_scanned_buffers_cleaned,
+	page_launder_scanned_buffers_released,
+	page_launder_scanned_mapped_clean,
+	page_launder_scanned_still_active
+;
+
 /*
 * The swap-out function returns 1 if it successfully
 * scanned all the pages it was asked to (`count').
@@ -230,6 +253,14 @@
 {
 	unsigned long address;
 	struct vm_area_struct* vma;
+	unsigned long min_rss = atomic_read(&mm->in_fault_count);
+
+	min_rss *= 64;
+	if (min_rss > 256)
+		min_rss = 256;
+
+	if (mm->rss <= min_rss)
+		return;
 
 	/*
 	 * Go through process' page directory.
@@ -466,11 +497,18 @@
 	flushed_pages = 0;
 	freed_pages = 0;
+	spin_lock(&pagemap_lru_lock);
+	page_launder_calls++;
+	spin_unlock(&pagemap_lru_lock);
+
 dirty_page_rescan:
 	spin_lock(&pagemap_lru_lock);
+	page_launder_scans++;
+
 	maxscan = nr_inactive_dirty_pages;
 	while ((page_lru = inactive_dirty_list.prev) != &inactive_dirty_list &&
 			maxscan-- > 0) {
+		page_launder_scanned_pages++;
 		page = list_entry(page_lru, struct page, lru);
 		zone = page->zone;
@@ -481,6 +519,7 @@
 			list_del(page_lru);
 			nr_inactive_dirty_pages--;
 			zone->inactive_dirty_pages--;
+			page_launder_scanned_not_dirty++;
 			continue;
 		}
@@ -490,6 +529,7 @@
 				page_ramdisk(page)) {
 			del_page_from_inactive_dirty_list(page);
 			add_page_to_active_list(page);
+			page_launder_scanned_active++;
 			continue;
 		}
@@ -505,7 +545,7 @@
 		if (launder_loop && !maxlaunder)
 			break;
 		if (launder_loop && zone->inactive_clean_pages +
-				zone->free_pages > zone->pages_high)
+				zone->free_pages - zone->reserved_pages > zone->pages_high)
 			goto skip_page;

 		/*
@@ -514,6 +554,7 @@
 		 */
 		if (TryLockPage(page)) {
 skip_page:
+			page_launder_scanned_skipped++;
 			list_del(page_lru);
 			list_add(page_lru, &inactive_dirty_list);
 			continue;
 		}
@@ -524,13 +565,19 @@
 		 * last copy..
 		 */
 		if (PageDirty(page)) {
+			struct page_reservation *saved_rsv;
 			int (*writepage)(struct page *) = page->mapping->a_ops->writepage;

-			if (!writepage || !can_get_io_locks)
+			page_launder_scanned_dirty++;
+
+			if (!writepage || !can_get_io_locks) {
+				page_launder_scanned_dirty_skip++;
 				goto page_active;
+			}

 			/* First time through? Move it to the back of the list */
 			if (!launder_loop) {
+				page_launder_scanned_dirty_nolaunder++;
 				list_del(page_lru);
 				list_add(page_lru, &inactive_dirty_list);
 				UnlockPage(page);
@@ -542,10 +589,16 @@
 			page_cache_get(page);
 			spin_unlock(&pagemap_lru_lock);

+			saved_rsv = current->page_reservations;
+			current->page_reservations = &swap_rsv;
 			writepage(page);
+			current->page_reservations = saved_rsv;
+			/* XXX: all ->writepage()s should use nr_async_pages */
 			if (!PageSwapCache(page))
 				flushed_pages++;
+			else
+				page_launder_scanned_dirty_swapcache++;
 			maxlaunder--;

 			page_cache_release(page);
@@ -565,6 +618,7 @@
 		 */
 		if (page->buffers) {
 			int wait, clearedbuf;
+			page_launder_scanned_buffers++;
 			/*
 			 * Since we might be doing disk IO, we have to
 			 * drop the spinlock and take an extra reference
@@ -594,21 +648,25 @@

 			/* The buffers were not freed. */
 			if (!clearedbuf) {
+				page_launder_scanned_buffers_refiled++;
 				add_page_to_inactive_dirty_list(page);
 				if (wait)
 					flushed_pages++;

 			/* The page was only in the buffer cache. */
 			} else if (!page->mapping) {
+				page_launder_scanned_buffers_freed_buffer++;
 				atomic_dec(&buffermem_pages);
 				freed_pages++;

 			/* The page has more users besides the cache and us. */
 			} else if (page_count(page) > 2) {
+				page_launder_scanned_buffers_active++;
 				add_page_to_active_list(page);

 			/* OK, we "created" a freeable page. */
 			} else /* page->mapping && page_count(page) == 2 */ {
+				page_launder_scanned_buffers_cleaned++;
 				add_page_to_inactive_clean_list(page);
 				freed_pages++;
 			}
@@ -618,10 +676,12 @@
 			 * We can only do it here because we are accessing
 			 * the page struct above.
 			 */
+			page_launder_scanned_buffers_released++;
 			UnlockPage(page);
 			page_cache_release(page);

 		} else if (page->mapping && !PageDirty(page)) {
+			page_launder_scanned_mapped_clean++;
 			/*
 			 * If a page had an extra reference in
 			 * deactivate_page(), we will find it here.
@@ -634,6 +694,7 @@
 			freed_pages++;

 		} else {
+			page_launder_scanned_still_active++;
 page_active:
 			/*
 			 * OK, we don't know what to do with the page.
@@ -660,6 +721,13 @@
 	 */
 	shortage = free_shortage();
 	if (can_get_io_locks && !launder_loop && shortage) {
+		if (gfp_mask & __GFP_WAIT) {
+			__set_current_state(TASK_RUNNING);
+			current->policy |= SCHED_YIELD;
+			schedule();
+		}
+
+		shortage = free_shortage();
 		launder_loop = 1;

 		/*
@@ -835,10 +903,11 @@
 	for(i = 0; i < MAX_NR_ZONES; i++) {
 		zone_t *zone = pgdat->node_zones + i;
 		if (zone->size && (zone->inactive_clean_pages +
-				zone->free_pages < zone->pages_min+1)) {
+				zone->free_pages - zone->reserved_pages < zone->pages_min+1)) {
 			/* + 1 to have overlap with alloc_pages() !! */
 			sum += zone->pages_min + 1;
 			sum -= zone->free_pages;
+			sum += zone->reserved_pages;
 			sum -= zone->inactive_clean_pages;
 		}
 	}
@@ -881,6 +950,7 @@
 			zone_shortage -= zone->inactive_dirty_pages;
 			zone_shortage -= zone->inactive_clean_pages;
 			zone_shortage -= zone->free_pages;
+			zone_shortage += zone->reserved_pages;
 			if (zone_shortage > 0)
 				shortage += zone_shortage;
 		}
@@ -1009,6 +1079,7 @@
 int kswapd(void *unused)
 {
+	static struct page_reservation kswapd_rsv;
 	struct task_struct *tsk = current;

 	tsk->session = 1;
@@ -1016,6 +1087,11 @@
 	strcpy(tsk->comm, "kswapd");
 	sigfillset(&tsk->blocked);
 	kswapd_task = tsk;
+
+	init_page_reservation(&kswapd_rsv, RSV_MULTISHOT, ZONE_NORMAL);
+	if (reserve_pages(&kswapd_rsv, GFP_KERNEL, 32))
+		panic("kswapd unable to reserve emergency pages!\n");
+	tsk->page_reservations = &kswapd_rsv;

 	/*
 	 * Tell the memory management that we're a "memory allocator",
@@ -1086,6 +1162,8 @@
 				oom_kill();
 		}
 	}
+
+	destroy_page_reservation(&kswapd_rsv);
 }

 void wakeup_kswapd(void)
@@ -1102,6 +1180,10 @@
 int try_to_free_pages(unsigned int gfp_mask)
 {
 	int ret = 1;
+	struct page_reservation *rsv = current->page_reservations;
+
+	if (!rsv)
+		current->page_reservations = &swap_rsv;

 	if (gfp_mask & __GFP_WAIT) {
 		unsigned long caller_memalloc = current->flags & PF_MEMALLOC;
@@ -1111,6 +1193,8 @@
 		current->flags |= caller_memalloc;
 	}

+	current->page_reservations = rsv;
+
 	return ret;
 }
@@ -1151,7 +1235,7 @@
 		if (!zone->size)
 			continue;

-		while (zone->free_pages < zone->pages_low) {
+		while (zone->free_pages - zone->reserved_pages < zone->pages_low) {
 			struct page * page;

 			page = reclaim_page(zone);
 			if (!page)

^ permalink raw reply	[flat|nested] 64+ messages in thread
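The recurring one-line change in the patch — comparing `free_pages - reserved_pages` against the watermarks instead of `free_pages` alone — is easy to model. The sketch below uses an invented `struct zone_sketch` standing in for the kernel's `zone_t`, and only illustrates why a zone whose remaining "free" pages are all reserved still counts as short:

```c
#include <assert.h>

/* Toy model of the patch's watermark change: reserved pages no longer
 * count as free.  zone_sketch is an invented miniature of zone_t. */
struct zone_sketch {
        long free_pages;
        long reserved_pages;
        long pages_low;
};

/* Before the patch: free_pages < pages_low.
 * After the patch:  free_pages - reserved_pages < pages_low, so memory
 * held back for emergency allocations can no longer satisfy the
 * watermark on behalf of ordinary allocators. */
static int zone_is_short(const struct zone_sketch *z)
{
        return z->free_pages - z->reserved_pages < z->pages_low;
}
```

With a 32-page reservation (the amount kswapd reserves in the patch), a zone holding 40 free pages and a `pages_low` of 16 is short under the new test but would have looked healthy under the old one.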
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  0:52           ` Ben LaHaise
@ 2001-05-26  1:27             ` Andrea Arcangeli
  2001-05-26  1:38               ` Ben LaHaise
  0 siblings, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26  1:27 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 08:52:50PM -0400, Ben LaHaise wrote:
> On Sat, 26 May 2001, Andrea Arcangeli wrote:
>
> > Please merge this one in 2.4 for now (originally from Ingo, I only
> > improved it); this is a real, definitive fix, and there's no nicer way
> > to handle it unless you want to generalize an API for people to create
> > private anti-deadlock ("make sure to always make progress") memory
> > pools:
>
> Alternatively, the following might be more interesting...

Side note: it won't compile against pre6. It is against the RH kernel,
which has the tux stuff (PF_ATOMICALLOC in this case) in it; that is
also included in 2.4.5pre6aa1 if you apply the optional patches in the
30_tux directory.

I only had a short look, but unless I'm missing something it cannot fix
anything in the highmem area. You are not reserving anything for highmem
bounces in particular; you're reserving for swap/irq/atomic allocations
in general, or for some task in particular. That is just broken design,
I think, because you will fail anyway eventually: the layers stacked
under those allocations give you no guarantee of making progress. The
pool must live in the very last layer of allocations if you want any
guarantee of eventually making progress. create_bounces() is in fact
called as the very last thing before the atomic stuff. The loop device
can deadlock as well, since it would need its own private pool after
create_bounces() in the case of a non-noop transfer function (but loop
is not a showstopper, so we haven't addressed that yet; create_bounces()
itself is a showstopper).

The only case where a reserved pool makes sense is when you know that by
waiting (i.e. running a task queue, scheduling, and trying the
allocation again later) you will eventually succeed for sure (possibly
with a FIFO policy to make starvation impossible as well, not only
deadlocks). If you don't have that guarantee, those pools automatically
become nothing but a source-code and runtime waste, and they may even
hide core bugs in the allocator or the like.

Andrea
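Andrea's criterion — a reserved pool only helps when waiting is guaranteed to succeed eventually — can be sketched as a fixed-size pool whose elements always come back on free, so a blocked allocator is guaranteed forward progress once any current user completes (this is essentially the idea that later shipped in Linux as mempool_t). The names below are invented for the sketch, not taken from either patch:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

/* A toy fixed-size pool: every object handed out is guaranteed to be
 * returned, so a waiter that retries after any free always makes
 * progress -- the property Andrea demands of a reserved pool. */
struct toy_pool {
        void *elem[POOL_SIZE];
        int nr;                    /* elements currently available */
};

static void *toy_pool_alloc(struct toy_pool *p)
{
        /* The kernel would sleep and retry here; the toy just fails. */
        return p->nr > 0 ? p->elem[--p->nr] : NULL;
}

static void toy_pool_free(struct toy_pool *p, void *obj)
{
        assert(p->nr < POOL_SIZE); /* pool objects always fit back */
        p->elem[p->nr++] = obj;    /* this refill is the progress guarantee */
}
```

The guarantee holds only because pool objects cannot leak to users who never free them — exactly the stacking problem Andrea is pointing at: if a lower layer can sit on a pool object forever, the eventual-progress argument collapses.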
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  1:27             ` Andrea Arcangeli
@ 2001-05-26  1:38               ` Ben LaHaise
  2001-05-26  1:49                 ` Andrea Arcangeli
  0 siblings, 1 reply; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26  1:38 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> Side note: it won't compile against pre6. It is against the RH kernel,
> which has the tux stuff (PF_ATOMICALLOC in this case) in it; that is
> also included in 2.4.5pre6aa1 if you apply the optional patches in the
> 30_tux directory.

Yeah, I know. The point is I wanted to know if people were interested in
the approach. Man, nobody's capable of being a visionary anymore.

> I only had a short look, but unless I'm missing something it cannot fix
> anything in the highmem area. You are not reserving anything for highmem
> bounces in particular; you're reserving for swap/irq/atomic allocations
> in general, or for some task in particular. That is just broken design,
> I think, because you will fail anyway eventually: the layers stacked
> under those allocations give you no guarantee of making progress. The
> pool must live in the very last layer of allocations if you want any
> guarantee of eventually making progress. create_bounces() is in fact
> called as the very last thing before the atomic stuff. The loop device
> can deadlock as well, since it would need its own private pool after
> create_bounces() in the case of a non-noop transfer function (but loop
> is not a showstopper, so we haven't addressed that yet; create_bounces()
> itself is a showstopper).

You're missing a few subtle points:

	1. reservations are against a specific zone
	2. try_to_free_pages uses the swap reservation
	3. irqs can no longer eat memory allocations that are needed
	   for swap

Note that with this patch the current garbage in the zone structure with
pages_min (which doesn't work reliably) becomes obsolete.

> The only case where a reserved pool makes sense is when you know that by
> waiting (i.e. running a task queue, scheduling, and trying the
> allocation again later) you will eventually succeed for sure (possibly
> with a FIFO policy to make starvation impossible as well, not only
> deadlocks). If you don't have that guarantee, those pools automatically
> become nothing but a source-code and runtime waste, and they may even
> hide core bugs in the allocator or the like.

You're completely wrong here.

		-ben
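Point 2 refers to the pattern visible in the posted patch: try_to_free_pages() temporarily points current->page_reservations at the global swap reservation, so allocations made while reclaiming can fall back on reserved pages, then restores the caller's own pointer on the way out. A toy model of that save/restore discipline follows; the types and the `current_task` stand-in are invented for the sketch:

```c
#include <assert.h>
#include <stddef.h>

struct toy_rsv { int pages; };
struct toy_task { struct toy_rsv *page_reservations; };

static struct toy_rsv swap_rsv = { 32 };
static struct toy_task current_task;     /* stand-in for "current" */

/* Mirrors the patch: borrow swap_rsv only when the task brought no
 * reservation of its own, and always restore the original pointer on
 * the way out, whatever it was. */
static void toy_try_to_free_pages(void)
{
        struct toy_rsv *rsv = current_task.page_reservations;

        if (!rsv)
                current_task.page_reservations = &swap_rsv;

        /* ... reclaim work that may allocate runs here, able to use
         * current_task.page_reservations as a fallback ... */

        current_task.page_reservations = rsv;
}
```

The restore on exit is what keeps the scheme composable: a task that entered reclaim with its own reservation (kswapd, in the patch) leaves with it intact.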
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  1:38               ` Ben LaHaise
@ 2001-05-26  1:49                 ` Andrea Arcangeli
  2001-05-26  2:01                   ` Ben LaHaise
  0 siblings, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26  1:49 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 09:38:36PM -0400, Ben LaHaise wrote:
> You're missing a few subtle points:
>
> 	1. reservations are against a specific zone

A single zone is not used for only one thing, period. In my previous
email I laid out the only conditions under which a reserved pool can
avoid a deadlock.

> 	2. try_to_free_pages uses the swap reservation

try_to_free_pages has a huge stack of layers under it, bounce
buffers/loop/nbd/whatever being just some of them.

> 	3. irqs can no longer eat memory allocations that are needed
> 	   for swap

You don't even need an irq to still deadlock.

> Note that with this patch the current garbage in the zone structure with
> pages_min (which doesn't work reliably) becomes obsolete.

The "garbage" is just a heuristic to let atomic allocations work
dynamically in the common case. Anything deadlock-related cannot rely on
pages_min.

I am talking about fixing the thing. Of course I know perfectly well
that you can hide it, but I definitely hate that kind of hiding patch.

> > The only case where a reserved pool makes sense is when you know that
> > by waiting (i.e. running a task queue, scheduling, and trying the
> > allocation again later) you will eventually succeed for sure (possibly
> > with a FIFO policy to make starvation impossible as well, not only
> > deadlocks). If you don't have that guarantee, those pools
> > automatically become nothing but a source-code and runtime waste, and
> > they may even hide core bugs in the allocator or the like.
>
> You're completely wrong here.

I don't think so.

Andrea
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  1:49                 ` Andrea Arcangeli
@ 2001-05-26  2:01                   ` Ben LaHaise
  2001-05-26  2:26                     ` Andrea Arcangeli
  0 siblings, 1 reply; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26  2:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> On Fri, May 25, 2001 at 09:38:36PM -0400, Ben LaHaise wrote:
> > You're missing a few subtle points:
> >
> > 	1. reservations are against a specific zone
>
> A single zone is not used for only one thing, period. In my previous
> email I laid out the only conditions under which a reserved pool can
> avoid a deadlock.

Well, until we come up with a better design for a zone allocator that
doesn't involve walking lists and polluting the cache all over the
place, it'll be against a single zone.

> > 	2. try_to_free_pages uses the swap reservation
>
> try_to_free_pages has a huge stack of layers under it, bounce
> buffers/loop/nbd/whatever being just some of them.

Fine, then add one to the bounce buffer allocation code; it's all of
about 3 lines added.

> > 	3. irqs can no longer eat memory allocations that are needed
> > 	   for swap
>
> You don't even need an irq to still deadlock.

I never said you didn't. But Ingo's patch DOES NOT PROTECT AGAINST
DEADLOCKS CAUSED BY INTERRUPT ALLOCATIONS. Heck, it doesn't even fix the
deadlock that started this discussion. Think about what happens when
your network interfaces start rapidly eating all of the memory kswapd is
freeing.

Also, I haven't posted this patch previously exactly because it is
against a specific kind of deadlock that nobody cares about right now,
plus it touches the core of the page allocator, which isn't necessarily
a good idea during the 2.4 timeframe. That said, the reservation concept
is generic code, which the bounce buffer patch most certainly isn't. It
can even be improved to overlap with the page cache pages in the zone,
so it isn't even really "free" ram as currently implemented.

> > Note that with this patch the current garbage in the zone structure
> > with pages_min (which doesn't work reliably) becomes obsolete.
>
> The "garbage" is just a heuristic to let atomic allocations work
> dynamically in the common case. Anything deadlock-related cannot rely
> on pages_min.
> I am talking about fixing the thing. Of course I know perfectly well
> that you can hide it, but I definitely hate that kind of hiding patch.

Re-read the above and reconsider. The reservation doesn't need to be
permitted until after page_alloc has blocked. Heck, do a schedule before
attempting to use the reservation.

> > > The only case where a reserved pool makes sense is when you know
> > > that by waiting (i.e. running a task queue, scheduling, and trying
> > > the allocation again later) you will eventually succeed for sure
> > > (possibly with a FIFO policy to make starvation impossible as well,
> > > not only deadlocks). If you don't have that guarantee, those pools
> > > automatically become nothing but a source-code and runtime waste,
> > > and they may even hide core bugs in the allocator or the like.
> >
> > You're completely wrong here.
>
> I don't think so.

Atomicity isn't what I care about. It's about being able to keep memory
around so that certain allocations can proceed, and those pools cannot
be eaten into by other tasks. Extend the concept a little further, and
you can even reclaim higher order pages by having a reservation
outstanding on that order. But I'm just dreaming.

		-ben
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  2:01                   ` Ben LaHaise
@ 2001-05-26  2:26                     ` Andrea Arcangeli
  2001-05-26  2:40                       ` Ben LaHaise
  0 siblings, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2001-05-26  2:26 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Fri, May 25, 2001 at 10:01:37PM -0400, Ben LaHaise wrote:
> On Sat, 26 May 2001, Andrea Arcangeli wrote:
>
> > On Fri, May 25, 2001 at 09:38:36PM -0400, Ben LaHaise wrote:
> > > You're missing a few subtle points:
> > >
> > > 	1. reservations are against a specific zone
> >
> > A single zone is not used for only one thing, period. In my previous
> > email I laid out the only conditions under which a reserved pool can
> > avoid a deadlock.
>
> Well, until we come up with a better design for a zone allocator that
> doesn't involve walking lists and polluting the cache all over the
> place, it'll be against a single zone.

I meant that each zone is used by more than one user; that definitely
won't change.

> > > 	2. try_to_free_pages uses the swap reservation
> >
> > try_to_free_pages has a huge stack of layers under it, bounce
> > buffers/loop/nbd/whatever being just some of them.
>
> Fine, then add one to the bounce buffer allocation code; it's all of
> about 3 lines added.

Yes, you should add it to the bounce buffers, to loop, to nbd, to
whatever, and then remove it from all the other places you put it right
now. That's why I'm saying your patch, as it was in its first revision,
won't fix anything (only hide it).

> I never said you didn't. But Ingo's patch DOES NOT PROTECT AGAINST
> DEADLOCKS CAUSED BY INTERRUPT ALLOCATIONS. Heck, it doesn't even fix
> the [deadlock that started this discussion.]

It does, but only for create_bounces(). As said, if you want to "fix",
not "hide", you need to address every single case; a generic reserved
pool is just useless. Now try to get a deadlock in 2.4.5 with tasks
locked up in create_bounces(), if you say it does not protect against
irqs. See?

> That said, the reservation concept is generic code, which the bounce
> buffer patch most certainly isn't. It can even be improved to overlap

What I said is that instead of handling the pool by hand in every single
place, one could write an API to generalize it. But very often a simple
API hooked into the page allocator may not be flexible enough to
guarantee the kind of allocations we need. Highmem is just one example
where we need more flexibility, not just for the pages but also for the
bh (though OK, that's mostly an implementation issue: if you do the math
right it's harder, but you can still use a generic API).

> with the page cache pages in the zone, so it isn't even really "free"
> ram as currently implemented.

Yes, that would be a very nice property. Again, I'm not against a
generic interface for creating reserved pools of memory (I mentioned
that possibility before reading your patch, after all). It's just the
implementation (mainly the per-task hook being overwritten) that didn't
convince me, and the usage, which was at least apparently wrong to my
eyes.

> Re-read the above and reconsider. The reservation doesn't need to be
> permitted until after page_alloc has blocked. Heck, do a schedule
> before [attempting to use the reservation.]

I don't see what you mean here, could you elaborate?

> Atomicity isn't what I care about. It's about being able to keep memory
> around so that certain allocations can proceed, and those pools cannot
> be eaten into by other tasks. [..]

Those pools need to be global, unless you want to allocate them at
fork() for every single task as you did for some of the kernel threads;
and if you make them global, per-zone or per-whatever, not every single
case is covered and it will deadlock. Or rather, it will work by luck:
it proceeds until you hit a case where you needed not 32 but 33 reserved
pages to avoid the deadlock. It's along the same lines as the pci_map_*
brokenness, in some sense.

Allocating those pools per task is just a total waste; those are
"safety" pools. In 99% of cases you won't need them and will allocate
the memory dynamically; they just need to be there for correctness, to
avoid the deadlock in the rare case.

Andrea
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  2:26                     ` Andrea Arcangeli
@ 2001-05-26  2:40                       ` Ben LaHaise
  0 siblings, 0 replies; 64+ messages in thread
From: Ben LaHaise @ 2001-05-26  2:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Alan Cox, Rik van Riel, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> It does, but only for create_bounces(). As said, if you want to "fix",
> not "hide", you need to address every single case; a generic reserved
> pool is just useless. Now try to get a deadlock in 2.4.5 with tasks
> locked up in create_bounces(), if you say it does not protect against
> irqs. See?

So? It just deadlocks in create_buffers/alloc_pages. We finished
debugging that weeks ago, and I'm not interested in repeating that.

> What I said is that instead of handling the pool by hand in every single
> place, one could write an API to generalize it. But very often a simple
> API hooked into the page allocator may not be flexible enough to
> guarantee the kind of allocations we need. Highmem is just one example
> where we need more flexibility, not just for the pages but also for the
> bh (though OK, that's mostly an implementation issue: if you do the math
> right it's harder, but you can still use a generic API).

Well, this is the infrastructure for guaranteeing atomic allocations.
The only beautifying it really needs is the addition of an alloc_pages
variant that takes the reservation pool as a parameter, instead of the
current mucking with current->page_reservations.

> > Re-read the above and reconsider. The reservation doesn't need to be
> > permitted until after page_alloc has blocked. Heck, do a schedule
> > before [attempting to use the reservation.]
>
> I don't see what you mean here, could you elaborate?

I simply meant that the reservation code wasn't bent on providing atomic
allocations for non-atomic allocators. IIRC, I hooked into __alloc_pages
after the normal mechanism of allocating pages had failed, but where it
may have already slept; I think that was part of the other patch I
posted in the last email. Our tree contains a lot of patches, and some
of the effects like this are built on top of each other, so I'm not
surprised that a critical piece like that was missing.

		-ben
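The variant Ben describes — passing the reservation explicitly rather than swapping current->page_reservations around the call — might look like the toy below: try the normal free lists first and touch the reservation only when they fail, matching his remark that the reservation need not be permitted until after the normal path has given up. Every name and the allocator itself are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

struct toy_rsv { int pages; };

static int toy_free_pages;    /* stand-in for the zone free lists */

/* Hypothetical alloc_pages-style helper taking the reservation as a
 * parameter: the reservation is a last resort, never the fast path.
 * Returns 1 on success, 0 where the kernel would sleep or fail. */
static int toy_alloc_page(struct toy_rsv *rsv)
{
        if (toy_free_pages > 0) {     /* normal path: the free lists */
                toy_free_pages--;
                return 1;
        }
        if (rsv && rsv->pages > 0) {  /* only now dip into the reservation */
                rsv->pages--;
                return 1;
        }
        return 0;
}
```

Keeping the reservation as an explicit parameter also makes the ownership question Andrea raises visible at every call site, instead of hiding it in task state.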
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-26  0:42         ` [with-PATCH-really] highmem deadlock removal, balancing & cleanup Andrea Arcangeli
  2001-05-26  0:52           ` Ben LaHaise
@ 2001-05-26  1:43           ` Rik van Riel
  1 sibling, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2001-05-26  1:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, Alan Cox, linux-kernel

On Sat, 26 May 2001, Andrea Arcangeli wrote:

> Please merge this one in 2.4 for now (originally from Ingo, I only
> improved it); this is a real, definitive fix

With the only minor detail being that it DOESN'T WORK.

You're not solving the problem of GFP_BUFFER allocators looping forever
in __alloc_pages(); the deadlock can still happen. You've only solved
the one specific case of highmem.c getting a page for bounce buffers,
but you'll happily let the thing deadlock while trying to get buffer
heads for a normal low memory page!

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-25 21:12 ` Linus Torvalds
  2001-05-25 22:20   ` Rik van Riel
@ 2001-05-25 22:35   ` Rik van Riel
  2001-05-25 23:07     ` Linus Torvalds
  1 sibling, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2001-05-25 22:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, 25 May 2001, Linus Torvalds wrote:
> On Fri, 25 May 2001, Rik van Riel wrote:
> >
> > OK, shoot me. Here it is again, this time _with_ patch...
>
> I'm not going to apply this as long as it plays experimental
> games with "shrink_icache()" and friends. I haven't seen anybody
> comment on the performance on this, and I can well imagine that
> it would potentially shrink the dcache too aggressively when
> there are lots of inactive-dirty pages around, where
> page_launder is the right thing to do, and shrinking
> icache/dcache might not be.

Your analysis exactly describes what happens in your current code,
though I have to admit that my patch doesn't change it.

Without the patch my workstation (with ~180MB RAM) usually has between
50 and 80MB of inode/dentry cache, and real usable stuff gets swapped
out.

Either you can go make Linux 2.4 usable again for normal people, or you
can go buy us all a gig of ram.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/
* Re: [with-PATCH-really] highmem deadlock removal, balancing & cleanup
  2001-05-25 22:35   ` Rik van Riel
@ 2001-05-25 23:07     ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2001-05-25 23:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

On Fri, 25 May 2001, Rik van Riel wrote:
>
> Without the patch my workstation (with ~180MB RAM) usually has
> between 50 and 80MB of inode/dentry cache and real usable stuff
> gets swapped out.

All I want is more people giving feedback. It's clear that neither my
nor your machine is a good thing to base things on.

		Linus
end of thread, other threads:[~2001-06-05  3:58 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-05-25 20:00 [with-PATCH-really] highmem deadlock removal, balancing & cleanup Rik van Riel
2001-05-25 21:12 ` Linus Torvalds
2001-05-25 22:20 ` Rik van Riel
2001-05-25 23:05 ` Linus Torvalds
2001-05-25 23:13 ` Alan Cox
2001-05-25 23:19 ` Rik van Riel
2001-05-26  0:02 ` Linus Torvalds
2001-05-26  0:07 ` Rik van Riel
2001-05-26  0:16 ` Linus Torvalds
2001-05-26  0:23 ` Linus Torvalds
2001-05-26  0:26 ` Rik van Riel
2001-05-26  0:30 ` Linus Torvalds
2001-05-26  0:29 ` Ben LaHaise
2001-05-26  0:34 ` Linus Torvalds
2001-05-26  0:38 ` Rik van Riel
2001-05-26  1:28 ` Linux-2.4.5 Linus Torvalds
2001-05-26  1:35 ` Linux-2.4.5 Rik van Riel
2001-05-26  1:39 ` Linux-2.4.5 Ben LaHaise
2001-05-26  1:59 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26  2:11 ` Linux-2.4.5 Ben LaHaise
2001-05-26  2:38 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26  2:49 ` Linux-2.4.5 Ben LaHaise
2001-05-26  3:11 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26  4:22 ` Linux-2.4.5 Linus Torvalds
2001-05-26  4:31 ` Linux-2.4.5 Rik van Riel
2001-05-26  8:10 ` Linux-2.4.5 Linus Torvalds
2001-05-26  9:01 ` Linux-2.4.5 Linus Torvalds
2001-05-26  9:18 ` Linux-2.4.5 arjan
2001-05-26 14:18 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 14:21 ` Linux-2.4.5 Rik van Riel
2001-05-26 14:38 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 14:40 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:17 ` Linux-2.4.5 Linus Torvalds
2001-05-26 15:28 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:59 ` Linux-2.4.5 Linus Torvalds
2001-05-26 22:12 ` Linux-2.4.5 Marcelo Tosatti
2001-05-27  6:53 ` Linux-2.4.5 Marcelo Tosatti
2001-06-03 23:32 ` Linux-2.4.5 Linus Torvalds
2001-06-05  2:21 ` Linux-2.4.5 Marcelo Tosatti
2001-05-26 15:09 ` Linux-2.4.5 Linus Torvalds
2001-05-26 15:18 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:24 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 15:26 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:40 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26  4:45 ` Linux-2.4.5 Rik van Riel
2001-05-26  4:47 ` Linux-2.4.5 Rik van Riel
2001-05-26  6:07 ` Linux-2.4.5 Ben LaHaise
2001-05-26 14:32 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 14:36 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:03 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 15:08 ` Linux-2.4.5 Rik van Riel
2001-05-26 15:20 ` Linux-2.4.5 Andrea Arcangeli
2001-05-26 15:41 ` Linux-2.4.5 Rik van Riel
2001-05-26  0:42 ` [with-PATCH-really] highmem deadlock removal, balancing & cleanup Andrea Arcangeli
2001-05-26  0:52 ` Ben LaHaise
2001-05-26  1:27 ` Andrea Arcangeli
2001-05-26  1:38 ` Ben LaHaise
2001-05-26  1:49 ` Andrea Arcangeli
2001-05-26  2:01 ` Ben LaHaise
2001-05-26  2:26 ` Andrea Arcangeli
2001-05-26  2:40 ` Ben LaHaise
2001-05-26  1:43 ` Rik van Riel
2001-05-25 22:35 ` Rik van Riel
2001-05-25 23:07 ` Linus Torvalds