* filesystem access slowing system to a crawl
@ 2003-02-04 9:29 Thomas Bätzler
2003-02-05 9:03 ` Denis Vlasenko
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Thomas Bätzler @ 2003-02-04 9:29 UTC (permalink / raw)
To: linux-kernel
Hi,
maybe you could help me out with a really weird problem we're having
with an NFS fileserver for a couple of webservers:
- Dual Xeon 2.2 GHz
- 6 GB RAM
- QLogic FCAL Host adapter with about 5.5 TB on several RAIDs
- Debian "woody" w/Kernel 2.4.19
Running just "find /" (or ls -R or tar on a large directory) locally
slows the box down to absolute unresponsiveness - it takes minutes
to just run ps and kill the find process. During that time, kupdated
and kswapd gobble up all available CPU time.
The system performs great otherwise, so I've ruled out a hardware
problem. It can't be a load problem because during normal operation,
the system is more or less bored out of its mind (70-90% idle time).
I'm really at the end of my wits here :-(
Any help would be greatly appreciated!
TIA,
Thomas
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: filesystem access slowing system to a crawl
2003-02-04 9:29 filesystem access slowing system to a crawl Thomas Bätzler
@ 2003-02-05 9:03 ` Denis Vlasenko
2003-02-05 9:39 ` Andrew Morton
2003-02-20 19:30 ` William Stearns
2 siblings, 0 replies; 28+ messages in thread
From: Denis Vlasenko @ 2003-02-05 9:03 UTC (permalink / raw)
To: Thomas Bätzler, linux-kernel
On 4 February 2003 11:29, Thomas Bätzler wrote:
> maybe you could help me out with a really weird problem we're having
> with an NFS fileserver for a couple of webservers:
>
> - Dual Xeon 2.2 GHz
> - 6 GB RAM
> - QLogic FCAL Host adapter with about 5.5 TB on several RAIDs
> - Debian "woody" w/Kernel 2.4.19
>
> Running just "find /" (or ls -R or tar on a large directory) locally
> slows the box down to absolute unresponsiveness - it takes minutes
> to just run ps and kill the find process. During that time, kupdated
> and kswapd gobble up all available CPU time.
>
> The system performs great otherwise, so I've ruled out a hardware
> problem. It can't be a load problem because during normal operation,
> the system is more or less bored out of its mind (70-90% idle time).
>
> I'm really at the end of my wits here :-(
>
> Any help would be greatly appreciated!
Canned response:
* does non-highmem kernel make any difference?
* does UP kernel make any difference?
* can you profile kernel while "time ls -R" is running?
* try 2.4.20 and/or .21-pre4
* tell us what you found out
--
vda
* Re: filesystem access slowing system to a crawl
2003-02-04 9:29 filesystem access slowing system to a crawl Thomas Bätzler
2003-02-05 9:03 ` Denis Vlasenko
@ 2003-02-05 9:39 ` Andrew Morton
2003-02-19 16:42 ` Marc-Christian Petersen
2003-02-20 19:30 ` William Stearns
2 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2003-02-05 9:39 UTC (permalink / raw)
To: t.baetzler; +Cc: linux-kernel
>
> Hi,
>
> maybe you could help me out with a really weird problem we're having
> with an NFS fileserver for a couple of webservers:
>
> - Dual Xeon 2.2 GHz
> - 6 GB RAM
> - QLogic FCAL Host adapter with about 5.5 TB on several RAIDs
> - Debian "woody" w/Kernel 2.4.19
>
> Running just "find /" (or ls -R or tar on a large directory) locally
> slows the box down to absolute unresponsiveness - it takes minutes
> to just run ps and kill the find process. During that time, kupdated
> and kswapd gobble up all available CPU time.
>
Could be that your "low memory" is filled up with inodes. This would
only happen in these tests if you're using ext2, and there are a *lot*
of directories.
I've prepared a lineup of Andrea's VM patches at
http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
It would be useful if you could apply 10_inode-highmem-2.patch and
report back. It applies to 2.4.19 as well, and should work OK there.
* Re: filesystem access slowing system to a crawl
2003-02-05 9:39 ` Andrew Morton
@ 2003-02-19 16:42 ` Marc-Christian Petersen
2003-02-19 17:49 ` Andrea Arcangeli
0 siblings, 1 reply; 28+ messages in thread
From: Marc-Christian Petersen @ 2003-02-19 16:42 UTC (permalink / raw)
To: Andrew Morton, t.baetzler; +Cc: linux-kernel, Andrea Arcangeli
On Wednesday 05 February 2003 10:39, Andrew Morton wrote:
Hi Andrew,
> > Running just "find /" (or ls -R or tar on a large directory) locally
> > slows the box down to absolute unresponsiveness - it takes minutes
> > to just run ps and kill the find process. During that time, kupdated
> > and kswapd gobble up all available CPU time.
> Could be that your "low memory" is filled up with inodes. This would
> only happen in these tests if you're using ext2, and there are a *lot*
> of directories.
> I've prepared a lineup of Andrea's VM patches at
> It would be useful if you could apply 10_inode-highmem-2.patch and
> report back. It applies to 2.4.19 as well, and should work OK there.
is there any reason why this (inode-highmem-2) has never been submitted for
inclusion into mainline?
ciao, Marc
* Re: filesystem access slowing system to a crawl
2003-02-19 16:42 ` Marc-Christian Petersen
@ 2003-02-19 17:49 ` Andrea Arcangeli
2003-02-20 15:29 ` Marc-Christian Petersen
2003-02-26 23:17 ` filesystem access slowing system to a crawl Marc-Christian Petersen
0 siblings, 2 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2003-02-19 17:49 UTC (permalink / raw)
To: Marc-Christian Petersen
Cc: Andrew Morton, t.baetzler, linux-kernel, Marcelo Tosatti
On Wed, Feb 19, 2003 at 05:42:34PM +0100, Marc-Christian Petersen wrote:
> On Wednesday 05 February 2003 10:39, Andrew Morton wrote:
>
> Hi Andrew,
>
> > > Running just "find /" (or ls -R or tar on a large directory) locally
> > > slows the box down to absolute unresponsiveness - it takes minutes
> > > to just run ps and kill the find process. During that time, kupdated
> > > and kswapd gobble up all available CPU time.
> > Could be that your "low memory" is filled up with inodes. This would
> > only happen in these tests if you're using ext2, and there are a *lot*
> > of directories.
> > I've prepared a lineup of Andrea's VM patches at
> > It would be useful if you could apply 10_inode-highmem-2.patch and
> > report back. It applies to 2.4.19 as well, and should work OK there.
> is there any reason why this (inode-highmem-2) has never been submitted for
> inclusion into mainline?
Marcelo please include this:
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
other fixes should be included too but they don't apply cleanly yet
unfortunately, I (or somebody else) should rediff them against mainline.
Andrea
* Re: filesystem access slowing system to a crawl
2003-02-19 17:49 ` Andrea Arcangeli
@ 2003-02-20 15:29 ` Marc-Christian Petersen
2003-02-20 18:35 ` Andrew Morton
2003-02-26 23:17 ` filesystem access slowing system to a crawl Marc-Christian Petersen
1 sibling, 1 reply; 28+ messages in thread
From: Marc-Christian Petersen @ 2003-02-20 15:29 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, t.baetzler, linux-kernel, Marcelo Tosatti
On Wednesday 19 February 2003 18:49, Andrea Arcangeli wrote:
Hi Andrea,
> Marcelo please include this:
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
great. Thanks. Now let's hope Marcelo uses this :)
> other fixes should be included too but they don't apply cleanly yet
> unfortunately, I (or somebody else) should rediff them against mainline.
Can you tell me what specifically you mean? I'd do this.
ciao, Marc
* Re: filesystem access slowing system to a crawl
2003-02-20 15:29 ` Marc-Christian Petersen
@ 2003-02-20 18:35 ` Andrew Morton
2003-02-20 21:32 ` Marc-Christian Petersen
2003-02-20 21:54 ` xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl] Andrea Arcangeli
0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2003-02-20 18:35 UTC (permalink / raw)
To: Marc-Christian Petersen; +Cc: andrea, t.baetzler, linux-kernel, marcelo
Marc-Christian Petersen <m.c.p@wolk-project.de> wrote:
>
> On Wednesday 19 February 2003 18:49, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > Marcelo please include this:
> > http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
> great. Thanks. Now let's hope Marcelo uses this :)
>
> > other fixes should be included too but they don't apply cleanly yet
> > unfortunately, I (or somebody else) should rediff them against mainline.
> Can you tell me what specifically you mean? I'd do this.
>
Andrea's VM patches, against 2.4.21-pre4 are at
http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
The applying order is in the series file.
These have been rediffed, and apply cleanly. They have not been
tested much though.
* Re: filesystem access slowing system to a crawl
2003-02-04 9:29 filesystem access slowing system to a crawl Thomas Bätzler
2003-02-05 9:03 ` Denis Vlasenko
2003-02-05 9:39 ` Andrew Morton
@ 2003-02-20 19:30 ` William Stearns
2 siblings, 0 replies; 28+ messages in thread
From: William Stearns @ 2003-02-20 19:30 UTC (permalink / raw)
To: Thomas Bätzler; +Cc: ML-linux-kernel, William Stearns
Good morning, Thomas,
On Tue, 4 Feb 2003, Thomas Bätzler wrote:
> maybe you could help me out with a really weird problem we're having
> with an NFS fileserver for a couple of webservers:
>
> - Dual Xeon 2.2 GHz
> - 6 GB RAM
> - QLogic FCAL Host adapter with about 5.5 TB on several RAIDs
> - Debian "woody" w/Kernel 2.4.19
>
> Running just "find /" (or ls -R or tar on a large directory) locally
> slows the box down to absolute unresponsiveness - it takes minutes
> to just run ps and kill the find process. During that time, kupdated
> and kswapd gobble up all available CPU time.
>
> The system performs great otherwise, so I've ruled out a hardware
> problem. It can't be a load problem because during normal operation,
> the system is more or less bored out of its mind (70-90% idle time).
>
> I'm really at the end of my wits here :-(
>
> Any help would be greatly appreciated!
I'm sure the inode problem Andrew and Andrea have pointed out is
more likely.
However, just out of interest, does the problem go away or become
less severe if you use "noatime" on that filesystem?
mount -o remount,noatime /my_raid_mount_point
?
Cheers,
- Bill
* Re: filesystem access slowing system to a crawl
2003-02-20 18:35 ` Andrew Morton
@ 2003-02-20 21:32 ` Marc-Christian Petersen
2003-02-20 21:41 ` Andrew Morton
2003-02-20 21:54 ` xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl] Andrea Arcangeli
1 sibling, 1 reply; 28+ messages in thread
From: Marc-Christian Petersen @ 2003-02-20 21:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: andrea, t.baetzler, linux-kernel, marcelo
On Thursday 20 February 2003 19:35, Andrew Morton wrote:
Hi Andrew,
> Andrea's VM patches, against 2.4.21-pre4 are at
> http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
> The applying order is in the series file.
I am afraid Marcelo will never accept these or some of them.
Or am I wrong?
ciao, Marc
* Re: filesystem access slowing system to a crawl
2003-02-20 21:32 ` Marc-Christian Petersen
@ 2003-02-20 21:41 ` Andrew Morton
2003-02-20 22:08 ` Andrea Arcangeli
0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2003-02-20 21:41 UTC (permalink / raw)
To: Marc-Christian Petersen; +Cc: andrea, t.baetzler, linux-kernel, marcelo
Marc-Christian Petersen <m.c.p@wolk-project.de> wrote:
>
> On Thursday 20 February 2003 19:35, Andrew Morton wrote:
>
> Hi Andrew,
>
> > Andrea's VM patches, against 2.4.21-pre4 are at
> > http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
> > The applying order is in the series file.
> I am afraid Marcelo will never accept these or some of them.
>
The most important one is inode-highmem. It's a safe patch, and the risk of
it causing problems due to not having other surrounding -aa stuff is low.
It's a matter of someone getting down, testing it and sending it.
Ho hum. It'll take an hour. I shall try.
* xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-20 18:35 ` Andrew Morton
2003-02-20 21:32 ` Marc-Christian Petersen
@ 2003-02-20 21:54 ` Andrea Arcangeli
2003-02-20 22:56 ` Trond Myklebust
2003-02-20 23:15 ` Andreas Dilger
1 sibling, 2 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2003-02-20 21:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: Marc-Christian Petersen, t.baetzler, linux-kernel, marcelo
On Thu, Feb 20, 2003 at 10:35:43AM -0800, Andrew Morton wrote:
> Marc-Christian Petersen <m.c.p@wolk-project.de> wrote:
> >
> > On Wednesday 19 February 2003 18:49, Andrea Arcangeli wrote:
> >
> > Hi Andrea,
> >
> > > Marcelo please include this:
> > > http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
> > great. Thanks. Now let's hope Marcelo uses this :)
> >
> > > other fixes should be included too but they don't apply cleanly yet
> > > unfortunately, I (or somebody else) should rediff them against mainline.
> > Can you tell me what specifically you mean? I'd do this.
> >
>
> Andrea's VM patches, against 2.4.21-pre4 are at
>
> http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
>
> The applying order is in the series file.
Cool!
>
> These have been rediffed, and apply cleanly. They have not been
> tested much though.
If they didn't reject in a non-obvious way they should work fine too ;)
If Marcelo merges them I'll verify everything when I update to his tree
like I do regularly with everything else that rejects.
btw, I finished today fixing a deadlock condition in the xdr layer
triggered by nfs on highmem machines. Here's the fix against 2.4.21pre4;
please apply it to pre4 now, or it will have to live in my tree with the
other hundred patches, like what happened to some of the patches we're
discussing in this thread.
Explanation is very simple: you _can't_ kmap two times in the context of
a single task (especially if more than one task can run the same code at
the same time). I don't yet have the confirmation that this fixes the
deadlock though (it takes days to reproduce so it will take weeks to
confirm), but I can't see anything else wrong at the moment, and this
remains a genuine highmem deadlock that has to be fixed. The fix is
optimal: no change unless you run out of kmaps (in which case you could
deadlock), i.e. light workloads won't be affected at all.
Note: this was developed on top of 2.4.21pre4aa3, so I had to rework it
to make it apply cleanly to mainline. The version I tested and included
in -aa is different, so this one is untested, but if it compiles it will
work like a charm ;).
2.5.62 has the very same deadlock condition in xdr triggered by nfs too.
Andrew, if you're forward-porting it yourself, like with the file-backed
vma merging feature, just let me know so we make sure not to duplicate
effort.
diff -urNp nfs-ref/include/asm-i386/highmem.h nfs/include/asm-i386/highmem.h
--- nfs-ref/include/asm-i386/highmem.h 2003-02-14 07:01:58.000000000 +0100
+++ nfs/include/asm-i386/highmem.h 2003-02-20 21:42:17.000000000 +0100
@@ -56,16 +56,19 @@ extern void kmap_init(void) __init;
#define PKMAP_NR(virt) ((virt-PKMAP_BASE) >> PAGE_SHIFT)
#define PKMAP_ADDR(nr) (PKMAP_BASE + ((nr) << PAGE_SHIFT))
-extern void * FASTCALL(kmap_high(struct page *page));
+extern void * FASTCALL(kmap_high(struct page *page, int nonblocking));
extern void FASTCALL(kunmap_high(struct page *page));
-static inline void *kmap(struct page *page)
+#define kmap(page) __kmap(page, 0)
+#define kmap_nonblock(page) __kmap(page, 1)
+
+static inline void *__kmap(struct page *page, int nonblocking)
{
if (in_interrupt())
out_of_line_bug();
if (page < highmem_start_page)
return page_address(page);
- return kmap_high(page);
+ return kmap_high(page, nonblocking);
}
static inline void kunmap(struct page *page)
diff -urNp nfs-ref/include/linux/sunrpc/xdr.h nfs/include/linux/sunrpc/xdr.h
--- nfs-ref/include/linux/sunrpc/xdr.h 2003-02-19 01:12:41.000000000 +0100
+++ nfs/include/linux/sunrpc/xdr.h 2003-02-20 21:39:51.000000000 +0100
@@ -137,7 +137,7 @@ void xdr_zero_iovec(struct iovec *, int,
* XDR buffer helper functions
*/
extern int xdr_kmap(struct iovec *, struct xdr_buf *, unsigned int);
-extern void xdr_kunmap(struct xdr_buf *, unsigned int);
+extern void xdr_kunmap(struct xdr_buf *, unsigned int, int);
extern void xdr_shift_buf(struct xdr_buf *, size_t);
/*
diff -urNp nfs-ref/mm/highmem.c nfs/mm/highmem.c
--- nfs-ref/mm/highmem.c 2002-11-29 02:23:18.000000000 +0100
+++ nfs/mm/highmem.c 2003-02-20 21:45:27.000000000 +0100
@@ -77,7 +77,7 @@ static void flush_all_zero_pkmaps(void)
flush_tlb_all();
}
-static inline unsigned long map_new_virtual(struct page *page)
+static inline unsigned long map_new_virtual(struct page *page, int nonblocking)
{
unsigned long vaddr;
int count;
@@ -96,6 +96,9 @@ start:
if (--count)
continue;
+ if (nonblocking)
+ return 0;
+
/*
* Sleep for somebody else to unmap their entries
*/
@@ -126,7 +129,7 @@ start:
return vaddr;
}
-void *kmap_high(struct page *page)
+void *kmap_high(struct page *page, int nonblocking)
{
unsigned long vaddr;
@@ -138,11 +141,15 @@ void *kmap_high(struct page *page)
*/
spin_lock(&kmap_lock);
vaddr = (unsigned long) page->virtual;
- if (!vaddr)
- vaddr = map_new_virtual(page);
+ if (!vaddr) {
+ vaddr = map_new_virtual(page, nonblocking);
+ if (!vaddr)
+ goto out;
+ }
pkmap_count[PKMAP_NR(vaddr)]++;
if (pkmap_count[PKMAP_NR(vaddr)] < 2)
BUG();
+ out:
spin_unlock(&kmap_lock);
return (void*) vaddr;
}
diff -urNp nfs-ref/net/sunrpc/xdr.c nfs/net/sunrpc/xdr.c
--- nfs-ref/net/sunrpc/xdr.c 2002-11-29 02:23:23.000000000 +0100
+++ nfs/net/sunrpc/xdr.c 2003-02-20 21:39:51.000000000 +0100
@@ -180,7 +180,7 @@ int xdr_kmap(struct iovec *iov_base, str
{
struct iovec *iov = iov_base;
struct page **ppage = xdr->pages;
- unsigned int len, pglen = xdr->page_len;
+ unsigned int len, pglen = xdr->page_len, first_kmap;
len = xdr->head[0].iov_len;
if (base < len) {
@@ -203,9 +203,17 @@ int xdr_kmap(struct iovec *iov_base, str
ppage += base >> PAGE_CACHE_SHIFT;
base &= ~PAGE_CACHE_MASK;
}
+ first_kmap = 1;
do {
len = PAGE_CACHE_SIZE;
- iov->iov_base = kmap(*ppage);
+ if (first_kmap) {
+ first_kmap = 0;
+ iov->iov_base = kmap(*ppage);
+ } else {
+ iov->iov_base = kmap_nonblock(*ppage);
+ if (!iov->iov_base)
+ goto out;
+ }
if (base) {
iov->iov_base += base;
len -= base;
@@ -223,20 +231,23 @@ map_tail:
iov->iov_base = (char *)xdr->tail[0].iov_base + base;
iov++;
}
+ out:
return (iov - iov_base);
}
-void xdr_kunmap(struct xdr_buf *xdr, unsigned int base)
+void xdr_kunmap(struct xdr_buf *xdr, unsigned int base, int niov)
{
struct page **ppage = xdr->pages;
unsigned int pglen = xdr->page_len;
if (!pglen)
return;
- if (base > xdr->head[0].iov_len)
+ if (base >= xdr->head[0].iov_len)
base -= xdr->head[0].iov_len;
- else
+ else {
+ niov--;
base = 0;
+ }
if (base >= pglen)
return;
@@ -250,7 +261,11 @@ void xdr_kunmap(struct xdr_buf *xdr, uns
* we bump pglen here, and just subtract PAGE_CACHE_SIZE... */
pglen += base & ~PAGE_CACHE_MASK;
}
- for (;;) {
+ /*
+ * In case we could only do a partial xdr_kmap, all remaining iovecs
+ * refer to pages. Otherwise we detect the end through pglen.
+ */
+ for (; niov; niov--) {
flush_dcache_page(*ppage);
kunmap(*ppage);
if (pglen <= PAGE_CACHE_SIZE)
@@ -322,9 +337,22 @@ void
xdr_shift_buf(struct xdr_buf *xdr, size_t len)
{
struct iovec iov[MAX_IOVEC];
- unsigned int nr;
+ unsigned int nr, len_part, n, skip;
+
+ skip = 0;
+ do {
+
+ nr = xdr_kmap(iov, xdr, skip);
+
+ len_part = 0;
+ for (n = 0; n < nr; n++)
+ len_part += iov[n].iov_len;
+
+ xdr_shift_iovec(iov, nr, len_part);
+
+ xdr_kunmap(xdr, skip, nr);
- nr = xdr_kmap(iov, xdr, 0);
- xdr_shift_iovec(iov, nr, len);
- xdr_kunmap(xdr, 0);
+ skip += len_part;
+ len -= len_part;
+ } while (len);
}
diff -urNp nfs-ref/net/sunrpc/xprt.c nfs/net/sunrpc/xprt.c
--- nfs-ref/net/sunrpc/xprt.c 2003-01-29 06:14:32.000000000 +0100
+++ nfs/net/sunrpc/xprt.c 2003-02-20 21:39:51.000000000 +0100
@@ -226,23 +226,34 @@ xprt_sendmsg(struct rpc_xprt *xprt, stru
/* Dont repeat bytes */
skip = req->rq_bytes_sent;
slen = xdr->len - skip;
- niov = xdr_kmap(niv, xdr, skip);
+ oldfs = get_fs(); set_fs(get_ds());
+ do {
+ unsigned int slen_part, n;
- msg.msg_flags = MSG_DONTWAIT|MSG_NOSIGNAL;
- msg.msg_iov = niv;
- msg.msg_iovlen = niov;
- msg.msg_name = (struct sockaddr *) &xprt->addr;
- msg.msg_namelen = sizeof(xprt->addr);
- msg.msg_control = NULL;
- msg.msg_controllen = 0;
+ niov = xdr_kmap(niv, xdr, skip);
- oldfs = get_fs(); set_fs(get_ds());
- clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
- result = sock_sendmsg(sock, &msg, slen);
+ msg.msg_flags = MSG_DONTWAIT|MSG_NOSIGNAL;
+ msg.msg_iov = niv;
+ msg.msg_iovlen = niov;
+ msg.msg_name = (struct sockaddr *) &xprt->addr;
+ msg.msg_namelen = sizeof(xprt->addr);
+ msg.msg_control = NULL;
+ msg.msg_controllen = 0;
+
+ slen_part = 0;
+ for (n = 0; n < niov; n++)
+ slen_part += niv[n].iov_len;
+
+ clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+ result = sock_sendmsg(sock, &msg, slen_part);
+
+ xdr_kunmap(xdr, skip, niov);
+
+ skip += slen_part;
+ slen -= slen_part;
+ } while (result >= 0 && slen);
set_fs(oldfs);
- xdr_kunmap(xdr, skip);
-
dprintk("RPC: xprt_sendmsg(%d) = %d\n", slen, result);
if (result >= 0)
Andrea
* Re: filesystem access slowing system to a crawl
2003-02-20 21:41 ` Andrew Morton
@ 2003-02-20 22:08 ` Andrea Arcangeli
0 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2003-02-20 22:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: Marc-Christian Petersen, t.baetzler, linux-kernel, marcelo
On Thu, Feb 20, 2003 at 01:41:04PM -0800, Andrew Morton wrote:
> Marc-Christian Petersen <m.c.p@wolk-project.de> wrote:
> >
> > On Thursday 20 February 2003 19:35, Andrew Morton wrote:
> >
> > Hi Andrew,
> >
> > > Andrea's VM patches, against 2.4.21-pre4 are at
> > > http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
> > > The applying order is in the series file.
> > I am afraid Marcelo will never accept these or some of them.
> >
>
> The most important one is inode-highmem. It's a safe patch, and the risk of
> it causing problems due to not having other surrounding -aa stuff is low.
>
> It's a matter of someone getting down, testing it and sending it.
>
> Ho hum. It'll take an hour. I shall try.
this is a pre kernel; it's meant to *test* stuff, and if anything goes
wrong we're here, ready to fix it immediately. Sure, applying a
last-minute patch to an -rc just before releasing the new official kernel
w/o any kind of testing was a bad idea, but we must not be too
conservative either, especially in cases like these where we are fixing
bugs. I mean, we can't delay bugfixes with the argument that they could
introduce new bugs; otherwise we might as well stop fixing bugs.
Also note that this stuff has been tested aggressively for a very long
time by lots of people; it's not a last-minute patch like the xdr
highmem deadlock fix ;).
Don't take me wrong: I'm not saying that Marcelo is too conservative,
quite the opposite. I'm simply not so pessimistic as to think the stuff
won't go in ;).
Andrea
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-20 21:54 ` xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl] Andrea Arcangeli
@ 2003-02-20 22:56 ` Trond Myklebust
2003-02-20 23:04 ` Jeff Garzik
` (2 more replies)
2003-02-20 23:15 ` Andreas Dilger
1 sibling, 3 replies; 28+ messages in thread
From: Trond Myklebust @ 2003-02-20 22:56 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Marc-Christian Petersen, t.baetzler, linux-kernel,
marcelo
[-- Attachment #1: Type: text/plain, Size: 606 bytes --]
>>>>> Andrea Arcangeli <andrea@suse.de> writes:
> 2.5.62 has the very same deadlock condition in xdr triggered by
> nfs too.
> Andrew, if you're forward porting it yourself like with the
> filebacked vma merging feature just let me know so we make sure
> not to duplicate effort.
For 2.5.x we should rather fix MSG_MORE so that it actually works
instead of messing with hacks to kmap().
For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
kmap of > 1 page in one call. Appended here as an attachment FYI
(Marcelo do *not* apply!).
Cheers,
Trond
[-- Attachment #2: va02-kmap-multplepages-2.5.7.patch --]
[-- Type: text/plain, Size: 10405 bytes --]
--- linux/include/linux/csem.h.CSEMORG Mon Apr 8 06:38:03 2002
+++ linux/include/linux/csem.h Mon Apr 8 08:13:00 2002
@@ -0,0 +1,45 @@
+/*
+ * csem.h: Count semaphores, public interface
+ * Written by Hirokazu Takahashi (taka@valinux.co.jp).
+ */
+
+#ifndef _LINUX_CNTSEM_H
+#define _LINUX_CNTSEM_H
+
+#ifdef __KERNEL__
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <asm/system.h>
+#include <asm/atomic.h>
+
+
+struct csemaphore {
+ spinlock_t lock;
+ atomic_t count;
+ wait_queue_head_t wait;
+};
+
+#define __CSEMAPHORE_INITIALIZER(name, count) \
+ { SPIN_LOCK_UNLOCKED, ATOMIC_INIT(count), \
+ __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) }
+
+#define __DECLARE_CSEMAPHORE_GENERIC(name,count) \
+ struct csemaphore name = __CSEMAPHORE_INITIALIZER(name,count)
+
+extern void _down_count(struct csemaphore * , int);
+extern void _up_count(struct csemaphore * , int);
+
+static inline void down_count(struct csemaphore *sem, int cnt)
+{
+ if (cnt) _down_count(sem, cnt);
+}
+
+static inline void up_count(struct csemaphore *sem, int cnt)
+{
+ if (cnt) _up_count(sem, cnt);
+}
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_CNTSEM_H */
--- linux/lib/csem.c.CSEMORG Mon Apr 8 06:35:41 2002
+++ linux/lib/csem.c Mon Apr 8 08:04:55 2002
@@ -0,0 +1,55 @@
+/*
+ * csem.c: Count semaphores: contention handling generic functions
+ * You can get one or more counts at once.
+ * Written by Hirokazu Takahashi (taka@valinux.co.jp).
+ */
+
+#include <linux/csem.h>
+#include <linux/sched.h>
+#include <linux/module.h>
+
+
+void _down_count(struct csemaphore *sem, int n)
+{
+ struct task_struct *tsk = current;
+ DECLARE_WAITQUEUE(wait, tsk);
+
+ spin_lock(&sem->lock);
+ if (!waitqueue_active(&sem->wait) && atomic_read(&sem->count) >= n) {
+ atomic_sub(n, &sem->count);
+ spin_unlock(&sem->lock);
+ return;
+ }
+ tsk->state = TASK_UNINTERRUPTIBLE;
+ add_wait_queue_exclusive(&sem->wait, &wait);
+wait:
+ spin_unlock(&sem->lock);
+ schedule();
+ spin_lock(&sem->lock);
+ if (atomic_read(&sem->count) < n) {
+ tsk->state = TASK_UNINTERRUPTIBLE;
+ goto wait;
+ }
+ atomic_sub(n, &sem->count);
+ remove_wait_queue(&sem->wait, &wait);
+ tsk->state = TASK_RUNNING;
+ if (waitqueue_active(&sem->wait) && atomic_read(&sem->count)) {
+ wake_up(&sem->wait);
+ }
+ spin_unlock(&sem->lock);
+}
+
+
+void _up_count(struct csemaphore *sem, int n)
+{
+ spin_lock(&sem->lock);
+ atomic_add(n, &sem->count);
+ if (waitqueue_active(&sem->wait)) {
+ wake_up(&sem->wait);
+ }
+ spin_unlock(&sem->lock);
+}
+
+EXPORT_SYMBOL(_down_count);
+EXPORT_SYMBOL(_up_count);
+
--- linux/lib/Makefile.CSEMORG Mon Apr 8 06:35:28 2002
+++ linux/lib/Makefile Mon Apr 8 06:36:30 2002
@@ -8,9 +8,9 @@
L_TARGET := lib.a
-export-objs := cmdline.o dec_and_lock.o rwsem-spinlock.o rwsem.o crc32.o
+export-objs := cmdline.o dec_and_lock.o rwsem-spinlock.o rwsem.o crc32.o csem.o
-obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o bust_spinlocks.o rbtree.o
+obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o bust_spinlocks.o rbtree.o csem.o
obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
--- linux/include/asm-i386/highmem.h.CSEMORG Mon Apr 8 06:01:42 2002
+++ linux/include/asm-i386/highmem.h Mon Apr 8 06:02:39 2002
@@ -50,8 +50,8 @@
#define PKMAP_NR(virt) ((virt-PKMAP_BASE) >> PAGE_SHIFT)
#define PKMAP_ADDR(nr) (PKMAP_BASE + ((nr) << PAGE_SHIFT))
-extern void * FASTCALL(kmap_high(struct page *page));
-extern void FASTCALL(kunmap_high(struct page *page));
+extern void * FASTCALL(kmap_highpages(struct page **page, int cnt));
+extern void FASTCALL(kunmap_highpages(struct page **page, int cnt));
static inline void *kmap(struct page *page)
{
@@ -59,7 +59,8 @@
BUG();
if (page < highmem_start_page)
return page_address(page);
- return kmap_high(page);
+ kmap_highpages(&page, 1);
+ return page_address(page);
}
static inline void kunmap(struct page *page)
@@ -68,7 +69,7 @@
BUG();
if (page < highmem_start_page)
return;
- kunmap_high(page);
+ kunmap_highpages(&page, 1);
}
/*
--- linux/include/linux/highmem.h.CSEMORG Mon Apr 8 12:01:36 2002
+++ linux/include/linux/highmem.h Mon Apr 8 12:06:27 2002
@@ -71,6 +71,9 @@
#define kunmap(page) do { } while (0)
+#define kmap_highpages(pagep,cnt) do { } while (0)
+#define kunmap_highpages(pagep,cnt) do { } while (0)
+
#define kmap_atomic(page,idx) kmap(page)
#define kunmap_atomic(page,idx) kunmap(page)
--- linux/kernel/ksyms.c.CSEMORG Mon Apr 8 06:02:00 2002
+++ linux/kernel/ksyms.c Mon Apr 8 06:02:39 2002
@@ -118,8 +118,8 @@
EXPORT_SYMBOL(init_mm);
EXPORT_SYMBOL(create_bounce);
#ifdef CONFIG_HIGHMEM
-EXPORT_SYMBOL(kmap_high);
-EXPORT_SYMBOL(kunmap_high);
+EXPORT_SYMBOL(kmap_highpages);
+EXPORT_SYMBOL(kunmap_highpages);
EXPORT_SYMBOL(highmem_start_page);
EXPORT_SYMBOL(kmap_prot);
EXPORT_SYMBOL(kmap_pte);
--- linux/mm/highmem.c.CSEMORG Mon Apr 8 06:02:11 2002
+++ linux/mm/highmem.c Mon Apr 8 06:49:06 2002
@@ -20,6 +20,7 @@
#include <linux/pagemap.h>
#include <linux/mempool.h>
#include <linux/blkdev.h>
+#include <linux/csem.h>
#include <asm/pgalloc.h>
static mempool_t *page_pool, *isa_page_pool;
@@ -51,7 +52,7 @@
pte_t * pkmap_page_table;
-static DECLARE_WAIT_QUEUE_HEAD(pkmap_map_wait);
+static __DECLARE_CSEMAPHORE_GENERIC(pkmap_sem, LAST_PKMAP);
static void flush_all_zero_pkmaps(void)
{
@@ -96,7 +97,6 @@
unsigned long vaddr;
int count;
-start:
count = LAST_PKMAP;
/* Find an empty entry */
for (;;) {
@@ -110,26 +110,7 @@
if (--count)
continue;
- /*
- * Sleep for somebody else to unmap their entries
- */
- {
- DECLARE_WAITQUEUE(wait, current);
-
- current->state = TASK_UNINTERRUPTIBLE;
- add_wait_queue(&pkmap_map_wait, &wait);
- spin_unlock(&kmap_lock);
- schedule();
- remove_wait_queue(&pkmap_map_wait, &wait);
- spin_lock(&kmap_lock);
-
- /* Somebody else might have mapped it while we slept */
- if (page->virtual)
- return (unsigned long) page->virtual;
-
- /* Re-start */
- goto start;
- }
+ panic("kmap: failed to allocate a entry.\n");
}
vaddr = PKMAP_ADDR(last_pkmap_nr);
set_pte(&(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));
@@ -140,10 +121,16 @@
return vaddr;
}
-void *kmap_high(struct page *page)
+void *kmap_highpages(struct page **page, int cnt)
{
unsigned long vaddr;
-
+ int hcnt = 0;
+ int mapped = 0;
+ int i;
+
+ for (i = 0; i < cnt; i++)
+ if (page[i] >= highmem_start_page) hcnt++;
+ down_count(&pkmap_sem , hcnt);
/*
* For highmem pages, we can't trust "virtual" until
* after we have the lock.
@@ -151,54 +138,65 @@
* We cannot call this from interrupts, as it may block
*/
spin_lock(&kmap_lock);
- vaddr = (unsigned long) page->virtual;
- if (!vaddr)
- vaddr = map_new_virtual(page);
- pkmap_count[PKMAP_NR(vaddr)]++;
- if (pkmap_count[PKMAP_NR(vaddr)] < 2)
- BUG();
+ for (i = 0; i < cnt; i++) {
+ if (page[i] < highmem_start_page)
+ continue;
+ vaddr = (unsigned long) page[i]->virtual;
+ if (!vaddr)
+ vaddr = map_new_virtual(page[i]);
+ if (pkmap_count[PKMAP_NR(vaddr)] == 1)
+ mapped++;
+ pkmap_count[PKMAP_NR(vaddr)]++;
+ if (pkmap_count[PKMAP_NR(vaddr)] < 2)
+ BUG();
+ }
+ if (hcnt != mapped)
+ up_count(&pkmap_sem, hcnt - mapped);
spin_unlock(&kmap_lock);
return (void*) vaddr;
}
-void kunmap_high(struct page *page)
+void kunmap_highpages(struct page **page, int cnt)
{
unsigned long vaddr;
unsigned long nr;
- int need_wakeup;
+ int release_cnt = 0;
+ int i;
spin_lock(&kmap_lock);
- vaddr = (unsigned long) page->virtual;
- if (!vaddr)
- BUG();
- nr = PKMAP_NR(vaddr);
+ for (i = 0; i < cnt; i++) {
+ if (page[i] < highmem_start_page)
+ continue;
+ vaddr = (unsigned long) page[i]->virtual;
+ if (!vaddr)
+ BUG();
+ nr = PKMAP_NR(vaddr);
- /*
- * A count must never go down to zero
- * without a TLB flush!
- */
- need_wakeup = 0;
- switch (--pkmap_count[nr]) {
- case 0:
- BUG();
- case 1:
/*
- * Avoid an unnecessary wake_up() function call.
- * The common case is pkmap_count[] == 1, but
- * no waiters.
- * The tasks queued in the wait-queue are guarded
- * by both the lock in the wait-queue-head and by
- * the kmap_lock. As the kmap_lock is held here,
- * no need for the wait-queue-head's lock. Simply
- * test if the queue is empty.
+ * A count must never go down to zero
+ * without a TLB flush!
*/
- need_wakeup = waitqueue_active(&pkmap_map_wait);
+ switch (--pkmap_count[nr]) {
+ case 0:
+ BUG();
+ case 1:
+ /*
+ * Avoid an unnecessary wake_up() function call.
+ * The common case is pkmap_count[] == 1, but
+ * no waiters.
+ * The tasks queued in the wait-queue are guarded
+ * by both the lock in the wait-queue-head and by
+ * the kmap_lock. As the kmap_lock is held here,
+ * no need for the wait-queue-head's lock. Simply
+ * test if the queue is empty.
+ */
+ release_cnt++;
+ }
}
spin_unlock(&kmap_lock);
/* do wake-up, if needed, race-free outside of the spin lock */
- if (need_wakeup)
- wake_up(&pkmap_map_wait);
+ up_count(&pkmap_sem , release_cnt);
}
#define POOL_SIZE 64
--- linux/net/sunrpc/svcsock.c.CSEMORG Mon Apr 8 06:02:26 2002
+++ linux/net/sunrpc/svcsock.c Mon Apr 8 07:59:57 2002
@@ -338,10 +338,12 @@
*/
msg.msg_flags = 0;
- /* Danger!: multiple kmap() calls may cause deadlock */
- for (i = 1; i < bufp->nriov; i++) {
- if (bufp->page[i])
- bufp->iov[i].iov_base += (unsigned int)kmap(bufp->page[i]);
+ if (bufp->nriov > 1) {
+ kmap_highpages(&bufp->page[1], bufp->nriov - 1);
+ for (i = 1; i < bufp->nriov; i++) {
+ if (bufp->page[i])
+ bufp->iov[i].iov_base += (unsigned int)page_address(bufp->page[i]);
+ }
}
/* TODO: Sendpage mechanism will work good than sock_sendmsg() */
@@ -349,12 +351,13 @@
len = sock_sendmsg(sock, &msg, buflen);
set_fs(oldfs);
- for (i = 1; i < bufp->nriov; i++) {
- struct page *page = bufp->page[i];
- if (page) {
- kunmap(page);
- page_cache_release(page);
- bufp->page[i] = NULL;
+ if (bufp->nriov > 1) {
+ kunmap_highpages(&bufp->page[1], bufp->nriov - 1);
+ for (i = 1; i < bufp->nriov; i++) {
+ if (bufp->page[i]) {
+ page_cache_release(bufp->page[i]);
+ bufp->page[i] = NULL;
+ }
}
}
dprintk("svc: socket %p sendto([%p %Zu... ], %d, %d) = %d\n",
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-20 22:56 ` Trond Myklebust
@ 2003-02-20 23:04 ` Jeff Garzik
2003-02-20 23:12 ` Trond Myklebust
2003-02-21 9:41 ` Andrea Arcangeli
2003-02-21 9:37 ` Andrea Arcangeli
2003-02-21 20:52 ` Andrew Morton
2 siblings, 2 replies; 28+ messages in thread
From: Jeff Garzik @ 2003-02-20 23:04 UTC (permalink / raw)
To: Trond Myklebust
Cc: Andrea Arcangeli, Andrew Morton, Marc-Christian Petersen,
t.baetzler, linux-kernel, marcelo
On Thu, Feb 20, 2003 at 11:56:14PM +0100, Trond Myklebust wrote:
> >>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:
>
> > 2.5.62 has the very same deadlock condition in xdr triggered by
> > nfs too.
> > Andrew, if you're forward porting it yourself like with the
> > filebacked vma merging feature just let me know so we make sure
> > not to duplicate effort.
>
> For 2.5.x we should rather fix MSG_MORE so that it actually works
> instead of messing with hacks to kmap().
>
> For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> kmap of > 1 page in one call. Appended here as an attachment FYI
> (Marcelo do *not* apply!).
One should also consider kmap_atomic... (bcrl suggests)
Jeff
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-20 23:04 ` Jeff Garzik
@ 2003-02-20 23:12 ` Trond Myklebust
2003-02-21 9:41 ` Andrea Arcangeli
2003-02-21 9:41 ` Andrea Arcangeli
1 sibling, 1 reply; 28+ messages in thread
From: Trond Myklebust @ 2003-02-20 23:12 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrea Arcangeli, Andrew Morton, Marc-Christian Petersen,
t.baetzler, linux-kernel, marcelo
>>>>> " " == Jeff Garzik <jgarzik@pobox.com> writes:
> One should also consider kmap_atomic... (bcrl suggest)
The problem is that sendmsg() can sleep. kmap_atomic() isn't really
appropriate here.
Cheers,
Trond
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-20 21:54 ` xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl] Andrea Arcangeli
2003-02-20 22:56 ` Trond Myklebust
@ 2003-02-20 23:15 ` Andreas Dilger
2003-02-21 9:46 ` Andrea Arcangeli
1 sibling, 1 reply; 28+ messages in thread
From: Andreas Dilger @ 2003-02-20 23:15 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Marc-Christian Petersen, t.baetzler, linux-kernel,
marcelo
On Feb 20, 2003 22:54 +0100, Andrea Arcangeli wrote:
> Explanation is very simple: you _can't_ kmap two times in the context of
> a single task (especially if more than one task can run the same code at
> the same time). I don't yet have the confirmation that this fixes the
> deadlock though (it takes days to reproduce so it will take weeks to
> confirm), but I can't see anything else wrong at the moment, and this
> remains a genuine highmem deadlock that has to be fixed. The fix is
> optimal, no change unless you run out of kmaps and in turn you can
> deadlock, i.e. all the light workloads won't be affected at all.
We had a similar problem in Lustre, where we have to kmap multiple pages
at once and hold them over a network RPC (which is doing zero-copy DMA
into multiple pages at once), and there can be a very heavy kmap load
because the client and the server can be on the same system.
What we did was set up a "kmap reservation", which used an atomic_dec()
+ wait_event() to reschedule the task until it could get enough kmaps
to satisfy the request without deadlocking (i.e. exceeding the kmap cap
which we conservatively set at 3/4 of all kmap space).
A single "server" task could exceed the kmap cap by enough to satisfy the
maximum possible request size, so that a single system with both clients
and servers can always make forward progress even in the face of clients
trying to kmap more than the total amount of kmap space.
This works for us because we are the only consumer of huge amounts of kmaps
on our systems, but it would be nice to have a generic interface to do that
so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-20 22:56 ` Trond Myklebust
2003-02-20 23:04 ` Jeff Garzik
@ 2003-02-21 9:37 ` Andrea Arcangeli
2003-02-21 20:52 ` Andrew Morton
2 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 9:37 UTC (permalink / raw)
To: Trond Myklebust
Cc: Andrew Morton, Marc-Christian Petersen, t.baetzler, linux-kernel,
marcelo
On Thu, Feb 20, 2003 at 11:56:14PM +0100, Trond Myklebust wrote:
> >>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:
>
> > 2.5.62 has the very same deadlock condition in xdr triggered by
> > nfs too.
> > Andrew, if you're forward porting it yourself like with the
> > filebacked vma merging feature just let me know so we make sure
> > not to duplicate effort.
>
> For 2.5.x we should rather fix MSG_MORE so that it actually works
> instead of messing with hacks to kmap().
>
> For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> kmap of > 1 page in one call. Appended here as an attachment FYI
> (Marcelo do *not* apply!).
you can't do it this way: the number of kmaps available can be just 1,
and this way you can ask for 10000 in a row. Furthermore you want to be
able to use all the kmaps available; think of having 11 kmaps with 10
constantly in use. I much prefer my approach, which is the most
fine-grained and scalable, and it doesn't risk deadlocking as a function
of the number of kmaps in the pool and the max reservation you make. I
considered the approach implemented in the patch you quoted and
discarded it for the reasons explained above.
Andrea
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-20 23:04 ` Jeff Garzik
2003-02-20 23:12 ` Trond Myklebust
@ 2003-02-21 9:41 ` Andrea Arcangeli
1 sibling, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 9:41 UTC (permalink / raw)
To: Jeff Garzik
Cc: Trond Myklebust, Andrew Morton, Marc-Christian Petersen,
t.baetzler, linux-kernel, marcelo
On Thu, Feb 20, 2003 at 06:04:30PM -0500, Jeff Garzik wrote:
> On Thu, Feb 20, 2003 at 11:56:14PM +0100, Trond Myklebust wrote:
> > >>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:
> >
> > > 2.5.62 has the very same deadlock condition in xdr triggered by
> > > nfs too.
> > > Andrew, if you're forward porting it yourself like with the
> > > filebacked vma merging feature just let me know so we make sure
> > > not to duplicate effort.
> >
> > For 2.5.x we should rather fix MSG_MORE so that it actually works
> > instead of messing with hacks to kmap().
> >
> > For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> > kmap of > 1 page in one call. Appended here as an attachment FYI
> > (Marcelo do *not* apply!).
>
>
> One should also consider kmap_atomic... (bcrl suggest)
impossible: either you submit page structures to the IP layer, or you
*must* have persistence; depending on a sock_sendmsg that can't schedule
would be totally broken (or the preemptive thing is a joke). nfs client
O_DIRECT zerocopy would be a nice feature, but this is 2.4.
the only option would be atomic and at the same time persistent
kmaps in the process address space, which don't work well with threads...
but again this is 2.4, and we lack them even in 2.5 because of the
troubles they generate.
Andrea
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-20 23:12 ` Trond Myklebust
@ 2003-02-21 9:41 ` Andrea Arcangeli
2003-02-22 0:40 ` David S. Miller
0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 9:41 UTC (permalink / raw)
To: Trond Myklebust
Cc: Jeff Garzik, Andrew Morton, Marc-Christian Petersen, t.baetzler,
linux-kernel, marcelo
On Fri, Feb 21, 2003 at 12:12:19AM +0100, Trond Myklebust wrote:
> >>>>> " " == Jeff Garzik <jgarzik@pobox.com> writes:
>
> > One should also consider kmap_atomic... (bcrl suggest)
>
> The problem is that sendmsg() can sleep. kmap_atomic() isn't really
> appropriate here.
100% correct.
Andrea
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-20 23:15 ` Andreas Dilger
@ 2003-02-21 9:46 ` Andrea Arcangeli
2003-02-21 19:41 ` Andreas Dilger
0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 9:46 UTC (permalink / raw)
To: Andrew Morton, Marc-Christian Petersen, t.baetzler, linux-kernel,
marcelo
On Thu, Feb 20, 2003 at 04:15:36PM -0700, Andreas Dilger wrote:
> On Feb 20, 2003 22:54 +0100, Andrea Arcangeli wrote:
> > Explanation is very simple: you _can't_ kmap two times in the context of
> > a single task (especially if more than one task can run the same code at
> > the same time). I don't yet have the confirmation that this fixes the
> > deadlock though (it takes days to reproduce so it will take weeks to
> > confirm), but I can't see anything else wrong at the moment, and this
> > remains a genuine highmem deadlock that has to be fixed. The fix is
> > optimal, no change unless you run out of kmaps and in turn you can
> > deadlock, i.e. all the light workloads won't be affected at all.
>
> We had a similar problem in Lustre, where we have to kmap multiple pages
> at once and hold them over a network RPC (which is doing zero-copy DMA
> into multiple pages at once), and there is possibly a very heavy load
> of kmaps because the client and the server can be on the same system.
>
> What we did was set up a "kmap reservation", which used an atomic_dec()
> + wait_event() to reschedule the task until it could get enough kmaps
> to satisfy the request without deadlocking (i.e. exceeding the kmap cap
> which we conservitavely set at 3/4 of all kmap space).
Your approach was fragile (every arch is free to give you just 1 kmap in
the pool and you still must not deadlock), and it's not capable of using
the whole kmap pool at the same time. The only robust and efficient way
to fix it is kmap_nonblock IMHO.
> A single "server" task could exceed the kmap cap by enough to satisfy the
> maximum possible request size, so that a single system with both clients
> and servers can always make forward progress even in the face of clients
> trying to kmap more than the total amount of kmap space.
>
> This works for us because we are the only consumer of huge amounts of kmaps
> on our systems, but it would be nice to have a generic interface to do that
> so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).
This isn't the problem: if NFS weren't broken, it couldn't deadlock
against Lustre even with your design (assuming you don't hit the two
problems mentioned above). But still your design is more fragile and
less scalable, especially for a generic implementation where you don't
know how many pages you'll reserve on average, and you don't know how
many kmap entries the architecture can provide to you. But of course with
kmap_nonblock you'll have to fall back to submitting single pages if it
fails; it's a bit more difficult, but it's more robust and optimized IMHO.
Andrea
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-21 9:46 ` Andrea Arcangeli
@ 2003-02-21 19:41 ` Andreas Dilger
2003-02-21 19:46 ` Andrea Arcangeli
0 siblings, 1 reply; 28+ messages in thread
From: Andreas Dilger @ 2003-02-21 19:41 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Marc-Christian Petersen, t.baetzler, linux-kernel,
marcelo
On Feb 21, 2003 10:46 +0100, Andrea Arcangeli wrote:
> On Thu, Feb 20, 2003 at 04:15:36PM -0700, Andreas Dilger wrote:
> > What we did was set up a "kmap reservation", which used an atomic_dec()
> > + wait_event() to reschedule the task until it could get enough kmaps
> > to satisfy the request without deadlocking (i.e. exceeding the kmap cap
> > which we conservitavely set at 3/4 of all kmap space).
>
> Your approch was fragile (every arch is free to give you just 1 kmap in
> the pool and you still must not deadlock) and it's not capable of using
> the whole kmap pool at the same time. the only robust and efficient way
> to fix it is the kmap_nonblock IMHO
So (says the person who only ever uses i386 and ia64), does an arch exist
which needs highmem/kmap, but only ever gives 1 kmap in the pool?
> > This works for us because we are the only consumer of huge amounts of kmaps
> > on our systems, but it would be nice to have a generic interface to do that
> > so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).
>
> This isn't the problem, if NFS wouldn't be broken it couldn't deadlock
> against Lustre even with your design (assuming you don't fall in the two
> problems mentioned above). But still your design is more fragile and
> less scalable, especially for a generic implementation where you don't
> know how many pages you'll reserve in mean, and you don't know how many
> kmaps entries the architecture can provide to you. But of course with
> kmap_nonblock you'll have to fallback submitting single pages if it
> fails, it's a bit more difficult but it's more robust and optimized IMHO.
In our case, Lustre (well Portals really, the underlying network protocol)
always knows in advance the number of pages that it will need to kmap
because the client needs to tell the server in advance how much bulk data
it is going to send. This is required to be able to do RDMA. It might
be possible to have the server do the transfer in multiple parts if
kmap_nonblock() failed, but that is not how things are currently set up,
which is why we block in advance until we know we can get enough pages.
This is very similar to ext3 journaling, which requests in advance the
maximum number of journal blocks it might need, and blocks until it can
get them all.
The only problem happens when other parts of the kernel start acquiring
multiple kmaps without using the same reservation/accounting system as us.
Each works fine in isolation, but in combination it fails.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-21 19:41 ` Andreas Dilger
@ 2003-02-21 19:46 ` Andrea Arcangeli
0 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 19:46 UTC (permalink / raw)
To: Andrew Morton, Marc-Christian Petersen, t.baetzler, linux-kernel,
marcelo
On Fri, Feb 21, 2003 at 12:41:09PM -0700, Andreas Dilger wrote:
> On Feb 21, 2003 10:46 +0100, Andrea Arcangeli wrote:
> > On Thu, Feb 20, 2003 at 04:15:36PM -0700, Andreas Dilger wrote:
> > > What we did was set up a "kmap reservation", which used an atomic_dec()
> > > + wait_event() to reschedule the task until it could get enough kmaps
> > > to satisfy the request without deadlocking (i.e. exceeding the kmap cap
> > > which we conservitavely set at 3/4 of all kmap space).
> >
> > Your approch was fragile (every arch is free to give you just 1 kmap in
> > the pool and you still must not deadlock) and it's not capable of using
> > the whole kmap pool at the same time. the only robust and efficient way
> > to fix it is the kmap_nonblock IMHO
>
> So (says the person who only ever uses i386 and ia64), does an arch exist
> which needs highmem/kmap, but only ever gives 1 kmap in the pool?
>
> > > This works for us because we are the only consumer of huge amounts of kmaps
> > > on our systems, but it would be nice to have a generic interface to do that
> > > so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).
> >
> > This isn't the problem, if NFS wouldn't be broken it couldn't deadlock
> > against Lustre even with your design (assuming you don't fall in the two
> > problems mentioned above). But still your design is more fragile and
> > less scalable, especially for a generic implementation where you don't
> > know how many pages you'll reserve in mean, and you don't know how many
> > kmaps entries the architecture can provide to you. But of course with
> > kmap_nonblock you'll have to fallback submitting single pages if it
> > fails, it's a bit more difficult but it's more robust and optimized IMHO.
>
> In our case, Lustre (well Portals really, the underlying network protocol)
> always knows in advance the number of pages that it will need to kmap
> because the client needs to tell the server in advance how much bulk data
> is going to send. This is required for being able to do RDMA. It might
> be possible to have the server do the transfer in multiple parts if
> kmap_nonblock() failed, but that is not how things are currently set up,
> which is why we block in advance until we know we can get enough pages.
>
> This is very similar to ext3 journaling, which requests in advance the
> maximum number of journal blocks it might need, and blocks until it can
> get them all.
>
> The only problem happens when other parts of the kernel start acquiring
> multiple kmaps without using the same reservation/accounting system as us.
> Each works fine in isolation, but in combination it fails.
no, if the other places are not buggy, it won't fail, regardless of
whether they use your mechanism or the kmap_nonblock. You don't have to
use your mechanism everywhere to make your mechanism work. For instance
you will be fine with the kmap_nonblock fix in combination with your
current code. Not sure why you think otherwise.
I understand it may be simpler to do the full reservation; in ext3 you
don't even risk anything because you know how large the pool is. But I
think for these cases kmap_nonblock is superior, because with the
reservation you have an obvious dependency on the architecture and
you're not able to make the best use of the whole kmap pool (and here
there's no transaction that has to be committed all at once, so it's
doable). Still, in practice it will work fine in combination with the
other safe usages (like kmap_nonblock) if you reserve few enough pages
at a time.
Andrea
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-20 22:56 ` Trond Myklebust
2003-02-20 23:04 ` Jeff Garzik
2003-02-21 9:37 ` Andrea Arcangeli
@ 2003-02-21 20:52 ` Andrew Morton
2003-02-21 21:32 ` Trond Myklebust
2 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2003-02-21 20:52 UTC (permalink / raw)
To: Trond Myklebust; +Cc: andrea, m.c.p, t.baetzler, linux-kernel, marcelo
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
> >>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:
>
> > 2.5.62 has the very same deadlock condition in xdr triggered by
> > nfs too.
> > Andrew, if you're forward porting it yourself like with the
> > filebacked vma merging feature just let me know so we make sure
> > not to duplicate effort.
>
> For 2.5.x we should rather fix MSG_MORE so that it actually works
> instead of messing with hacks to kmap().
Is the fixing of MSG_MORE likely to actually happen?
> For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> kmap of > 1 page in one call. Appended here as an attachment FYI
> (Marcelo do *not* apply!).
Andrea's patch is quite simple. Although I wonder if this, in
xdr_kmap():
+ } else {
+ iov->iov_base = kmap_nonblock(*ppage);
+ if (!iov->iov_base)
+ goto out;
+ }
should be skipping the map_tail thing?
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-21 20:52 ` Andrew Morton
@ 2003-02-21 21:32 ` Trond Myklebust
0 siblings, 0 replies; 28+ messages in thread
From: Trond Myklebust @ 2003-02-21 21:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: andrea, m.c.p, t.baetzler, linux-kernel, marcelo
>>>>> " " == Andrew Morton <akpm@digeo.com> writes:
>> For 2.5.x we should rather fix MSG_MORE so that it actually
>> works instead of messing with hacks to kmap().
> Is the fixing of MSG_MORE likely to actually happen?
We had better try. The server/knfsd has already converted to sendpage
+ MSG_MORE 8-)
That won't work for 2.4.x though, since that doesn't have support for
sendpage over UDP.
Cheers,
Trond
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-21 9:41 ` Andrea Arcangeli
@ 2003-02-22 0:40 ` David S. Miller
2003-02-23 15:22 ` Andrea Arcangeli
0 siblings, 1 reply; 28+ messages in thread
From: David S. Miller @ 2003-02-22 0:40 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Trond Myklebust, Jeff Garzik, Andrew Morton,
Marc-Christian Petersen, t.baetzler, linux-kernel, marcelo
On Fri, 2003-02-21 at 01:41, Andrea Arcangeli wrote:
> On Fri, Feb 21, 2003 at 12:12:19AM +0100, Trond Myklebust wrote:
> > >>>>> " " == Jeff Garzik <jgarzik@pobox.com> writes:
> >
> > > One should also consider kmap_atomic... (bcrl suggest)
> >
> > The problem is that sendmsg() can sleep. kmap_atomic() isn't really
> > appropriate here.
>
> 100% correct.
It actually depends upon whether you have sk->priority set
to GFP_ATOMIC or GFP_KERNEL.
* Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]
2003-02-22 0:40 ` David S. Miller
@ 2003-02-23 15:22 ` Andrea Arcangeli
0 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2003-02-23 15:22 UTC (permalink / raw)
To: David S. Miller
Cc: Trond Myklebust, Jeff Garzik, Andrew Morton,
Marc-Christian Petersen, t.baetzler, linux-kernel, marcelo
On Fri, Feb 21, 2003 at 04:40:41PM -0800, David S. Miller wrote:
> On Fri, 2003-02-21 at 01:41, Andrea Arcangeli wrote:
> > On Fri, Feb 21, 2003 at 12:12:19AM +0100, Trond Myklebust wrote:
> > > >>>>> " " == Jeff Garzik <jgarzik@pobox.com> writes:
> > >
> > > > One should also consider kmap_atomic... (bcrl suggest)
> > >
> > > The problem is that sendmsg() can sleep. kmap_atomic() isn't really
> > > appropriate here.
> >
> > 100% correct.
>
> It actually depends upon whether you have sk->priority set
> to GFP_ATOMIC or GFP_KERNEL.
You must not disable preemption when entering sock_sendmsg, no matter
what sk->priority is. Disabling preemption inside sock_sendmsg would be
way too late, so even if you had such a preemption bug in sock_sendmsg,
it wouldn't help; you would need to disable preemption in the caller
before doing the kmap_atomic, if anything. And again, that is a
preemption bug.
Not to mention you'd need to allocate a big pool of atomic kmaps to do
that, and this would eat hundreds of megs of virtual address space since
it's replicated per-cpu. This makes even less sense: the machines where
the highmem deadlock triggers eat the normal zone big time.
Really, the claim that it can be solved with atomic kmaps doesn't make
any sense to me, nor does the claim that sock_sendmsg will not schedule
if called with GFP_ATOMIC. Of course it must not schedule if it can be
called from an irq with priority=GFP_ATOMIC, but that isn't the case
we're discussing here; an irq implicitly disables preemption by design,
and calling sock_sendmsg from an irq isn't really desirable (even if
technically possible, maybe with priority=GFP_ATOMIC according to you)
because it will take some time.
Andrea
* Re: filesystem access slowing system to a crawl
2003-02-19 17:49 ` Andrea Arcangeli
2003-02-20 15:29 ` Marc-Christian Petersen
@ 2003-02-26 23:17 ` Marc-Christian Petersen
2003-02-27 8:51 ` Marc-Christian Petersen
1 sibling, 1 reply; 28+ messages in thread
From: Marc-Christian Petersen @ 2003-02-26 23:17 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Andrea Arcangeli, Andrew Morton, linux-kernel, Andrew Morton
On Wednesday 19 February 2003 18:49, Andrea Arcangeli wrote:
Hi Marcelo,
apply this, please!
> On Wed, Feb 19, 2003 at 05:42:34PM +0100, Marc-Christian Petersen wrote:
> > On Wednesday 05 February 2003 10:39, Andrew Morton wrote:
> >
> > Hi Andrew,
> >
> > > > Running just "find /" (or ls -R or tar on a large directory) locally
> > > > slows the box down to absolute unresponsiveness - it takes minutes
> > > > to just run ps and kill the find process. During that time, kupdated
> > > > and kswapd gobble up all available CPU time.
> > >
> > > Could be that your "low memory" is filled up with inodes. This would
> > > only happen in these tests if you're using ext2, and there are a *lot*
> > > of directories.
> > > I've prepared a lineup of Andrea's VM patches at
> > > It would be useful if you could apply 10_inode-highmem-2.patch and
> > > report back. It applies to 2.4.19 as well, and should work OK there.
> >
> > is there any reason why this (inode-highmem-2) has never been submitted
> > for inclusion into mainline yet?
Marcelo please include this:
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
Other fixes should be included too, but unfortunately they don't apply
cleanly yet; I (or somebody else) should rediff them against mainline.
> Andrea
ciao, Marc
* Re: filesystem access slowing system to a crawl
2003-02-26 23:17 ` filesystem access slowing system to a crawl Marc-Christian Petersen
@ 2003-02-27 8:51 ` Marc-Christian Petersen
0 siblings, 0 replies; 28+ messages in thread
From: Marc-Christian Petersen @ 2003-02-27 8:51 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Andrea Arcangeli, Andrew Morton, linux-kernel, Andrew Morton
[-- Attachment #1: Type: text/plain, Size: 2618 bytes --]
On Thursday 27 February 2003 00:17, Marc-Christian Petersen wrote:
Hi again,
> Hi Marcelo,
> apply this, please!
The patch is by Andrea. I will send it once every day until I see the merge in
-BK, or a mail from you here on LKML explaining why you won't take it!
P.S.: I see some bogus patches in -BK (now -pre5) which got merged. This patch
has existed for ages (inode-highmem-2), has survived tons of testing, and it is
a must!
I can only _repeat_ Andrea (I agree 100% with his statement):
------------------------------------------------------------------------
this is a pre kernel, it's meant to *test* stuff; if anything goes
wrong we're here ready to fix it immediately. Sure, applying a patch at
the last minute to an -rc just before releasing the new official kernel
w/o any kind of testing was a bad idea, but we must not be too
conservative either, especially in cases like these where we are fixing
bugs. I mean, we can't delay bugfixes with the argument that they could
introduce new bugs, otherwise we might as well stop fixing bugs.
Also note that this stuff has been tested aggressively for a very long
time by lots of people; it's not a last-minute patch like the xdr
highmem deadlock ;).
------------------------------------------------------------------------
regards!
>
> > On Wed, Feb 19, 2003 at 05:42:34PM +0100, Marc-Christian Petersen wrote:
> > > On Wednesday 05 February 2003 10:39, Andrew Morton wrote:
> > >
> > > Hi Andrew,
> > >
> > > > > Running just "find /" (or ls -R or tar on a large directory)
> > > > > locally slows the box down to absolute unresponsiveness - it takes
> > > > > minutes to just run ps and kill the find process. During that time,
> > > > > kupdated and kswapd gobble up all available CPU time.
> > > >
> > > > Could be that your "low memory" is filled up with inodes. This would
> > > > only happen in these tests if you're using ext2, and there are a
> > > > *lot* of directories.
> > > > I've prepared a lineup of Andrea's VM patches at
> > > > It would be useful if you could apply 10_inode-highmem-2.patch and
> > > > report back. It applies to 2.4.19 as well, and should work OK there.
> > >
> > > is there any reason why this (inode-highmem-2) has never been submitted
> > > for inclusion into mainline?
>
> Marcelo please include this:
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
> The other fixes should be included too, but unfortunately they don't apply
> cleanly yet; I (or somebody else) should rediff them against mainline.
[-- Attachment #2: inode-highmem-2.patch --]
[-- Type: text/x-diff, Size: 3834 bytes --]
diff -urNp 2.4.19pre9/fs/inode.c 2.4.19pre9aa2/fs/inode.c
--- 2.4.19pre9/fs/inode.c Wed May 29 02:12:36 2002
+++ 2.4.19pre9aa2/fs/inode.c Fri May 31 04:43:41 2002
@@ -665,35 +665,88 @@ void prune_icache(int goal)
{
LIST_HEAD(list);
struct list_head *entry, *freeable = &list;
- int count;
+ int count, pass;
struct inode * inode;
- spin_lock(&inode_lock);
+ count = pass = 0;
+ entry = &inode_unused;
- count = 0;
- entry = inode_unused.prev;
- while (entry != &inode_unused)
- {
- struct list_head *tmp = entry;
+ spin_lock(&inode_lock);
+ while (goal && pass++ < 2) {
+ entry = inode_unused.prev;
+ while (entry != &inode_unused)
+ {
+ struct list_head *tmp = entry;
- entry = entry->prev;
- inode = INODE(tmp);
- if (inode->i_state & (I_FREEING|I_CLEAR|I_LOCK))
- continue;
- if (!CAN_UNUSE(inode))
- continue;
- if (atomic_read(&inode->i_count))
- continue;
- list_del(tmp);
- list_del(&inode->i_hash);
- INIT_LIST_HEAD(&inode->i_hash);
- list_add(tmp, freeable);
- inode->i_state |= I_FREEING;
- count++;
- if (!--goal)
- break;
+ entry = entry->prev;
+ inode = INODE(tmp);
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_LOCK))
+ continue;
+ if (atomic_read(&inode->i_count))
+ continue;
+ if (pass == 2 && !inode->i_state && !CAN_UNUSE(inode)) {
+ if (inode_has_buffers(inode))
+ /*
+ * If the inode has dirty buffers
+ * pending, start flushing out bdflush.ndirty
+ * worth of data even if there's no dirty-memory
+ * pressure. Do nothing else in this
+ * case, until all dirty buffers are gone
+ * we can do nothing about the inode other than
+ * to keep flushing dirty stuff. We could also
+ * flush only the dirty buffers in the inode
+ * but there's no API to do it asynchronously
+ * and this simpler approach to dealing with the
+ * dirty payload shouldn't make much difference
+ * in practice. Also keep in mind that if somebody
+ * keeps overwriting data in a flood, we'd
+ * never manage to drop the inode anyway,
+ * and we really shouldn't, because
+ * it's a heavily used one.
+ */
+ wakeup_bdflush();
+ else if (inode->i_data.nrpages)
+ /*
+ * If we're here it means the only reason
+ * we cannot drop the inode is its
+ * pagecache, so go ahead and trim it
+ * hard. If it doesn't go away, it means
+ * the pages are dirty or pinned, a la
+ * ramfs.
+ *
+ * invalidate_inode_pages() is a
+ * non-blocking operation, but we introduce
+ * a dependency order between the
+ * inode_lock and the pagemap_lru_lock,
+ * the inode_lock must always be taken
+ * first from now on.
+ */
+ invalidate_inode_pages(inode);
+ }
+ if (!CAN_UNUSE(inode))
+ continue;
+ list_del(tmp);
+ list_del(&inode->i_hash);
+ INIT_LIST_HEAD(&inode->i_hash);
+ list_add(tmp, freeable);
+ inode->i_state |= I_FREEING;
+ count++;
+ if (!--goal)
+ break;
+ }
}
inodes_stat.nr_unused -= count;
+
+ /*
+ * the unused list is hardly an LRU, so it makes
+ * more sense to rotate it so we don't always bang
+ * on the same inodes in case they're
+ * unfreeable for whatever reason.
+ */
+ if (entry != &inode_unused) {
+ list_del(&inode_unused);
+ list_add(&inode_unused, entry);
+ }
spin_unlock(&inode_lock);
dispose_list(freeable);
diff -urNp 2.4.19pre9/mm/filemap.c 2.4.19pre9aa2/mm/filemap.c
--- 2.4.19pre9/mm/filemap.c Wed May 29 02:12:46 2002
+++ 2.4.19pre9aa2/mm/filemap.c Fri May 31 04:44:29 2002
@@ -194,7 +194,7 @@ void invalidate_inode_pages(struct inode
if (TryLockPage(page))
continue;
- if (page->buffers && !try_to_free_buffers(page, 0))
+ if (page->buffers && !try_to_release_page(page, 0))
goto unlock;
if (page_count(page) != 1)
Thread overview: 28+ messages
2003-02-04 9:29 filesystem access slowing system to a crawl Thomas Bätzler
2003-02-05 9:03 ` Denis Vlasenko
2003-02-05 9:39 ` Andrew Morton
2003-02-19 16:42 ` Marc-Christian Petersen
2003-02-19 17:49 ` Andrea Arcangeli
2003-02-20 15:29 ` Marc-Christian Petersen
2003-02-20 18:35 ` Andrew Morton
2003-02-20 21:32 ` Marc-Christian Petersen
2003-02-20 21:41 ` Andrew Morton
2003-02-20 22:08 ` Andrea Arcangeli
2003-02-20 21:54 ` xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl] Andrea Arcangeli
2003-02-20 22:56 ` Trond Myklebust
2003-02-20 23:04 ` Jeff Garzik
2003-02-20 23:12 ` Trond Myklebust
2003-02-21 9:41 ` Andrea Arcangeli
2003-02-22 0:40 ` David S. Miller
2003-02-23 15:22 ` Andrea Arcangeli
2003-02-21 9:41 ` Andrea Arcangeli
2003-02-21 9:37 ` Andrea Arcangeli
2003-02-21 20:52 ` Andrew Morton
2003-02-21 21:32 ` Trond Myklebust
2003-02-20 23:15 ` Andreas Dilger
2003-02-21 9:46 ` Andrea Arcangeli
2003-02-21 19:41 ` Andreas Dilger
2003-02-21 19:46 ` Andrea Arcangeli
2003-02-26 23:17 ` filesystem access slowing system to a crawl Marc-Christian Petersen
2003-02-27 8:51 ` Marc-Christian Petersen
2003-02-20 19:30 ` William Stearns