* Fix knfsd readahead cache in 2.4.15
@ 2001-11-26 13:50 Trond Myklebust
2001-11-26 23:35 ` Neil Brown
0 siblings, 1 reply; 11+ messages in thread
From: Trond Myklebust @ 2001-11-26 13:50 UTC (permalink / raw)
To: Neil Brown; +Cc: NFS maillist, Linux Kernel
Hi Neil,
The following patch fixes a bug in the knfsd readahead code. The
memset() that is referenced in the patch below is clobbering the
pointer to the next list element (ra->p_next), thus reducing the inode
readahead cache to 1 entry upon the very first call to
nfsd_get_raparms().
BTW: looking at the choice of cache size. Why is this set to number
of threads * 2? Isn't it better to have a minimum cache size? After
all, the fact that I have 8 threads running does not at all reflect
the number of inodes that I might have open on my various clients...
Cheers,
Trond
--- linux-2.4.16-pre1/fs/nfsd/vfs.c.orig Fri Oct 5 21:23:53 2001
+++ linux-2.4.16-pre1/fs/nfsd/vfs.c Mon Nov 26 14:32:09 2001
@@ -560,9 +560,13 @@
 		return NULL;
 	rap = frap;
 	ra = *frap;
-	memset(ra, 0, sizeof(*ra));
 	ra->p_dev = dev;
 	ra->p_ino = ino;
+	ra->p_reada = 0;
+	ra->p_ramax = 0;
+	ra->p_raend = 0;
+	ra->p_ralen = 0;
+	ra->p_rawin = 0;
 found:
 	if (rap != &raparm_cache) {
 		*rap = ra->p_next;
^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Fix knfsd readahead cache in 2.4.15
  2001-11-26 13:50 Fix knfsd readahead cache in 2.4.15 Trond Myklebust
@ 2001-11-26 23:35 ` Neil Brown
  2001-11-26 23:53   ` David S. Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2001-11-26 23:35 UTC (permalink / raw)
To: trond.myklebust; +Cc: NFS maillist, Linux Kernel

On Monday November 26, trond.myklebust@fys.uio.no wrote:
> Hi Neil,
>
> The following patch fixes a bug in the knfsd readahead code. The
> memset() that is referenced in the patch below is clobbering the
> pointer to the next list element (ra->p_next), thus reducing the inode
> readahead cache to 1 entry upon the very first call to
> nfsd_get_raparms().

Thanks.
This is definitely a bug, but I don't think it is quite as dramatic as
you suggest.

The "struct raparms" that ra points to will almost always be the last
one on the list, so ra->p_next will almost always be NULL anyway.
Nevertheless, it is a bug and your fix looks good.

>
> BTW: looking at the choice of cache size. Why is this set to number
> of threads * 2? Isn't it better to have a minimum cache size? After
> all, the fact that I have 8 threads running does not at all reflect
> the number of inodes that I might have open on my various clients...

Historical reasons... That's the way it was when I started looking at
the code, and I never saw any reason to change it.
I did add some statistics gathering so that you could watch the usage
of the racache to see if it was big enough, but I haven't looked at
any numbers yet...

I have a fairly busy nfs server that has been running 2.4.something
for 20 days with 32 nfsd threads. The "ra" line from the stats file
(/proc/net/rpc/nfsd) is:

  ra 64 8921281 74404 43817 25150 18806 17611 12894 11859 8342 6919 1605176

which says that 64 entries were allocated,
  8921281 searches found a hit in the first 10% of the cache,
    74404 searches found a hit in the next 10%,
  and so on...
     6919 searches found a hit in the last 10%,
  1605176 searches (20%) didn't find a hit.

Obviously these numbers would vary a lot with different loads. I
suspect that the 20% miss rate reflects average file size, as the
first read request on a file will always miss. Maybe we have an
average file size of 40K (5 times the 8K read size).

However, this set of numbers suggests to me that 64 entries is
reasonable for us. Doubling the number of entries would reduce the
miss rate by less than 10*6919. Probably only 30,000 at most, which
is 2%.

Ideally the size should be tunable. Then you could monitor these
stats over several days, and adjust the size until the number of
searches that hit in the last 10% is suitably low.

But then again, maybe we should store the cache in a hash table for
faster lookup. It would make the stats harder to collect, but would
make searches faster, which is more important.

The bottom line is that it is a heuristic that seems to work well
enough. The cache needs to be at least as big as the number of
server threads so that each thread can always find an entry when it
needs one. Making it twice that big seems to work.

NeilBrown

>
> Cheers,
>   Trond
>
> --- linux-2.4.16-pre1/fs/nfsd/vfs.c.orig	Fri Oct  5 21:23:53 2001
> +++ linux-2.4.16-pre1/fs/nfsd/vfs.c	Mon Nov 26 14:32:09 2001
> @@ -560,9 +560,13 @@
>  		return NULL;
>  	rap = frap;
>  	ra = *frap;
> -	memset(ra, 0, sizeof(*ra));
>  	ra->p_dev = dev;
>  	ra->p_ino = ino;
> +	ra->p_reada = 0;
> +	ra->p_ramax = 0;
> +	ra->p_raend = 0;
> +	ra->p_ralen = 0;
> +	ra->p_rawin = 0;
>  found:
>  	if (rap != &raparm_cache) {
>  		*rap = ra->p_next;
* Re: Fix knfsd readahead cache in 2.4.15
  2001-11-26 23:35 ` Neil Brown
@ 2001-11-26 23:53   ` David S. Miller
  2001-11-27  0:05     ` David S. Miller
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: David S. Miller @ 2001-11-26 23:53 UTC (permalink / raw)
To: neilb; +Cc: trond.myklebust, nfs, linux-kernel

   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Tue, 27 Nov 2001 10:35:26 +1100 (EST)

   This is definitely a bug, but I don't think it is quite as dramatic
   as you suggest.

   The "struct raparms" that ra points to will almost always be the
   last one on the list, so ra->p_next will almost always be NULL
   anyway. Nevertheless, it is a bug and your fix looks good.

There are other problems remaining; this function is a logical mess.

1) depth is computed in the loop, but thrown away.
   It is basically reassigned to a constant before
   being used to index the statistics it is for.

2) raparm_cache is reassigned, and since the raparm list is
   NULL-terminated this can make the list shorter and shorter
   and shorter until it is one entry deep.

Yikes :)
* Re: Fix knfsd readahead cache in 2.4.15
  2001-11-26 23:53 ` David S. Miller
@ 2001-11-27  0:05   ` David S. Miller
  2001-11-27  0:29   ` Neil Brown
  2001-11-27  1:25   ` [NFS] " Benjamin LaHaise
  2 siblings, 0 replies; 11+ messages in thread
From: David S. Miller @ 2001-11-27 0:05 UTC (permalink / raw)
To: neilb; +Cc: trond.myklebust, nfs, linux-kernel

   From: "David S. Miller" <davem@redhat.com>
   Date: Mon, 26 Nov 2001 15:53:47 -0800 (PST)

   There are other problems remaining, this function is a logical
   mess.

Replying to myself, I'd suggest a patch like the following to clear
up all the issues discovered:

--- fs/nfsd/vfs.c.~1~	Sun Oct 21 02:47:53 2001
+++ fs/nfsd/vfs.c	Mon Nov 26 16:03:40 2001
@@ -545,32 +545,43 @@
 static inline struct raparms *
 nfsd_get_raparms(dev_t dev, ino_t ino)
 {
-	struct raparms	*ra, **rap, **frap = NULL;
+	struct raparms	*ra, *fra;
 	int depth = 0;
 
-	for (rap = &raparm_cache; (ra = *rap); rap = &ra->p_next) {
+	ra = raparm_cache;
+	fra = NULL;
+	do {
 		if (ra->p_ino == ino && ra->p_dev == dev)
 			goto found;
+		depth++;
+
 		if (ra->p_count == 0)
-			frap = rap;
-	}
-	depth = nfsdstats.ra_size*11/10;
-	if (!frap)
+			fra = ra;
+
+		ra = ra->p_next;
+	} while (ra != raparm_cache);
+
+	if (!fra)
 		return NULL;
-	rap = frap;
-	ra = *frap;
-	memset(ra, 0, sizeof(*ra));
+
+	ra = fra;
 	ra->p_dev = dev;
 	ra->p_ino = ino;
+	ra->p_reada = 0;
+	ra->p_ramax = 0;
+	ra->p_raend = 0;
+	ra->p_ralen = 0;
+	ra->p_rawin = 0;
+
 found:
-	if (rap != &raparm_cache) {
-		*rap = ra->p_next;
-		ra->p_next = raparm_cache;
+	if (ra != raparm_cache)
 		raparm_cache = ra;
-	}
+
 	ra->p_count++;
-	nfsdstats.ra_depth[depth*10/nfsdstats.ra_size]++;
+
+	nfsdstats.ra_depth[(depth <= 10 ? depth : 10)]++;
+
 	return ra;
 }
@@ -1576,9 +1587,13 @@
 		dprintk("nfsd: allocating %d readahead buffers.\n",
 			cache_size);
 		memset(raparml, 0, sizeof(struct raparms) * cache_size);
+
+		/* Circular list. */
 		for (i = 0; i < cache_size - 1; i++) {
 			raparml[i].p_next = raparml + i + 1;
 		}
+		raparml[i].p_next = raparml;
+
 		raparm_cache = raparml;
 	} else {
 		printk(KERN_WARNING
* Re: Fix knfsd readahead cache in 2.4.15
  2001-11-26 23:53 ` David S. Miller
  2001-11-27  0:05   ` David S. Miller
@ 2001-11-27  0:29   ` Neil Brown
  2001-11-27  0:33     ` David S. Miller
  2001-11-27  1:25   ` [NFS] " Benjamin LaHaise
  2 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2001-11-27 0:29 UTC (permalink / raw)
To: David S. Miller; +Cc: trond.myklebust, nfs, linux-kernel

On Monday November 26, davem@redhat.com wrote:
> From: Neil Brown <neilb@cse.unsw.edu.au>
> Date: Tue, 27 Nov 2001 10:35:26 +1100 (EST)
>
>    This is definitely a bug, but I don't think it is quite as
>    dramatic as you suggest.
>
>    The "struct raparms" that ra points to will almost always be the
>    last one on the list, so ra->p_next will almost always be NULL
>    anyway. Nevertheless, it is a bug and your fix looks good.
>
> There are other problems remaining, this function is a logical
> mess.
>
> 1) depth is computed in the loop, but thrown away.
>    It is basically reassigned to a constant before
>    being used to index the statistics it is for.

Only if the match is not found. If it is found, then we "goto" over
the re-assignment and use the counted value.

>
> 2) raparm_cache is reassigned, and since the raparm list is
>    NULL terminated this can make the list shorter and shorter
>    and shorter until it is one entry deep.

But it doesn't. If the entry that was found is not at the head of the
list, we delete it from the list, and then insert it at the head of
the list. Nothing gets lost.

Given that the stats on my machine still show that it is sometimes
finding entries in the last 10% of the 64-entry cache, I am quite
confident that there are at least 58 entries still in the cache.
Nothing has been lost.

I don't think your patch adds anything. Rewriting the code to use
list.h lists would probably be useful, but currently (except for the
problem Trond found) the code is correct.

NeilBrown

>
> Yikes :)
* Re: Fix knfsd readahead cache in 2.4.15
  2001-11-27  0:29 ` Neil Brown
@ 2001-11-27  0:33   ` David S. Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David S. Miller @ 2001-11-27 0:33 UTC (permalink / raw)
To: neilb; +Cc: trond.myklebust, nfs, linux-kernel

   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Tue, 27 Nov 2001 11:29:47 +1100 (EST)

   I don't think your patch adds anything. Rewriting the code to use
   list.h lists would probably be useful, but currently (except for
   the problem Trond found) the code is correct.

You're right.
* Re: [NFS] Re: Fix knfsd readahead cache in 2.4.15
  2001-11-26 23:53 ` David S. Miller
  2001-11-27  0:05   ` David S. Miller
  2001-11-27  0:29   ` Neil Brown
@ 2001-11-27  1:25   ` Benjamin LaHaise
  2001-11-27 13:54     ` Trond Myklebust
  2 siblings, 1 reply; 11+ messages in thread
From: Benjamin LaHaise @ 2001-11-27 1:25 UTC (permalink / raw)
To: David S. Miller; +Cc: neilb, trond.myklebust, nfs, linux-kernel

On Mon, Nov 26, 2001 at 03:53:47PM -0800, David S. Miller wrote:
> There are other problems remaining, this function is a logical
> mess.

Hint: readahead via the page cache is the way to go...

		-ben
--
Fish.
* Re: [NFS] Re: Fix knfsd readahead cache in 2.4.15
  2001-11-27  1:25 ` [NFS] " Benjamin LaHaise
@ 2001-11-27 13:54   ` Trond Myklebust
  2001-11-27 15:23     ` Benjamin LaHaise
  0 siblings, 1 reply; 11+ messages in thread
From: Trond Myklebust @ 2001-11-27 13:54 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: nfs, linux-kernel

>>>>> " " == Benjamin LaHaise <bcrl@redhat.com> writes:

     > Hint: readahead via the page cache is the way to go...

That's not the problem: knfsd has always done readahead via the page
cache.

The problem is that the generic_file_read() stuff caches info about
the status of readaheads in the 'struct file', which is a volatile
object in NFS (it is released after the NFS request has been
serviced). To work around that, knfsd copies that readahead status
data from the struct file into a private cache, and trots it out
again upon the next READ call to reference that particular inode.

Cheers,
  Trond
* Re: [NFS] Re: Fix knfsd readahead cache in 2.4.15
  2001-11-27 13:54 ` Trond Myklebust
@ 2001-11-27 15:23   ` Benjamin LaHaise
  2001-11-27 15:44     ` Trond Myklebust
  0 siblings, 1 reply; 11+ messages in thread
From: Benjamin LaHaise @ 2001-11-27 15:23 UTC (permalink / raw)
To: Trond Myklebust; +Cc: nfs, linux-kernel

On Tue, Nov 27, 2001 at 02:54:46PM +0100, Trond Myklebust wrote:
> >>>>> " " == Benjamin LaHaise <bcrl@redhat.com> writes:
>
>      > Hint: readahead via the page cache is the way to go...
>
> That's not the problem: knfsd has always done readahead via the page
> cache.

Sorry for not being clear, but what I was referring to is making the
decision about how to read ahead based on what is already in the page
cache. It has a number of benefits that database people are after, as
it allows multiple threads using pread/pwrite to obtain the benefits
of readahead.

		-ben
* Re: [NFS] Re: Fix knfsd readahead cache in 2.4.15
  2001-11-27 15:23 ` Benjamin LaHaise
@ 2001-11-27 15:44   ` Trond Myklebust
  2001-11-28 21:03     ` Benjamin LaHaise
  0 siblings, 1 reply; 11+ messages in thread
From: Trond Myklebust @ 2001-11-27 15:44 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: nfs, linux-kernel

>>>>> " " == Benjamin LaHaise <bcrl@redhat.com> writes:

     > Sorry for not being clear, but what I was referring to is
     > making the decision about how to read ahead by what is already
     > in the page cache. It has a number of benefits that database
     > people are after as it allows multiple threads using
     > pread/pwrite to obtain the benefits of readahead.

Ah, OK. Sorry for being so dense ;-)

Yeah, that would be an improvement for knfsd too. The current scheme
can only manage readahead for 1 reader per inode.
Are there any records of a more detailed discussion of how to go
about implementing such a readahead strategy for Linux?

Cheers,
  Trond
* Re: [NFS] Re: Fix knfsd readahead cache in 2.4.15
  2001-11-27 15:44 ` Trond Myklebust
@ 2001-11-28 21:03   ` Benjamin LaHaise
  0 siblings, 0 replies; 11+ messages in thread
From: Benjamin LaHaise @ 2001-11-28 21:03 UTC (permalink / raw)
To: Trond Myklebust; +Cc: nfs, linux-kernel

On Tue, Nov 27, 2001 at 04:44:30PM +0100, Trond Myklebust wrote:
> Ah, OK. Sorry for being so dense ;-)
>
> Yeah, that would be an improvement for knfsd too. The current scheme
> can only manage readahead for 1 reader per inode.
> Are there any records of a more detailed discussion of how to go
> about implementing such a readahead strategy for Linux?

Not yet. I'm hoping to write such a patch in the next couple of
months for 2.5.

		-ben
--
Fish.
end of thread, other threads:[~2001-11-28 21:03 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2001-11-26 13:50 Fix knfsd readahead cache in 2.4.15 Trond Myklebust
2001-11-26 23:35 ` Neil Brown
2001-11-26 23:53   ` David S. Miller
2001-11-27  0:05     ` David S. Miller
2001-11-27  0:29     ` Neil Brown
2001-11-27  0:33       ` David S. Miller
2001-11-27  1:25     ` [NFS] " Benjamin LaHaise
2001-11-27 13:54       ` Trond Myklebust
2001-11-27 15:23         ` Benjamin LaHaise
2001-11-27 15:44           ` Trond Myklebust
2001-11-28 21:03             ` Benjamin LaHaise