* Fix knfsd readahead cache in 2.4.15
@ 2001-11-26 13:50 Trond Myklebust
2001-11-26 23:35 ` Neil Brown
0 siblings, 1 reply; 11+ messages in thread
From: Trond Myklebust @ 2001-11-26 13:50 UTC (permalink / raw)
To: Neil Brown; +Cc: NFS maillist, Linux Kernel
Hi Neil,
The following patch fixes a bug in the knfsd readahead code. The
memset() that is referenced in the patch below is clobbering the
pointer to the next list element (ra->p_next), thus reducing the inode
readahead cache to 1 entry upon the very first call to
nfsd_get_raparms().
BTW: looking at the choice of cache size. Why is this set to number
of threads * 2? Isn't it better to have a minimum cache size? After
all, the fact that I have 8 threads running does not at all reflect
the number of inodes that I might have open on my various clients...
Cheers,
Trond
--- linux-2.4.16-pre1/fs/nfsd/vfs.c.orig Fri Oct 5 21:23:53 2001
+++ linux-2.4.16-pre1/fs/nfsd/vfs.c Mon Nov 26 14:32:09 2001
@@ -560,9 +560,13 @@
 		return NULL;
 	rap = frap;
 	ra = *frap;
-	memset(ra, 0, sizeof(*ra));
 	ra->p_dev = dev;
 	ra->p_ino = ino;
+	ra->p_reada = 0;
+	ra->p_ramax = 0;
+	ra->p_raend = 0;
+	ra->p_ralen = 0;
+	ra->p_rawin = 0;
 found:
 	if (rap != &raparm_cache) {
 		*rap = ra->p_next;
^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Fix knfsd readahead cache in 2.4.15
  2001-11-26 13:50 Fix knfsd readahead cache in 2.4.15 Trond Myklebust
@ 2001-11-26 23:35 ` Neil Brown
  2001-11-26 23:53   ` David S. Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2001-11-26 23:35 UTC (permalink / raw)
To: trond.myklebust; +Cc: NFS maillist, Linux Kernel

On Monday November 26, trond.myklebust@fys.uio.no wrote:
> Hi Neil,
>
> The following patch fixes a bug in the knfsd readahead code. The
> memset() that is referenced in the patch below is clobbering the
> pointer to the next list element (ra->p_next), thus reducing the inode
> readahead cache to 1 entry upon the very first call to
> nfsd_get_raparms().

Thanks.
This is definitely a bug, but I don't think it is quite as dramatic as
you suggest.

The "struct raparms" that ra points to will almost always be the last
one on the list, so ra->p_next will almost always be NULL anyway.
Nevertheless, it is a bug and your fix looks good.

>
> BTW: looking at the choice of cache size. Why is this set to number
> of threads * 2? Isn't it better to have a minimum cache size? After
> all, the fact that I have 8 threads running does not at all reflect
> the number of inodes that I might have open on my various clients...

Historical reasons... That's the way it was when I started looking at
the code, and I never saw any reason to change it.
I did add some statistics gathering so that you could watch the usage
of the racache to see if it was big enough, but I haven't looked at
any numbers yet...

I have a fairly busy nfs server that has been running 2.4.something
for 20 days with 32 nfsd threads. The "ra" line from the stats file
(/proc/net/rpc/nfsd) is:

  ra 64 8921281 74404 43817 25150 18806 17611 12894 11859 8342 6919 1605176

which says that 64 entries were allocated,
  8921281 searches found a hit in the first 10% of the cache,
    74404 searches found a hit in the next 10%,
  and so on...
     6919 searches found a hit in the last 10%,
  1605176 searches (20%) didn't find a hit.

Obviously these numbers would vary a lot with different loads. I
suspect that the 20% miss rate reflects average file size, as the
first read request on a file will always miss. Maybe we have an
average file size of 40K (5 times the 8K read size).

However, this set of numbers suggests to me that 64 entries is
reasonable for us. Doubling the number of entries would reduce the
miss rate by less than 10*6919. Probably only 30,000 at most, which
is 2%.

Ideally the size should be tunable. Then you could monitor these
stats over several days, and adjust the size until the number of
searches that hit in the last 10% is suitably low.

But then again, maybe we should store the cache in a hash table for
faster lookup. It would make the stats harder to collect, but would
make searches faster, which is more important.

The bottom line is that it is a heuristic that seems to work well
enough. The cache needs to be at least as big as the number of
server threads so that each thread can always find an entry when it
needs one. Making it twice that big seems to work.

NeilBrown

>
> Cheers,
>   Trond
>
> --- linux-2.4.16-pre1/fs/nfsd/vfs.c.orig	Fri Oct  5 21:23:53 2001
> +++ linux-2.4.16-pre1/fs/nfsd/vfs.c	Mon Nov 26 14:32:09 2001
> @@ -560,9 +560,13 @@
>  		return NULL;
>  	rap = frap;
>  	ra = *frap;
> -	memset(ra, 0, sizeof(*ra));
>  	ra->p_dev = dev;
>  	ra->p_ino = ino;
> +	ra->p_reada = 0;
> +	ra->p_ramax = 0;
> +	ra->p_raend = 0;
> +	ra->p_ralen = 0;
> +	ra->p_rawin = 0;
>  found:
>  	if (rap != &raparm_cache) {
>  		*rap = ra->p_next;
* Re: Fix knfsd readahead cache in 2.4.15
  2001-11-26 23:35 ` Neil Brown
@ 2001-11-26 23:53   ` David S. Miller
  2001-11-27  0:05     ` David S. Miller
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: David S. Miller @ 2001-11-26 23:53 UTC (permalink / raw)
To: neilb; +Cc: trond.myklebust, nfs, linux-kernel

   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Tue, 27 Nov 2001 10:35:26 +1100 (EST)

   This is definitely a bug, but I don't think it is quite as dramatic
   as you suggest.

   The "struct raparms" that ra points to will almost always be the
   last one on the list, so ra->p_next will almost always be NULL
   anyway. Nevertheless, it is a bug and your fix looks good.

There are other problems remaining; this function is a logical mess.

1) depth is computed in the loop, but thrown away.
   It is basically reassigned to a constant before
   being used to index the statistics it is for.

2) raparm_cache is reassigned, and since the raparm list is
   NULL-terminated this can make the list shorter and shorter
   and shorter until it is one entry deep.

Yikes :)
* Re: Fix knfsd readahead cache in 2.4.15
  2001-11-26 23:53 ` David S. Miller
@ 2001-11-27  0:05   ` David S. Miller
  2001-11-27  0:29   ` Neil Brown
  2001-11-27  1:25   ` [NFS] " Benjamin LaHaise
  2 siblings, 0 replies; 11+ messages in thread
From: David S. Miller @ 2001-11-27 0:05 UTC (permalink / raw)
To: neilb; +Cc: trond.myklebust, nfs, linux-kernel

   From: "David S. Miller" <davem@redhat.com>
   Date: Mon, 26 Nov 2001 15:53:47 -0800 (PST)

   There are other problems remaining, this function is a logical
   mess.

Replying to myself, I'd suggest a patch like the following to clear
up all the issues discovered:

--- fs/nfsd/vfs.c.~1~	Sun Oct 21 02:47:53 2001
+++ fs/nfsd/vfs.c	Mon Nov 26 16:03:40 2001
@@ -545,32 +545,43 @@
 static inline struct raparms *
 nfsd_get_raparms(dev_t dev, ino_t ino)
 {
-	struct raparms	*ra, **rap, **frap = NULL;
+	struct raparms	*ra, *fra;
 	int depth = 0;
 
-	for (rap = &raparm_cache; (ra = *rap); rap = &ra->p_next) {
+	ra = raparm_cache;
+	fra = NULL;
+	do {
 		if (ra->p_ino == ino && ra->p_dev == dev)
 			goto found;
+		depth++;
+
 		if (ra->p_count == 0)
-			frap = rap;
-	}
-	depth = nfsdstats.ra_size*11/10;
-	if (!frap)
+			fra = ra;
+
+		ra = ra->p_next;
+	} while (ra != raparm_cache);
+
+	if (!fra)
 		return NULL;
-	rap = frap;
-	ra = *frap;
-	memset(ra, 0, sizeof(*ra));
+
+	ra = fra;
 	ra->p_dev = dev;
 	ra->p_ino = ino;
+	ra->p_reada = 0;
+	ra->p_ramax = 0;
+	ra->p_raend = 0;
+	ra->p_ralen = 0;
+	ra->p_rawin = 0;
+
 found:
-	if (rap != &raparm_cache) {
-		*rap = ra->p_next;
-		ra->p_next = raparm_cache;
+	if (ra != raparm_cache)
 		raparm_cache = ra;
-	}
+
 	ra->p_count++;
-	nfsdstats.ra_depth[depth*10/nfsdstats.ra_size]++;
+
+	nfsdstats.ra_depth[(depth <= 10 ? depth : 10)]++;
+
 	return ra;
 }
@@ -1576,9 +1587,13 @@
 		dprintk("nfsd: allocating %d readahead buffers.\n",
 			cache_size);
 		memset(raparml, 0, sizeof(struct raparms) * cache_size);
+
+		/* Circular list. */
 		for (i = 0; i < cache_size - 1; i++) {
 			raparml[i].p_next = raparml + i + 1;
 		}
+		raparml[i].p_next = raparml;
+
 		raparm_cache = raparml;
 	} else {
 		printk(KERN_WARNING
* Re: Fix knfsd readahead cache in 2.4.15
  2001-11-26 23:53 ` David S. Miller
  2001-11-27  0:05   ` David S. Miller
@ 2001-11-27  0:29   ` Neil Brown
  2001-11-27  0:33     ` David S. Miller
  2001-11-27  1:25   ` [NFS] " Benjamin LaHaise
  2 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2001-11-27 0:29 UTC (permalink / raw)
To: David S. Miller; +Cc: trond.myklebust, nfs, linux-kernel

On Monday November 26, davem@redhat.com wrote:
> From: Neil Brown <neilb@cse.unsw.edu.au>
> Date: Tue, 27 Nov 2001 10:35:26 +1100 (EST)
>
>    This is definitely a bug, but I don't think it is quite as
>    dramatic as you suggest.
>
>    The "struct raparms" that ra points to will almost always be the
>    last one on the list, so ra->p_next will almost always be NULL
>    anyway. Nevertheless, it is a bug and your fix looks good.
>
> There are other problems remaining, this function is a logical
> mess.
>
> 1) depth is computed in the loop, but thrown away.
>    It is basically reassigned to a constant before
>    being used to index the statistics it is for.

Only if the match is not found. If it is found, then we "goto" over
the re-assignment and use the counted value.

>
> 2) raparm_cache is reassigned, and since the raparm list is
>    NULL terminated this can make the list shorter and shorter
>    and shorter until it is one entry deep.

But it doesn't. If the entry that was found is not at the head of the
list, we delete it from the list, and then insert it at the head of
the list. Nothing gets lost.

Given that the stats on my machine still show that it is sometimes
finding entries in the last 10% of the 64-entry cache, I am quite
confident that there are at least 58 entries still in the cache.
Nothing has been lost.

I don't think your patch adds anything. Rewriting the code to use
list.h lists would probably be useful, but currently (except for the
problem Trond found) the code is correct.

NeilBrown

>
> Yikes :)
* Re: Fix knfsd readahead cache in 2.4.15
  2001-11-27  0:29 ` Neil Brown
@ 2001-11-27  0:33   ` David S. Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David S. Miller @ 2001-11-27 0:33 UTC (permalink / raw)
To: neilb; +Cc: trond.myklebust, nfs, linux-kernel

   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Tue, 27 Nov 2001 11:29:47 +1100 (EST)

   I don't think your patch adds anything. Rewriting the code to use
   list.h lists would probably be useful, but currently (except for
   the problem Trond found) the code is correct.

You're right.
* Re: [NFS] Re: Fix knfsd readahead cache in 2.4.15
  2001-11-26 23:53 ` David S. Miller
  2001-11-27  0:05   ` David S. Miller
  2001-11-27  0:29   ` Neil Brown
@ 2001-11-27  1:25   ` Benjamin LaHaise
  2001-11-27 13:54     ` Trond Myklebust
  2 siblings, 1 reply; 11+ messages in thread
From: Benjamin LaHaise @ 2001-11-27 1:25 UTC (permalink / raw)
To: David S. Miller; +Cc: neilb, trond.myklebust, nfs, linux-kernel

On Mon, Nov 26, 2001 at 03:53:47PM -0800, David S. Miller wrote:
> There are other problems remaining, this function is a logical
> mess.

Hint: readahead via the page cache is the way to go...

		-ben
--
Fish.
* Re: [NFS] Re: Fix knfsd readahead cache in 2.4.15
  2001-11-27  1:25 ` [NFS] " Benjamin LaHaise
@ 2001-11-27 13:54   ` Trond Myklebust
  2001-11-27 15:23     ` Benjamin LaHaise
  0 siblings, 1 reply; 11+ messages in thread
From: Trond Myklebust @ 2001-11-27 13:54 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: nfs, linux-kernel

>>>>> " " == Benjamin LaHaise <bcrl@redhat.com> writes:

     > Hint: readahead via the page cache is the way to go...

That's not the problem: knfsd has always done readahead via the page
cache.

The problem is that the generic_file_read() stuff caches info about
the status of readaheads in the 'struct file', which is a volatile
object in NFS (it is released after the NFS request has been
serviced). To work around that, knfsd copies that readahead status
data from the struct file into a private cache, and trots it out
again upon the next READ call to reference that particular inode.

Cheers,
  Trond
* Re: [NFS] Re: Fix knfsd readahead cache in 2.4.15
  2001-11-27 13:54 ` Trond Myklebust
@ 2001-11-27 15:23   ` Benjamin LaHaise
  2001-11-27 15:44     ` Trond Myklebust
  0 siblings, 1 reply; 11+ messages in thread
From: Benjamin LaHaise @ 2001-11-27 15:23 UTC (permalink / raw)
To: Trond Myklebust; +Cc: nfs, linux-kernel

On Tue, Nov 27, 2001 at 02:54:46PM +0100, Trond Myklebust wrote:
> >>>>> " " == Benjamin LaHaise <bcrl@redhat.com> writes:
>
>      > Hint: readahead via the page cache is the way to go...
>
> That's not the problem: knfsd has always done readahead via the page
> cache.

Sorry for not being clear, but what I was referring to is making the
decision about how to read ahead based on what is already in the page
cache. It has a number of benefits that database people are after, as
it allows multiple threads using pread/pwrite to obtain the benefits
of readahead.

		-ben
* Re: [NFS] Re: Fix knfsd readahead cache in 2.4.15
  2001-11-27 15:23 ` Benjamin LaHaise
@ 2001-11-27 15:44   ` Trond Myklebust
  2001-11-28 21:03     ` Benjamin LaHaise
  0 siblings, 1 reply; 11+ messages in thread
From: Trond Myklebust @ 2001-11-27 15:44 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: nfs, linux-kernel

>>>>> " " == Benjamin LaHaise <bcrl@redhat.com> writes:

     > Sorry for not being clear, but what I was referring to is
     > making the decision about how to read ahead by what is already
     > in the page cache. It has a number of benefits that database
     > people are after as it allows multiple threads using
     > pread/pwrite to obtain the benefits of readahead.

Ah, OK. Sorry for being so dense ;-)

Yeah, that would be an improvement for knfsd too. The current scheme
can only manage readahead for 1 reader per inode.
Are there any records of a more detailed discussion of how to go
about implementing such a readahead strategy for Linux?

Cheers,
  Trond
* Re: [NFS] Re: Fix knfsd readahead cache in 2.4.15
  2001-11-27 15:44 ` Trond Myklebust
@ 2001-11-28 21:03   ` Benjamin LaHaise
  0 siblings, 0 replies; 11+ messages in thread
From: Benjamin LaHaise @ 2001-11-28 21:03 UTC (permalink / raw)
To: Trond Myklebust; +Cc: nfs, linux-kernel

On Tue, Nov 27, 2001 at 04:44:30PM +0100, Trond Myklebust wrote:
> Ah, OK. Sorry for being so dense ;-)
>
> Yeah, that would be an improvement for knfsd too. The current scheme
> can only manage readahead for 1 reader per inode.
> Are there any records of a more detailed discussion of how to go
> about implementing such a readahead strategy for Linux?

Not yet. I'm hoping to write such a patch in the next couple of
months for 2.5.

		-ben
--
Fish.
end of thread, other threads:[~2001-11-28 21:03 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2001-11-26 13:50 Fix knfsd readahead cache in 2.4.15 Trond Myklebust
2001-11-26 23:35 ` Neil Brown
2001-11-26 23:53   ` David S. Miller
2001-11-27  0:05     ` David S. Miller
2001-11-27  0:29     ` Neil Brown
2001-11-27  0:33       ` David S. Miller
2001-11-27  1:25     ` [NFS] " Benjamin LaHaise
2001-11-27 13:54       ` Trond Myklebust
2001-11-27 15:23         ` Benjamin LaHaise
2001-11-27 15:44           ` Trond Myklebust
2001-11-28 21:03             ` Benjamin LaHaise