* nfs client readdir caching issue?

From: Andy Chittenden
Date: 2008-07-02 11:03 UTC
To: linux-kernel

Very rarely, we're seeing various problems on a Linux kernel client (seen
on various versions) with ls on directories from an NFS server that
haven't changed:

* looping ls (strace -v shows getdents returning the same names over
  again).
* duplicate directory entries.
* missing directory entries.

I've searched Google but can only find problems where NFS servers have
returned duplicate cookies. I've packet-captured the readdirplus on one
of the directories and see no duplicate cookies. The problems remain
until the directory is touched, the NFS server is unmounted, or some
other event happens (the data is flushed from the cache?).

I think we then got lucky and got two packet captures from different
clients running the same Linux kernel. On these clients, the ls output
was OK - no loops, no duplicates, no missing entries. Both captures
showed two readdirplus requests returning the same entries in the same
order, but the amount of data in the responses was different. One capture
showed the server returned 1724 bytes, 10 entries, and a last cookie of
12, followed by the next readdirplus returning a length of 948 bytes, 5
entries, and a first cookie value of 13. In the other capture, the
responses returned 2204 bytes, 13 entries, and a last cookie of 17, then
468 bytes, 2 entries, and a first cookie of 19.

In the past we've found that ls has returned duplicate entries on this
directory (but we didn't have a capture at the time), and those duplicate
entries are the ones that are returned as the last 3 entries in the first
response of the second capture and the first 3 entries in the second
response of the first capture.

So what I think has happened in this particular case is that at some
point in the past, the directory was read OK with packets similar to the
first capture.
Next, the client decided to get rid of the first page of cached readdir
responses from memory for some reason (running low on memory?) but kept
the second page. Subsequently, the readdir cache needed repopulating, so
the client sent a readdirplus specifying a cookie of 0, and this time it
got a response similar to the first packet of the second capture - and
thus we now have duplicate names and cookie values in the cache.

So is this possible? Is there some easy way to provoke it? Does this mean
the client's readdir cache is broken?

Please cc me on any response.

--
Andy, BlueArc Engineering
* Re: nfs client readdir caching issue?

From: Trond Myklebust
Date: 2008-07-02 17:37 UTC
To: Andy Chittenden; +Cc: linux-kernel

On Wed, 2008-07-02 at 12:03 +0100, Andy Chittenden wrote:
[...]
> So what I think has happened in this particular case, is that at some
> point in the past, the directory was read OK with packets similar to the
> first capture. Next, the client decided to get rid of the first page of
> cached readdir responses from memory for some reason (running low on
> memory?) but kept the second page. Subsequently, the readdir cache needs
> repopulating so the client sends a readdirplus specifying cookie of 0
> and this time it gets a response which is similar to the first packet of
> the second capture and thus we now have in cache duplicate names and
> cookie values.
>
> So is this possible? Is there some easy way to provoke it? Does this
> mean the client's readdir cache is broken?

If so, then invalidate_inode_pages2_range() would have to be broken: we
always clear the readdir cache immediately after reading in the page with
index 0 (i.e. the first readdir page). It shouldn't be possible for
another thread to race with that cache invalidation either, since the
entire readdir() call is protected by the parent directory's
inode->i_mutex.

Cheers
  Trond
* RE: nfs client readdir caching issue?

From: Andy Chittenden
Date: 2008-07-03 8:47 UTC
To: Trond Myklebust; +Cc: linux-kernel

> If so, then invalidate_inode_pages2_range() would have to be broken: we
> always clear the readdir cache immediately after reading in the page
> with index 0 (i.e. the first readdir page).

I'm confused by the call to invalidate_inode_pages2_range():

	if (page->index == 0)
		invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1);

That's passing in a pgoff_t of 4096 as the start page offset from which
to invalidate, and an enormous number for the end page to invalidate to.
So it looks like the NFS client is trying to invalidate from a *byte*
offset of 4096 (but if that's true, the first page could contain fewer
than 4096 bytes depending on the size of the readdir response it
received - but I'll leave that to one side for the moment).

What's confusing me is that when I look at the implementation of
invalidate_inode_pages2_range(), I see this call to pagevec_lookup():

	pagevec_lookup(&pvec, mapping, next,
		min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)

Looking at the pagevec_lookup() comments, it claims the fourth parameter
is a number of pages:

	* @nr_pages: The maximum number of pages

So how can (end - next) be a number of pages? (next will be 4096 in the
call from the NFS client.) I.e. it looks like
invalidate_inode_pages2_range() is expecting a page range (as the name
suggests). I.e. I'm wondering whether the call to
invalidate_inode_pages2_range() should be:

	if (page->index == 0)
		invalidate_inode_pages2_range(inode->i_mapping, 1, -1);

FWIW, we've also seen this on larger directories, so I'm wondering what
would happen if a readdir part way down the cookie chain returned more
data (or less) than it did last time. I.e. if the above is correct, then
replace the two lines with:

	invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1);

I.e. purge the rest of the pages for the inode.

--
Andy, BlueArc Engineering
* RE: nfs client readdir caching issue?

From: Trond Myklebust
Date: 2008-07-07 17:20 UTC
To: Andy Chittenden; +Cc: linux-kernel

On Thu, 2008-07-03 at 09:47 +0100, Andy Chittenden wrote:
[...]
> IE I'm wondering whether the call to
> invalidate_inode_pages2_range should be:
>
>	if (page->index == 0)
>		invalidate_inode_pages2_range(inode->i_mapping, 1, -1);
>
> FWIW We've also seen this on larger directories so I'm wondering what
> would happen if a readdir part way down the cookie chain returned more
> data (or less) than it did last time. IE if the above is correct, then
> replace the two lines with:
>
>	invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1);
>
> IE purge the rest of the pages for the inode.

Hi Andy,

Apologies for the late response. I was travelling on Thursday, and got
caught up with the long weekend due to the July 4th holiday in the US.

Yes, I agree with your comment above: the invalidate_inode_pages2_range()
call is specifying a byte range instead of a page index range, and is
therefore wrong.

Another thought is that individual pages might perhaps get evicted by VM
pressure, in which case we might perhaps want to re-read not only the
evicted page, but all subsequent pages too (in case the server returns
more/less data per page so that the alignment of the next entry changes).

Cheers
  Trond

-------------------------------------------------------------------------

From: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: Fix readdir cache invalidation

invalidate_inode_pages2_range() takes page offset arguments, not byte
ranges.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/nfs/dir.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 58d43da..982a206 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -204,7 +204,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page *page)
	 * Note: assumes we have exclusive access to this mapping either
	 * through inode->i_mutex or some other mechanism.
	 */
-	if (page->index == 0 && invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1) < 0) {
+	if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
		/* Should never happen */
		nfs_zap_mapping(inode, inode->i_mapping);
	}