* [PATCH] NFS: Fix a readdirplus bug @ 2010-11-30 17:42 Trond Myklebust 2010-11-30 22:10 ` Linus Torvalds 0 siblings, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-11-30 17:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-nfs, Trond Myklebust When comparing filehandles in the helper nfs_same_file(), we should not be using 'strncmp()': filehandles are not null terminated strings. Instead, we should just use the existing helper nfs_compare_fh(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 6 +----- 1 files changed, 1 insertions(+), 5 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 8ea4a41..f0a384e 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -395,13 +395,9 @@ int xdr_decode(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, struct x static int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry) { - struct nfs_inode *node; if (dentry->d_inode == NULL) goto different; - node = NFS_I(dentry->d_inode); - if (node->fh.size != entry->fh->size) - goto different; - if (strncmp(node->fh.data, entry->fh->data, node->fh.size) != 0) + if (nfs_compare_fh(entry->fh, NFS_FH(dentry->d_inode)) != 0) goto different; return 1; different: -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* Re: [PATCH] NFS: Fix a readdirplus bug 2010-11-30 17:42 [PATCH] NFS: Fix a readdirplus bug Trond Myklebust @ 2010-11-30 22:10 ` Linus Torvalds 2010-11-30 22:13 ` Trond Myklebust ` (4 more replies) 0 siblings, 5 replies; 70+ messages in thread From: Linus Torvalds @ 2010-11-30 22:10 UTC (permalink / raw) To: Trond Myklebust; +Cc: Linux Kernel Mailing List, linux-nfs On Tue, Nov 30, 2010 at 9:42 AM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > When comparing filehandles in the helper nfs_same_file(), we should not be > using 'strncmp()': filehandles are not null terminated strings. Hmm. Applied, but sadly Nick Bowler <nbowler@elliptictech.com> reports that it doesn't actually fix his problem, so there is still a serious regression there. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] NFS: Fix a readdirplus bug 2010-11-30 22:10 ` Linus Torvalds @ 2010-11-30 22:13 ` Trond Myklebust 2010-12-01 3:47 ` [PATCH 0/3] Fix more NFS readdir regressions Trond Myklebust ` (3 subsequent siblings) 4 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-11-30 22:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-nfs On Tue, 2010-11-30 at 14:10 -0800, Linus Torvalds wrote: > On Tue, Nov 30, 2010 at 9:42 AM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > When comparing filehandles in the helper nfs_same_file(), we should not be > > using 'strncmp()': filehandles are not null terminated strings. > > Hmm. Applied, but sadly Nick Bowler <nbowler@elliptictech.com> reports > that it doesn't actually fix his problem, so there is still a serious > regression there. > > Linus OK. I'll follow up that problem with him. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH 0/3] Fix more NFS readdir regressions 2010-11-30 22:10 ` Linus Torvalds 2010-11-30 22:13 ` Trond Myklebust @ 2010-12-01 3:47 ` Trond Myklebust 2010-12-01 3:47 ` [PATCH 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust ` (2 subsequent siblings) 4 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 3:47 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler; +Cc: Linux Kernel Mailing List, linux-nfs OK. The first patch in this series fixes the regression reported by Nick Bowler when I apply it to my setup. The 2 remaining patches are needed in order to ensure that the VM doesn't free the readdir page cache page while we're trying to read from it, and to ensure that we don't leak memory... Linus, please don't apply these patches quite yet. I'd like to continue tests for a couple more days before I send you the pull request. Cheers Trond Trond Myklebust (3): NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler NFS: lock the readdir page while it is in use NFS: Fix a memory leak in nfs_readdir fs/nfs/dir.c | 25 ++++++++++++++++++++++--- fs/nfs/inode.c | 1 + include/linux/nfs_fs.h | 1 + 3 files changed, 24 insertions(+), 3 deletions(-) -- 1.7.3.2 ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler 2010-11-30 22:10 ` Linus Torvalds 2010-11-30 22:13 ` Trond Myklebust 2010-12-01 3:47 ` [PATCH 0/3] Fix more NFS readdir regressions Trond Myklebust @ 2010-12-01 3:47 ` Trond Myklebust 2010-12-01 15:04 ` Nick Bowler 2010-12-01 3:47 ` [PATCH 2/3] NFS: lock the readdir page while it is in use Trond Myklebust 2010-12-01 3:47 ` [PATCH 3/3] NFS: Fix a memory leak in nfs_readdir Trond Myklebust 4 siblings, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 3:47 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler; +Cc: Linux Kernel Mailing List, linux-nfs We need to use the cookie from the previous array entry, not the actual cookie that we are searching for. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 9 +++++++-- 1 files changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index f0a384e..4a78d2b 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -178,6 +178,7 @@ typedef struct { struct page *page; unsigned long page_index; u64 *dir_cookie; + u64 last_cookie; loff_t current_index; decode_dirent_t decode; @@ -344,6 +345,8 @@ int nfs_readdir_search_array(nfs_readdir_descriptor_t *desc) else status = nfs_readdir_search_for_cookie(array, desc); + if (status == -EAGAIN) + desc->last_cookie = array->last_cookie; nfs_readdir_release_array(desc->page); out: return status; @@ -563,7 +566,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, unsigned int array_size = ARRAY_SIZE(pages); entry.prev_cookie = 0; - entry.cookie = *desc->dir_cookie; + entry.cookie = desc->last_cookie; entry.eof = 0; entry.fh = nfs_alloc_fhandle(); entry.fattr = nfs_alloc_fattr(); @@ -672,8 +675,10 @@ int readdir_search_pagecache(nfs_readdir_descriptor_t *desc) { int res; - if (desc->page_index == 0) + if (desc->page_index == 0) { desc->current_index = 0; + desc->last_cookie = 0; + } while (1) { res = find_cache_page(desc); if (res != -EAGAIN) -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* Re: [PATCH 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler 2010-12-01 3:47 ` [PATCH 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust @ 2010-12-01 15:04 ` Nick Bowler 2010-12-01 15:36 ` [PATCH v2 0/3] Fix more NFS readdir regressions Trond Myklebust ` (3 more replies) 0 siblings, 4 replies; 70+ messages in thread From: Nick Bowler @ 2010-12-01 15:04 UTC (permalink / raw) To: Trond Myklebust; +Cc: Linus Torvalds, Linux Kernel Mailing List, linux-nfs On 2010-11-30 22:47 -0500, Trond Myklebust wrote: > We need to use the cookie from the previous array entry, not the > actual cookie that we are searching for. > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> > --- > fs/nfs/dir.c | 9 +++++++-- > 1 files changed, 7 insertions(+), 2 deletions(-) Indeed, my test script passes with this patch on top of Linus' master, thanks. I'll wait for the updated series before trying the other ones. Tested-by: Nick Bowler <nbowler@elliptictech.com> -- Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/) ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v2 0/3] Fix more NFS readdir regressions 2010-12-01 15:04 ` Nick Bowler @ 2010-12-01 15:36 ` Trond Myklebust 2010-12-01 15:36 ` [PATCH v2 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust ` (2 subsequent siblings) 3 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 15:36 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler; +Cc: Linux Kernel Mailing List, linux-nfs > OK. The first patch in this series fixes the regression reported by > Nick Bowler when I apply it to my setup. > The 2 remaining patches are needed in order to ensure that the > VM doesn't free the readdir page cache page while we're trying to > read from it, and to ensure that we don't leak memory... > Linus, please don't apply these patches quite yet. I'd like to continue > tests for a couple more days before I send you the pull request. v2 fixes the following issues: - Fix up the cookie usage in uncached_readdir() - Changelog fixups for the second patch - .releasepage should use kmap_atomic() so that it can be called from all direct reclaim contexts. - Ensure that .releasepage also clears Pg_uptodate - Set/clear the Pg_private flag to ensure .releasepage gets called when appropriate. - Add a .invalidatepage to ensure truncate_inode_pages() works correctly - Ensure that the anonymous page that is generated by uncached_readdir() doesn't leak memory. Trond Myklebust (3): NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler NFS: lock the readdir page while it is in use NFS: Fix a memory leak in nfs_readdir fs/nfs/dir.c | 48 ++++++++++++++++++++++++++++++++++++++++-------- fs/nfs/inode.c | 1 + include/linux/nfs_fs.h | 1 + 3 files changed, 42 insertions(+), 8 deletions(-) -- 1.7.3.2 ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v2 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler 2010-12-01 15:04 ` Nick Bowler 2010-12-01 15:36 ` [PATCH v2 0/3] Fix more NFS readdir regressions Trond Myklebust @ 2010-12-01 15:36 ` Trond Myklebust 2010-12-01 15:36 ` [PATCH v2 2/3] NFS: lock the readdir page while it is in use Trond Myklebust 2010-12-01 15:36 ` [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir Trond Myklebust 3 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 15:36 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler; +Cc: Linux Kernel Mailing List, linux-nfs We need to use the cookie from the previous array entry, not the actual cookie that we are searching for (except for the case of uncached_readdir). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index f0a384e..e03537f 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -178,6 +178,7 @@ typedef struct { struct page *page; unsigned long page_index; u64 *dir_cookie; + u64 last_cookie; loff_t current_index; decode_dirent_t decode; @@ -344,6 +345,8 @@ int nfs_readdir_search_array(nfs_readdir_descriptor_t *desc) else status = nfs_readdir_search_for_cookie(array, desc); + if (status == -EAGAIN) + desc->last_cookie = array->last_cookie; nfs_readdir_release_array(desc->page); out: return status; @@ -563,7 +566,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, unsigned int array_size = ARRAY_SIZE(pages); entry.prev_cookie = 0; - entry.cookie = *desc->dir_cookie; + entry.cookie = desc->last_cookie; entry.eof = 0; entry.fh = nfs_alloc_fhandle(); entry.fattr = nfs_alloc_fattr(); @@ -672,8 +675,10 @@ int readdir_search_pagecache(nfs_readdir_descriptor_t *desc) { int res; - if (desc->page_index == 0) + if (desc->page_index == 0) { desc->current_index = 0; + desc->last_cookie = 0; + } while (1) { res = find_cache_page(desc); if (res != -EAGAIN) @@ -764,6 +769,7 @@ int uncached_readdir(nfs_readdir_descriptor_t *desc, void *dirent, } desc->page_index = 0; + desc->last_cookie = *desc->dir_cookie; desc->page = page; status = nfs_readdir_xdr_to_array(desc, page, inode); -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* [PATCH v2 2/3] NFS: lock the readdir page while it is in use 2010-12-01 15:04 ` Nick Bowler 2010-12-01 15:36 ` [PATCH v2 0/3] Fix more NFS readdir regressions Trond Myklebust 2010-12-01 15:36 ` [PATCH v2 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust @ 2010-12-01 15:36 ` Trond Myklebust 2010-12-01 15:36 ` [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir Trond Myklebust 3 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 15:36 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler; +Cc: Linux Kernel Mailing List, linux-nfs Before we can fix the memory leak due to not freeing up the contents of the nfs_cache_array when clearing the page cache, we need to ensure that shrink_page_list can't just lock the page and call try_to_release_page() while we're accessing the array. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 16 +++++++++++++++- 1 files changed, 15 insertions(+), 1 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index e03537f..3ec3f1c 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -639,6 +639,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page) static void cache_page_release(nfs_readdir_descriptor_t *desc) { + unlock_page(desc->page); page_cache_release(desc->page); desc->page = NULL; } @@ -646,8 +647,20 @@ void cache_page_release(nfs_readdir_descriptor_t *desc) static struct page *get_cache_page(nfs_readdir_descriptor_t *desc) { - return read_cache_page(desc->file->f_path.dentry->d_inode->i_mapping, + struct page *page; + + for (;;) { + page = read_cache_page(desc->file->f_path.dentry->d_inode->i_mapping, desc->page_index, (filler_t *)nfs_readdir_filler, desc); + if (IS_ERR(page)) + break; + lock_page(page); + if (page->mapping != NULL && PageUptodate(page)) + break; + unlock_page(page); + page_cache_release(page); + } + return page; } /* @@ -771,6 +784,7 @@ int uncached_readdir(nfs_readdir_descriptor_t *desc, void *dirent, desc->page_index = 0; desc->last_cookie = *desc->dir_cookie; desc->page = page; + lock_page(page); status = nfs_readdir_xdr_to_array(desc, page, inode); if (status < 0) -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 15:04 ` Nick Bowler ` (2 preceding siblings ...) 2010-12-01 15:36 ` [PATCH v2 2/3] NFS: lock the readdir page while it is in use Trond Myklebust @ 2010-12-01 15:36 ` Trond Myklebust 2010-12-01 16:17 ` Linus Torvalds 3 siblings, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 15:36 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler; +Cc: Linux Kernel Mailing List, linux-nfs We need to ensure that the entries in the nfs_cache_array get cleared when the page is removed from the page cache. To do so, we use the releasepage address_space operation (which also requires us to set the Pg_private flag). Change nfs_readdir_clear_array to use kmap_atomic(), so that the function can be safely called from all direct reclaim contexts. Finally, modify the cache_page_release helper to call nfs_readdir_clear_array directly, when dealing with an anonymous page from 'uncached_readdir'. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 22 +++++++++++++++++----- fs/nfs/inode.c | 1 + include/linux/nfs_fs.h | 1 + 3 files changed, 19 insertions(+), 5 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 3ec3f1c..4c6319e 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -58,6 +58,7 @@ static int nfs_rename(struct inode *, struct dentry *, static int nfs_fsync_dir(struct file *, int); static loff_t nfs_llseek_dir(struct file *, loff_t, int); static int nfs_readdir_clear_array(struct page*, gfp_t); +static void nfs_readdir_invalidatepage(struct page*, unsigned long); const struct file_operations nfs_dir_operations = { .llseek = nfs_llseek_dir, @@ -85,6 +86,7 @@ const struct inode_operations nfs_dir_inode_operations = { const struct address_space_operations nfs_dir_addr_space_ops = { .releasepage = nfs_readdir_clear_array, + .invalidatepage = nfs_readdir_invalidatepage, }; #ifdef CONFIG_NFS_V3 @@ -216,15 +218,22 @@ void nfs_readdir_release_array(struct page *page) static int nfs_readdir_clear_array(struct page *page, gfp_t mask) { - struct nfs_cache_array *array = nfs_readdir_get_array(page); + struct nfs_cache_array *array; int i; - if (IS_ERR(array)) - return PTR_ERR(array); + array = kmap_atomic(page, KM_USER0); for (i = 0; i < array->size; i++) kfree(array->array[i].string.name); - nfs_readdir_release_array(page); - return 0; + kunmap_atomic(array, KM_USER0); + ClearPageUptodate(page); + ClearPagePrivate(page); + return 1; +} + +static +void nfs_readdir_invalidatepage(struct page *page, unsigned long offset) +{ + nfs_readdir_clear_array(page, 0); } /* @@ -624,6 +633,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page) if (ret < 0) goto error; SetPageUptodate(page); + SetPagePrivate(page); if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) { /* Should never happen */ @@ -639,6 +649,8 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page) static void cache_page_release(nfs_readdir_descriptor_t *desc) { + if (!desc->page->mapping) + nfs_readdir_clear_array(desc->page, GFP_KERNEL); unlock_page(desc->page); page_cache_release(desc->page); desc->page = NULL; diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 314f571..0018e07 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -289,6 +289,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr) } else if (S_ISDIR(inode->i_mode)) { inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops; inode->i_fop = &nfs_dir_operations; + inode->i_data.a_ops = &nfs_dir_addr_space_ops; if (nfs_server_capable(inode, NFS_CAP_READDIRPLUS)) set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags); /* Deal with crossing mountpoints */ diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h index c66fdb7..b5d3ab0 100644 --- a/include/linux/nfs_fs.h +++ b/include/linux/nfs_fs.h @@ -401,6 +401,7 @@ extern const struct inode_operations nfs3_file_inode_operations; #endif /* CONFIG_NFS_V3 */ extern const struct file_operations nfs_file_operations; extern const struct address_space_operations nfs_file_aops; +extern const struct address_space_operations nfs_dir_addr_space_ops; static inline struct nfs_open_context *nfs_file_open_context(struct file *filp) { -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 15:36 ` [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir Trond Myklebust @ 2010-12-01 16:17 ` Linus Torvalds 2010-12-01 16:35 ` Rik van Riel ` (3 more replies) 0 siblings, 4 replies; 70+ messages in thread From: Linus Torvalds @ 2010-12-01 16:17 UTC (permalink / raw) To: Trond Myklebust Cc: Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Rik van Riel, Christoph Hellwig, Al Viro [-- Attachment #1: Type: text/plain, Size: 1741 bytes --] On Wed, Dec 1, 2010 at 7:36 AM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > We need to ensure that the entries in the nfs_cache_array get cleared > when the page is removed from the page cache. To do so, we use the > releasepage address_space operation (which also requires us to set > the Pg_private flag). So I really think that the whole "releasepage" use in NFS is simply overly complicated and was obviously too subtle. The whole need for odd return values, for the page lock, and for the addition of clearing the up-to-date bit comes from the fact that this wasn't really what releasepage was designed for. 'releasepage' was really designed for the filesystem having its own version of 'try_to_free_buffers()', which is just an optimistic "ok, we may be releasing this page, so try to get rid of any IO structures you have cached". It wasn't really a memory management thing. And the thing is, it looks trivial to do the memory management approach by adding a new callback that gets called after the page is actually removed from the page cache. If we do that, then there are no races with any other users, since we remove things from the page cache atomically wrt page cache lookup. So the need for playing games with page locking and 'uptodate' simply goes away. As does the PG_private thing or the interaction with invalidatepage() etc. So this is a TOTALLY UNTESTED trivial patch that just adds another callback. Does this work? I dunno. But I get the feeling that instead of having NFS work around the odd semantics that don't actually match what NFS wants, introducing a new callback with much simpler semantics would be simpler for everybody, and avoid the need for subtle code. Hmm? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1007 bytes --] include/linux/fs.h | 1 + mm/vmscan.c | 3 +++ 2 files changed, 4 insertions(+), 0 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index c9e06cc..090f0ea 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -602,6 +602,7 @@ struct address_space_operations { sector_t (*bmap)(struct address_space *, sector_t); void (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, gfp_t); + void (*freepage)(struct page *); ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long nr_segs); int (*get_xip_mem)(struct address_space *, pgoff_t, int, diff --git a/mm/vmscan.c b/mm/vmscan.c index d31d7ce..1accb01 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) mem_cgroup_uncharge_cache_page(page); } + if (mapping->a_ops->freepage) + mapping->a_ops->freepage(page); + return 1; cannot_free: ^ permalink raw reply related [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 16:17 ` Linus Torvalds @ 2010-12-01 16:35 ` Rik van Riel 2010-12-01 16:45 ` Benny Halevy 2010-12-01 16:47 ` Linus Torvalds 2010-12-01 17:58 ` Trond Myklebust ` (2 subsequent siblings) 3 siblings, 2 replies; 70+ messages in thread From: Rik van Riel @ 2010-12-01 16:35 UTC (permalink / raw) To: Linus Torvalds Cc: Trond Myklebust, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Christoph Hellwig, Al Viro On 12/01/2010 11:17 AM, Linus Torvalds wrote: > So this is a TOTALLY UNTESTED trivial patch that just adds another > callback. Does this work? I dunno. But I get the feeling that instead > of having NFS work around the odd semantics that don't actually match > what NFS wants, introducing a new callback with much simpler semantics > would be simpler for everybody, and avoid the need for subtle code. Surely somebody can have just looked up the page and gotten a reference count, right before your ->freepage call is invoked? CPU A CPU B look up page grab refcount ->freepage use contents of page Am I overlooking something obvious? -- All rights reversed ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 16:35 ` Rik van Riel @ 2010-12-01 16:45 ` Benny Halevy 2010-12-01 16:47 ` Linus Torvalds 1 sibling, 0 replies; 70+ messages in thread From: Benny Halevy @ 2010-12-01 16:45 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Trond Myklebust, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Christoph Hellwig, Al Viro On 2010-12-01 18:35, Rik van Riel wrote: > On 12/01/2010 11:17 AM, Linus Torvalds wrote: > >> So this is a TOTALLY UNTESTED trivial patch that just adds another >> callback. Does this work? I dunno. But I get the feeling that instead >> of having NFS work around the odd semantics that don't actually match >> what NFS wants, introducing a new callback with much simpler semantics >> would be simpler for everybody, and avoid the need for subtle code. > > Surely somebody can have just looked up the page and > gotten a reference count, right before your ->freepage > call is invoked? > > CPU A CPU B > > look up page > grab refcount > ->freepage > > use contents of page > > Am I overlooking something obvious? > The page is not cached any more at this point therefore looking it up won't find it. Benny ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 16:35 ` Rik van Riel 2010-12-01 16:45 ` Benny Halevy @ 2010-12-01 16:47 ` Linus Torvalds 2010-12-01 17:02 ` Rik van Riel 1 sibling, 1 reply; 70+ messages in thread From: Linus Torvalds @ 2010-12-01 16:47 UTC (permalink / raw) To: Rik van Riel Cc: Trond Myklebust, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Christoph Hellwig, Al Viro On Wed, Dec 1, 2010 at 8:35 AM, Rik van Riel <riel@redhat.com> wrote: > > Surely somebody can have just looked up the page and > gotten a reference count, right before your ->freepage > call is invoked? No. The removal from the page cache is atomic, even in the presence of the lockless lookup. The page cache lookup does a "get_page_unless_zero()" on the count, so when __remove_mapping() has removed the page using "page_freeze_refs()", it's really gone, and cannot be looked up. And if that is broken, then we have much more serious problems (like aliasing the same page when doing mmap/read etc), so that's more than just an implementation detail, it's a fundamental requirement of the whole page-cache design. And that's the whole point of adding this callback to the __remove_mapping() stage: that's the _only_ point where we really end up knowing that "yes, we really removed that page, and there are no more users". Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 16:47 ` Linus Torvalds @ 2010-12-01 17:02 ` Rik van Riel 0 siblings, 0 replies; 70+ messages in thread From: Rik van Riel @ 2010-12-01 17:02 UTC (permalink / raw) To: Linus Torvalds Cc: Trond Myklebust, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Christoph Hellwig, Al Viro On 12/01/2010 11:47 AM, Linus Torvalds wrote: > On Wed, Dec 1, 2010 at 8:35 AM, Rik van Riel<riel@redhat.com> wrote: >> >> Surely somebody can have just looked up the page and >> gotten a reference count, right before your ->freepage >> call is invoked? > > No. > > The removal from the page cache is atomic, even in the presence of the > lockless lookup. > > The page cache lookup does a "get_page_unless_zero()" on the count, so > when __remove_mapping() has removed the page using > "page_freeze_refs()", it's really gone, and cannot be looked up. Doh, you're right. I forgot to look at all the stuff that __remove_mapping does nowadays and remembered some very old code from vmscan.c instead. Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 16:17 ` Linus Torvalds 2010-12-01 16:35 ` Rik van Riel @ 2010-12-01 17:58 ` Trond Myklebust 2010-12-01 18:29 ` Miklos Szeredi 2010-12-01 18:54 ` Trond Myklebust 3 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 17:58 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote: > On Wed, Dec 1, 2010 at 7:36 AM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > We need to ensure that the entries in the nfs_cache_array get cleared > > when the page is removed from the page cache. To do so, we use the > > releasepage address_space operation (which also requires us to set > > the Pg_private flag). > > So I really think that the whole "releasepage" use in NFS is simply > overly complicated and was obviously too subtle. > > The whole need for odd return values, for the page lock, and for the > addition of clearing the up-to-date bit comes from the fact that this > wasn't really what releasepage was designed for. > > 'releasepage' was really designed for the filesystem having its own > version of 'try_to_free_buffers()', which is just an optimistic "ok, > we may be releasing this page, so try to get rid of any IO structures > you have cached". It wasn't really a memory management thing. > > And the thing is, it looks trivial to do the memory management > approach by adding a new callback that gets called after the page is > actually removed from the page cache. If we do that, then there are no > races with any other users, since we remove things from the page cache > atomically wrt page cache lookup. So the need for playing games with > page locking and 'uptodate' simply goes away. As does the PG_private > thing or the interaction with invalidatepage() etc. > > So this is a TOTALLY UNTESTED trivial patch that just adds another > callback. Does this work? I dunno. But I get the feeling that instead > of having NFS work around the odd semantics that don't actually match > what NFS wants, introducing a new callback with much simpler semantics > would be simpler for everybody, and avoid the need for subtle code. > > Hmm? > > Linus This works for me, and I can code up a patch that uses it if you like... -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 16:17 ` Linus Torvalds 2010-12-01 16:35 ` Rik van Riel 2010-12-01 17:58 ` Trond Myklebust @ 2010-12-01 18:29 ` Miklos Szeredi 2010-12-01 18:54 ` Trond Myklebust 3 siblings, 0 replies; 70+ messages in thread From: Miklos Szeredi @ 2010-12-01 18:29 UTC (permalink / raw) To: Linus Torvalds Cc: Trond.Myklebust, nbowler, linux-kernel, linux-nfs, akpm, hughd, riel, hch, viro On Wed, 1 Dec 2010, Linus Torvalds wrote: > --00032557542a0c5e6004965ba615 > Content-Type: text/plain; charset=ISO-8859-1 > > On Wed, Dec 1, 2010 at 7:36 AM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > We need to ensure that the entries in the nfs_cache_array get cleared > > when the page is removed from the page cache. To do so, we use the > > releasepage address_space operation (which also requires us to set > > the Pg_private flag). > > So I really think that the whole "releasepage" use in NFS is simply > overly complicated and was obviously too subtle. > > The whole need for odd return values, for the page lock, and for the > addition of clearing the up-to-date bit comes from the fact that this > wasn't really what releasepage was designed for. > > 'releasepage' was really designed for the filesystem having its own > version of 'try_to_free_buffers()', which is just an optimistic "ok, > we may be releasing this page, so try to get rid of any IO structures > you have cached". It wasn't really a memory management thing. > > And the thing is, it looks trivial to do the memory management > approach by adding a new callback that gets called after the page is > actually removed from the page cache. If we do that, then there are no > races with any other users, since we remove things from the page cache > atomically wrt page cache lookup. So the need for playing games with > page locking and 'uptodate' simply goes away. As does the PG_private > thing or the interaction with invalidatepage() etc. > > So this is a TOTALLY UNTESTED trivial patch that just adds another > callback. Does this work? I dunno. But I get the feeling that instead > of having NFS work around the odd semantics that don't actually match > what NFS wants, introducing a new callback with much simpler semantics > would be simpler for everybody, and avoid the need for subtle code. > > Hmm? > > Linus > > --00032557542a0c5e6004965ba615 > Content-Type: text/x-patch; charset=US-ASCII; name="patch.diff" > Content-Disposition: attachment; filename="patch.diff" > Content-Transfer-Encoding: base64 > X-Attachment-Id: f_gh6f5ghm0 > > IGluY2x1ZGUvbGludXgvZnMuaCB8ICAgIDEgKwogbW0vdm1zY2FuLmMgICAgICAgIHwgICAgMyAr > KysKIDIgZmlsZXMgY2hhbmdlZCwgNCBpbnNlcnRpb25zKCspLCAwIGRlbGV0aW9ucygtKQoKZGlm > ZiAtLWdpdCBhL2luY2x1ZGUvbGludXgvZnMuaCBiL2luY2x1ZGUvbGludXgvZnMuaAppbmRleCBj > OWUwNmNjLi4wOTBmMGVhIDEwMDY0NAotLS0gYS9pbmNsdWRlL2xpbnV4L2ZzLmgKKysrIGIvaW5j > bHVkZS9saW51eC9mcy5oCkBAIC02MDIsNiArNjAyLDcgQEAgc3RydWN0IGFkZHJlc3Nfc3BhY2Vf > b3BlcmF0aW9ucyB7CiAJc2VjdG9yX3QgKCpibWFwKShzdHJ1Y3QgYWRkcmVzc19zcGFjZSAqLCBz > ZWN0b3JfdCk7CiAJdm9pZCAoKmludmFsaWRhdGVwYWdlKSAoc3RydWN0IHBhZ2UgKiwgdW5zaWdu > ZWQgbG9uZyk7CiAJaW50ICgqcmVsZWFzZXBhZ2UpIChzdHJ1Y3QgcGFnZSAqLCBnZnBfdCk7CisJ > dm9pZCAoKmZyZWVwYWdlKShzdHJ1Y3QgcGFnZSAqKTsKIAlzc2l6ZV90ICgqZGlyZWN0X0lPKShp > bnQsIHN0cnVjdCBraW9jYiAqLCBjb25zdCBzdHJ1Y3QgaW92ZWMgKmlvdiwKIAkJCWxvZmZfdCBv > ZmZzZXQsIHVuc2lnbmVkIGxvbmcgbnJfc2Vncyk7CiAJaW50ICgqZ2V0X3hpcF9tZW0pKHN0cnVj > dCBhZGRyZXNzX3NwYWNlICosIHBnb2ZmX3QsIGludCwKZGlmZiAtLWdpdCBhL21tL3Ztc2Nhbi5j > IGIvbW0vdm1zY2FuLmMKaW5kZXggZDMxZDdjZS4uMWFjY2IwMSAxMDA2NDQKLS0tIGEvbW0vdm1z > Y2FuLmMKKysrIGIvbW0vdm1zY2FuLmMKQEAgLTQ5OSw2ICs0OTksOSBAQCBzdGF0aWMgaW50IF9f > cmVtb3ZlX21hcHBpbmcoc3RydWN0IGFkZHJlc3Nfc3BhY2UgKm1hcHBpbmcsIHN0cnVjdCBwYWdl > ICpwYWdlKQogCQltZW1fY2dyb3VwX3VuY2hhcmdlX2NhY2hlX3BhZ2UocGFnZSk7CiAJfQogCisJ > aWYgKG1hcHBpbmctPmFfb3BzLT5mcmVlcGFnZSkKKwkJbWFwcGluZy0+YV9vcHMtPmZyZWVwYWdl > KHBhZ2UpOworCiAJcmV0dXJuIDE7CiAKIGNhbm5vdF9mcmVlOgo= > --00032557542a0c5e6004965ba615-- This violates the "Really. Don't send patches as attachments." lkml-faq set forth by yourself ;) Am I the only one who still uses a mail reader that doesn't decode mime by default? Maybe it's time to move on... Thanks, Miklos ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 16:17 ` Linus Torvalds ` (2 preceding siblings ...) 2010-12-01 18:29 ` Miklos Szeredi @ 2010-12-01 18:54 ` Trond Myklebust 2010-12-01 19:23 ` Hugh Dickins 2010-12-01 19:47 ` Linus Torvalds 3 siblings, 2 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 18:54 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote: > On Wed, Dec 1, 2010 at 7:36 AM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > We need to ensure that the entries in the nfs_cache_array get cleared > > when the page is removed from the page cache. To do so, we use the > > releasepage address_space operation (which also requires us to set > > the Pg_private flag). > > So I really think that the whole "releasepage" use in NFS is simply > overly complicated and was obviously too subtle. > > The whole need for odd return values, for the page lock, and for the > addition of clearing the up-to-date bit comes from the fact that this > wasn't really what releasepage was designed for. > > 'releasepage' was really designed for the filesystem having its own > version of 'try_to_free_buffers()', which is just an optimistic "ok, > we may be releasing this page, so try to get rid of any IO structures > you have cached". It wasn't really a memory management thing. > > And the thing is, it looks trivial to do the memory management > approach by adding a new callback that gets called after the page is > actually removed from the page cache. If we do that, then there are no > races with any other users, since we remove things from the page cache > atomically wrt page cache lookup. So the need for playing games with > page locking and 'uptodate' simply goes away. As does the PG_private > thing or the interaction with invalidatepage() etc. > > So this is a TOTALLY UNTESTED trivial patch that just adds another > callback. Does this work? I dunno. But I get the feeling that instead > of having NFS work around the odd semantics that don't actually match > what NFS wants, introducing a new callback with much simpler semantics > would be simpler for everybody, and avoid the need for subtle code. > > Hmm? > > Linus > include/linux/fs.h | 1 + > mm/vmscan.c | 3 +++ > 2 files changed, 4 insertions(+), 0 deletions(-) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index c9e06cc..090f0ea 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -602,6 +602,7 @@ struct address_space_operations { > sector_t (*bmap)(struct address_space *, sector_t); > void (*invalidatepage) (struct page *, unsigned long); > int (*releasepage) (struct page *, gfp_t); > + void (*freepage)(struct page *); > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec > *iov, > loff_t offset, unsigned long nr_segs); > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > diff --git a/mm/vmscan.c b/mm/vmscan.c > index d31d7ce..1accb01 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space > *mapping, struct page *page) > mem_cgroup_uncharge_cache_page(page); > } > > + if (mapping->a_ops->freepage) > + mapping->a_ops->freepage(page); Hmm... Looking again at the problem, it appears that the same callback needs to be added to truncate_complete_page() and invalidate_complete_page2(). Otherwise we end up in a situation where the page can sometimes be removed from the page cache without calling freepage(). > + > return 1; > > cannot_free: Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 18:54 ` Trond Myklebust @ 2010-12-01 19:23 ` Hugh Dickins 2010-12-01 19:52 ` Linus Torvalds 2010-12-01 20:05 ` Trond Myklebust 2010-12-01 19:47 ` Linus Torvalds 1 sibling, 2 replies; 70+ messages in thread From: Hugh Dickins @ 2010-12-01 19:23 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Rik van Riel, Christoph Hellwig, Al Viro, Nick Piggin On Wed, 1 Dec 2010, Trond Myklebust wrote: > On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote: > > On Wed, Dec 1, 2010 at 7:36 AM, Trond Myklebust > > <Trond.Myklebust@netapp.com> wrote: > > > > > > We need to ensure that the entries in the nfs_cache_array get cleared > > > when the page is removed from the page cache. To do so, we use the > > > releasepage address_space operation (which also requires us to set > > > the Pg_private flag). > > > > So I really think that the whole "releasepage" use in NFS is simply > > overly complicated and was obviously too subtle. > > > > The whole need for odd return values, for the page lock, and for the > > addition of clearing the up-to-date bit comes from the fact that this > > wasn't really what releasepage was designed for. > > > > 'releasepage' was really designed for the filesystem having its own > > version of 'try_to_free_buffers()', which is just an optimistic "ok, > > we may be releasing this page, so try to get rid of any IO structures > > you have cached". It wasn't really a memory management thing. > > > > And the thing is, it looks trivial to do the memory management > > approach by adding a new callback that gets called after the page is > > actually removed from the page cache. If we do that, then there are no > > races with any other users, since we remove things from the page cache > > atomically wrt page cache lookup. So the need for playing games with > > page locking and 'uptodate' simply goes away. As does the PG_private > > thing or the interaction with invalidatepage() etc. > > > > So this is a TOTALLY UNTESTED trivial patch that just adds another > > callback. Does this work? I dunno. But I get the feeling that instead > > of having NFS work around the odd semantics that don't actually match > > what NFS wants, introducing a new callback with much simpler semantics > > would be simpler for everybody, and avoid the need for subtle code. > > > > Hmm? > > > > Linus > > > > include/linux/fs.h | 1 + > > mm/vmscan.c | 3 +++ > > 2 files changed, 4 insertions(+), 0 deletions(-) > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index c9e06cc..090f0ea 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -602,6 +602,7 @@ struct address_space_operations { > > sector_t (*bmap)(struct address_space *, sector_t); > > void (*invalidatepage) (struct page *, unsigned long); > > int (*releasepage) (struct page *, gfp_t); > > + void (*freepage)(struct page *); > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec > > *iov, > > loff_t offset, unsigned long nr_segs); > > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index d31d7ce..1accb01 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space > > *mapping, struct page *page) > > mem_cgroup_uncharge_cache_page(page); > > } > > > > + if (mapping->a_ops->freepage) > > + mapping->a_ops->freepage(page); > > Hmm... Looking again at the problem, it appears that the same callback > needs to be added to truncate_complete_page() and > invalidate_complete_page2(). Otherwise we end up in a situation where > the page can sometimes be removed from the page cache without calling > freepage(). > > > + > > return 1; > > > > cannot_free: Yes, I was wondering quite how we would define this ->freepage thing, if it gets called from one place that removes from page cache and not from others. Another minor problem with it: it would probably need to take the struct address_space *mapping as arg as well as struct page *page: because by this time page->mapping has been reset to NULL. But I'm not at all keen on adding a calllback in this very special frozen-to-0-references place: please let's not do it without an okay from Nick Piggin (now Cc'ed). I agree completely with what Linus said originally about how the page cannot be freed while there's a reference to it, and it should be possible to work this without your additional page locks. Your ->releasepage should be able to judge whether the page is likely (not certain) to be freed - page_count 3? 1 for the page cache, 1 for the page_private reference, 1 for vmscan's reference, I think. Then it can mark !PageUptodate and proceed with freeing the stuff you had allocated, undo page_has_private and its reference, and return 1 (or return 0 if it decides to hold on to the page). If something races in and grabs another reference to prevent removal from page cache and freeing, then won't read_cache_page(), seeing !Uptodate, do the right thing and set up the required info again? Or perhaps I haven't looked far enough, and you do have races which actually need your page locks, I can see they make it easier to think about. But I'd prefer us not to throw in another callback if it's well workable with the ->releasepage we already have. (If it helps, perhaps we could adjust shrink_page_list() to allow for the page being removed from page cache inside try_to_release_page() - but I don't think that should be necessary.) Hugh ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 19:23 ` Hugh Dickins @ 2010-12-01 19:52 ` Linus Torvalds 2010-12-01 20:05 ` Trond Myklebust 1 sibling, 0 replies; 70+ messages in thread From: Linus Torvalds @ 2010-12-01 19:52 UTC (permalink / raw) To: Hugh Dickins Cc: Trond Myklebust, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Rik van Riel, Christoph Hellwig, Al Viro, Nick Piggin On Wed, Dec 1, 2010 at 11:23 AM, Hugh Dickins <hughd@google.com> wrote: > > But I'd prefer us not to throw in another callback if it's well > workable with the ->releasepage we already have. The thing is, NFS already really uses releasepage for its "real" designed usage, in the form of the fscache case (which the directory inodes don't use). I really hate mixing things up. The NFS directory case really hasn't got anything to do with releasepage, and taking the page lock on the read side is just really sad. As is marking the page not up-to-date, just so that it will get filled again. In fact, I don't even know if it's kosher by VFS standards to clear the up-to-date bit in the first place after it has already gotten set. It absolutely isn't for anything that can be mmap'ed. Obviously, mmap doesn't work on the directory cache anyway, but the point is that the work-arounds for the semantic gap are really quite ugly. Certainly much uglier than just adding some much simpler semantics to the "now I'm releasing the page" case in the VM. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 19:23 ` Hugh Dickins 2010-12-01 19:52 ` Linus Torvalds @ 2010-12-01 20:05 ` Trond Myklebust 2010-12-01 20:39 ` Andrew Morton 1 sibling, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 20:05 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Rik van Riel, Christoph Hellwig, Al Viro, Nick Piggin On Wed, 2010-12-01 at 11:23 -0800, Hugh Dickins wrote: > On Wed, 1 Dec 2010, Trond Myklebust wrote: > > On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote: > > > include/linux/fs.h | 1 + > > > mm/vmscan.c | 3 +++ > > > 2 files changed, 4 insertions(+), 0 deletions(-) > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > index c9e06cc..090f0ea 100644 > > > --- a/include/linux/fs.h > > > +++ b/include/linux/fs.h > > > @@ -602,6 +602,7 @@ struct address_space_operations { > > > sector_t (*bmap)(struct address_space *, sector_t); > > > void (*invalidatepage) (struct page *, unsigned long); > > > int (*releasepage) (struct page *, gfp_t); > > > + void (*freepage)(struct page *); > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec > > > *iov, > > > loff_t offset, unsigned long nr_segs); > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index d31d7ce..1accb01 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space > > > *mapping, struct page *page) > > > mem_cgroup_uncharge_cache_page(page); > > > } > > > > > > + if (mapping->a_ops->freepage) > > > + mapping->a_ops->freepage(page); > > > > Hmm... Looking again at the problem, it appears that the same callback > > needs to be added to truncate_complete_page() and > > invalidate_complete_page2(). Otherwise we end up in a situation where > > the page can sometimes be removed from the page cache without calling > > freepage(). > > > > > + > > > return 1; > > > > > > cannot_free: > > Yes, I was wondering quite how we would define this ->freepage thing, > if it gets called from one place that removes from page cache and not > from others. > > Another minor problem with it: it would probably need to take the > struct address_space *mapping as arg as well as struct page *page: > because by this time page->mapping has been reset to NULL. > > But I'm not at all keen on adding a calllback in this very special > frozen-to-0-references place: please let's not do it without an okay > from Nick Piggin (now Cc'ed). > > I agree completely with what Linus said originally about how the > page cannot be freed while there's a reference to it, and it should > be possible to work this without your additional page locks. > > Your ->releasepage should be able to judge whether the page is likely > (not certain) to be freed - page_count 3? 1 for the page cache, 1 for > the page_private reference, 1 for vmscan's reference, I think. Then > it can mark !PageUptodate and proceed with freeing the stuff you had > allocated, undo page_has_private and its reference, and return 1 (or > return 0 if it decides to hold on to the page). That is very brittle. I'd prefer not to have to scan linux-mm every week in order to find out if someone changed the page_count. However, while reading Documentation/filesystems/vfs.txt (in order to add documentation for freepage) I was surprised to read that the ->releasepage() is itself supposed to be allowed to actually remove the page from the address space if it so desires. Looking at the actual code in shrink_page_list() and friends I can't see how that can possibly fail to break things, but if it were true, then that might enable us to call remove_mapping() in order to safely free the page before it gets cleared. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 20:05 ` Trond Myklebust @ 2010-12-01 20:39 ` Andrew Morton 2010-12-01 21:29 ` Neil Brown 0 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2010-12-01 20:39 UTC (permalink / raw) To: Trond Myklebust Cc: Hugh Dickins, Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro, Nick Piggin, Neil Brown On Wed, 01 Dec 2010 15:05:38 -0500 Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > On Wed, 2010-12-01 at 11:23 -0800, Hugh Dickins wrote: > > On Wed, 1 Dec 2010, Trond Myklebust wrote: > > > On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote: > > > > include/linux/fs.h | 1 + > > > > mm/vmscan.c | 3 +++ > > > > 2 files changed, 4 insertions(+), 0 deletions(-) > > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > index c9e06cc..090f0ea 100644 > > > > --- a/include/linux/fs.h > > > > +++ b/include/linux/fs.h > > > > @@ -602,6 +602,7 @@ struct address_space_operations { > > > > sector_t (*bmap)(struct address_space *, sector_t); > > > > void (*invalidatepage) (struct page *, unsigned long); > > > > int (*releasepage) (struct page *, gfp_t); > > > > + void (*freepage)(struct page *); > > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec > > > > *iov, > > > > loff_t offset, unsigned long nr_segs); > > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > index d31d7ce..1accb01 100644 > > > > --- a/mm/vmscan.c > > > > +++ b/mm/vmscan.c > > > > @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space > > > > *mapping, struct page *page) > > > > mem_cgroup_uncharge_cache_page(page); > > > > } > > > > > > > > + if (mapping->a_ops->freepage) > > > > + mapping->a_ops->freepage(page); > > > > > > Hmm... Looking again at the problem, it appears that the same callback > > > needs to be added to truncate_complete_page() and > > > invalidate_complete_page2(). Otherwise we end up in a situation where > > > the page can sometimes be removed from the page cache without calling > > > freepage(). > > > > > > > + > > > > return 1; > > > > > > > > cannot_free: > > > > Yes, I was wondering quite how we would define this ->freepage thing, > > if it gets called from one place that removes from page cache and not > > from others. > > > > Another minor problem with it: it would probably need to take the > > struct address_space *mapping as arg as well as struct page *page: > > because by this time page->mapping has been reset to NULL. > > > > But I'm not at all keen on adding a calllback in this very special > > frozen-to-0-references place: please let's not do it without an okay > > from Nick Piggin (now Cc'ed). > > > > I agree completely with what Linus said originally about how the > > page cannot be freed while there's a reference to it, and it should > > be possible to work this without your additional page locks. > > > > Your ->releasepage should be able to judge whether the page is likely > > (not certain) to be freed - page_count 3? 1 for the page cache, 1 for > > the page_private reference, 1 for vmscan's reference, I think. Then > > it can mark !PageUptodate and proceed with freeing the stuff you had > > allocated, undo page_has_private and its reference, and return 1 (or > > return 0 if it decides to hold on to the page). > > That is very brittle. I'd prefer not to have to scan linux-mm every week > in order to find out if someone changed the page_count. > > However, while reading Documentation/filesystems/vfs.txt (in order to > add documentation for freepage) I was surprised to read that the > ->releasepage() is itself supposed to be allowed to actually remove the > page from the address space if it so desires. That doesn't sound right. It came from Neil in 2006. Neil, what were you thinking there? Did you find such a ->releasepage()? > Looking at the actual code in shrink_page_list() and friends I can't see > how that can possibly fail to break things, but if it were true, then > that might enable us to call remove_mapping() in order to safely free > the page before it gets cleared. > ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 20:39 ` Andrew Morton @ 2010-12-01 21:29 ` Neil Brown 2010-12-01 22:43 ` Andrew Morton 0 siblings, 1 reply; 70+ messages in thread From: Neil Brown @ 2010-12-01 21:29 UTC (permalink / raw) To: Andrew Morton Cc: Trond Myklebust, Hugh Dickins, Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro, Nick Piggin On Wed, 1 Dec 2010 12:39:29 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 01 Dec 2010 15:05:38 -0500 > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > > On Wed, 2010-12-01 at 11:23 -0800, Hugh Dickins wrote: > > > On Wed, 1 Dec 2010, Trond Myklebust wrote: > > > > On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote: > > > > > include/linux/fs.h | 1 + > > > > > mm/vmscan.c | 3 +++ > > > > > 2 files changed, 4 insertions(+), 0 deletions(-) > > > > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > > index c9e06cc..090f0ea 100644 > > > > > --- a/include/linux/fs.h > > > > > +++ b/include/linux/fs.h > > > > > @@ -602,6 +602,7 @@ struct address_space_operations { > > > > > sector_t (*bmap)(struct address_space *, sector_t); > > > > > void (*invalidatepage) (struct page *, unsigned long); > > > > > int (*releasepage) (struct page *, gfp_t); > > > > > + void (*freepage)(struct page *); > > > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec > > > > > *iov, > > > > > loff_t offset, unsigned long nr_segs); > > > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > > index d31d7ce..1accb01 100644 > > > > > --- a/mm/vmscan.c > > > > > +++ b/mm/vmscan.c > > > > > @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space > > > > > *mapping, struct page *page) > > > > > mem_cgroup_uncharge_cache_page(page); > > > > > } > > > > > > > > > > + if (mapping->a_ops->freepage) > > > > > + mapping->a_ops->freepage(page); > > > > > > > > Hmm... Looking again at the problem, it appears that the same callback > > > > needs to be added to truncate_complete_page() and > > > > invalidate_complete_page2(). Otherwise we end up in a situation where > > > > the page can sometimes be removed from the page cache without calling > > > > freepage(). > > > > > > > > > + > > > > > return 1; > > > > > > > > > > cannot_free: > > > > > > Yes, I was wondering quite how we would define this ->freepage thing, > > > if it gets called from one place that removes from page cache and not > > > from others. > > > > > > Another minor problem with it: it would probably need to take the > > > struct address_space *mapping as arg as well as struct page *page: > > > because by this time page->mapping has been reset to NULL. > > > > > > But I'm not at all keen on adding a calllback in this very special > > > frozen-to-0-references place: please let's not do it without an okay > > > from Nick Piggin (now Cc'ed). > > > > > > I agree completely with what Linus said originally about how the > > > page cannot be freed while there's a reference to it, and it should > > > be possible to work this without your additional page locks. > > > > > > Your ->releasepage should be able to judge whether the page is likely > > > (not certain) to be freed - page_count 3? 1 for the page cache, 1 for > > > the page_private reference, 1 for vmscan's reference, I think. Then > > > it can mark !PageUptodate and proceed with freeing the stuff you had > > > allocated, undo page_has_private and its reference, and return 1 (or > > > return 0 if it decides to hold on to the page). > > > > That is very brittle. I'd prefer not to have to scan linux-mm every week > > in order to find out if someone changed the page_count. > > > > However, while reading Documentation/filesystems/vfs.txt (in order to > > add documentation for freepage) I was surprised to read that the > > ->releasepage() is itself supposed to be allowed to actually remove the > > page from the address space if it so desires. > > That doesn't sound right. It came from Neil in 2006. > > Neil, what were you thinking there? Did you find such a ->releasepage()? Nope, no idea, sorry. No releasepage functions do anything like that, and no call sites suggest it could be a possibility. Quite the reverse - they are likely to remove the page from the mapping without checking that it is still in the mapping. So that sentence should be deleted. (Getting good review for doco updates is sooo hard :-) NeilBrown ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 21:29 ` Neil Brown @ 2010-12-01 22:43 ` Andrew Morton 2010-12-01 23:01 ` Neil Brown 0 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2010-12-01 22:43 UTC (permalink / raw) To: Neil Brown Cc: Trond Myklebust, Hugh Dickins, Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro, Nick Piggin On Thu, 2 Dec 2010 08:29:13 +1100 Neil Brown <neilb@suse.de> wrote: > > > > > > However, while reading Documentation/filesystems/vfs.txt (in order to > > > add documentation for freepage) I was surprised to read that the > > > ->releasepage() is itself supposed to be allowed to actually remove the > > > page from the address space if it so desires. > > > > That doesn't sound right. It came from Neil in 2006. > > > > Neil, what were you thinking there? Did you find such a ->releasepage()? > > Nope, no idea, sorry. > > No releasepage functions do anything like that, and no call sites suggest it > could be a possibility. Quite the reverse - they are likely to remove the > page from the mapping without checking that it is still in the mapping. > > So that sentence should be deleted. This? From: Andrew Morton <akpm@linux-foundation.org> ->releasepage() does not remove the page from the mapping. Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- Documentation/filesystems/vfs.txt | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff -puN Documentation/filesystems/vfs.txt~documentation-filesystems-vfstxt-fix-repeasepage-description Documentation/filesystems/vfs.txt --- a/Documentation/filesystems/vfs.txt~documentation-filesystems-vfstxt-fix-repeasepage-description +++ a/Documentation/filesystems/vfs.txt @@ -660,11 +660,10 @@ struct address_space_operations { releasepage: releasepage is called on PagePrivate pages to indicate that the page should be freed if possible. ->releasepage should remove any private data from the page and clear the - PagePrivate flag. It may also remove the page from the - address_space. If this fails for some reason, it may indicate - failure with a 0 return value. - This is used in two distinct though related cases. The first - is when the VM finds a clean page with no active users and + PagePrivate flag. If releasepage() fails for some reason, it must + indicate failure with a 0 return value. + releasepage() is used in two distinct though related cases. The + first is when the VM finds a clean page with no active users and wants to make it a free page. If ->releasepage succeeds, the page will be removed from the address_space and become free. _ ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 22:43 ` Andrew Morton @ 2010-12-01 23:01 ` Neil Brown 0 siblings, 0 replies; 70+ messages in thread From: Neil Brown @ 2010-12-01 23:01 UTC (permalink / raw) To: Andrew Morton Cc: Trond Myklebust, Hugh Dickins, Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro, Nick Piggin On Wed, 1 Dec 2010 14:43:14 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > On Thu, 2 Dec 2010 08:29:13 +1100 > Neil Brown <neilb@suse.de> wrote: > > > > > > > > > However, while reading Documentation/filesystems/vfs.txt (in order to > > > > add documentation for freepage) I was surprised to read that the > > > > ->releasepage() is itself supposed to be allowed to actually remove the > > > > page from the address space if it so desires. > > > > > > That doesn't sound right. It came from Neil in 2006. > > > > > > Neil, what were you thinking there? Did you find such a ->releasepage()? > > > > Nope, no idea, sorry. > > > > No releasepage functions do anything like that, and no call sites suggest it > > could be a possibility. Quite the reverse - they are likely to remove the > > page from the mapping without checking that it is still in the mapping. > > > > So that sentence should be deleted. > > This? Perfect, thanks. NeilBrown > > From: Andrew Morton <akpm@linux-foundation.org> > > ->releasepage() does not remove the page from the mapping. > > Cc: Neil Brown <neilb@suse.de> > Cc: Trond Myklebust <Trond.Myklebust@netapp.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > Documentation/filesystems/vfs.txt | 9 ++++----- > 1 file changed, 4 insertions(+), 5 deletions(-) > > diff -puN Documentation/filesystems/vfs.txt~documentation-filesystems-vfstxt-fix-repeasepage-description Documentation/filesystems/vfs.txt > --- a/Documentation/filesystems/vfs.txt~documentation-filesystems-vfstxt-fix-repeasepage-description > +++ a/Documentation/filesystems/vfs.txt > @@ -660,11 +660,10 @@ struct address_space_operations { > releasepage: releasepage is called on PagePrivate pages to indicate > that the page should be freed if possible. ->releasepage > should remove any private data from the page and clear the > - PagePrivate flag. It may also remove the page from the > - address_space. If this fails for some reason, it may indicate > - failure with a 0 return value. > - This is used in two distinct though related cases. The first > - is when the VM finds a clean page with no active users and > + PagePrivate flag. If releasepage() fails for some reason, it must > + indicate failure with a 0 return value. > + releasepage() is used in two distinct though related cases. The > + first is when the VM finds a clean page with no active users and > wants to make it a free page. If ->releasepage succeeds, the > page will be removed from the address_space and become free. > > _ ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 18:54 ` Trond Myklebust 2010-12-01 19:23 ` Hugh Dickins @ 2010-12-01 19:47 ` Linus Torvalds 2010-12-01 20:10 ` Trond Myklebust 1 sibling, 1 reply; 70+ messages in thread From: Linus Torvalds @ 2010-12-01 19:47 UTC (permalink / raw) To: Trond Myklebust Cc: Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Rik van Riel, Christoph Hellwig, Al Viro On Wed, Dec 1, 2010 at 10:54 AM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > Hmm... Looking again at the problem, it appears that the same callback > needs to be added to truncate_complete_page() and > invalidate_complete_page2(). Otherwise we end up in a situation where > the page can sometimes be removed from the page cache without calling > freepage(). Yes, I think any caller of __remove_from_page_cache() should do it once it has dropped all locks. And to be consistent with that rule, even in the __remove_mapping() function I suspect the code to call ->freepage() might as well be done only in the __remove_from_page_cache() case (ie not in the PageSwapCache() case). Then, add the case to the end of "remove_page_cache()" itself, and now it's really easy to just grep for __remove_from_page_cache() and make sure they all do it. That sounds sane, no? Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 19:47 ` Linus Torvalds @ 2010-12-01 20:10 ` Trond Myklebust 2010-12-01 20:18 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 20:10 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 2010-12-01 at 11:47 -0800, Linus Torvalds wrote: > On Wed, Dec 1, 2010 at 10:54 AM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > Hmm... Looking again at the problem, it appears that the same callback > > needs to be added to truncate_complete_page() and > > invalidate_complete_page2(). Otherwise we end up in a situation where > > the page can sometimes be removed from the page cache without calling > > freepage(). > > Yes, I think any caller of __remove_from_page_cache() should do it > once it has dropped all locks. > > And to be consistent with that rule, even in the __remove_mapping() > function I suspect the code to call ->freepage() might as well be done > only in the __remove_from_page_cache() case (ie not in the > PageSwapCache() case). > > Then, add the case to the end of "remove_page_cache()" itself, and now > it's really easy to just grep for __remove_from_page_cache() and make > sure they all do it. > > That sounds sane, no? > > Linus Something like the following then? ----------------------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 20:10 ` Trond Myklebust @ 2010-12-01 20:18 ` Linus Torvalds 2010-12-01 20:31 ` Hugh Dickins 2010-12-01 20:33 ` Andrew Morton 2 siblings, 0 replies; 70+ messages in thread From: Linus Torvalds @ 2010-12-01 20:18 UTC (permalink / raw) To: Trond Myklebust Cc: Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Rik van Riel, Christoph Hellwig, Al Viro On Wed, Dec 1, 2010 at 12:10 PM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > Something like the following then? > diff --git a/mm/truncate.c b/mm/truncate.c > index ba887bf..76ab2a8 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page) > clear_page_mlock(page); > remove_from_page_cache(page); > ClearPageMappedToDisk(page); > + > + if (mapping->a_ops->freepage) > + mapping->a_ops->freepage(page); > + > page_cache_release(page); /* pagecache ref */ > return 0; > } I'd do this in "remove_from_page_cache()" instead of in "truncate_complete_page()". Otherwise it will miss any other callers that use "remove_from_page_cache()" (not that there are many, and I doubt NFS cares about the ones that do exists, but..) Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 20:10 ` Trond Myklebust 2010-12-01 20:18 ` Linus Torvalds @ 2010-12-01 20:31 ` Hugh Dickins 2010-12-01 20:33 ` Andrew Morton 2 siblings, 0 replies; 70+ messages in thread From: Hugh Dickins @ 2010-12-01 20:31 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Rik van Riel, Christoph Hellwig, Al Viro Adding Nick Piggin back to the Cc. I do think it needs to be freepage(struct address_space *, struct page *), even if you happen not to need to know the mapping in the NFS case. On Wed, 1 Dec 2010, Trond Myklebust wrote: > On Wed, 2010-12-01 at 11:47 -0800, Linus Torvalds wrote: > > On Wed, Dec 1, 2010 at 10:54 AM, Trond Myklebust > > <Trond.Myklebust@netapp.com> wrote: > > > > > > Hmm... Looking again at the problem, it appears that the same callback > > > needs to be added to truncate_complete_page() and > > > invalidate_complete_page2(). Otherwise we end up in a situation where > > > the page can sometimes be removed from the page cache without calling > > > freepage(). > > > > Yes, I think any caller of __remove_from_page_cache() should do it > > once it has dropped all locks. > > > > And to be consistent with that rule, even in the __remove_mapping() > > function I suspect the code to call ->freepage() might as well be done > > only in the __remove_from_page_cache() case (ie not in the > > PageSwapCache() case). > > > > Then, add the case to the end of "remove_page_cache()" itself, and now > > it's really easy to just grep for __remove_from_page_cache() and make > > sure they all do it. > > > > That sounds sane, no? > > > > Linus > > Something like the following then? > > ----------------------------------------------------------------------------------------- > From 3a46d5eab1ac6efe9dfaf873e23de7589e0acccc Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Wed, 1 Dec 2010 13:35:19 -0500 > Subject: [PATCH] Call the filesystem back whenever a page is removed from the page cache > > NFS needs to be able to release objects that are stored in the page > cache once the page itself is no longer visible from the page cache. > > This patch adds a callback to the address space operations that allows > filesystems to perform page cleanups once the page has been removed > from the page cache. > > Original patch by: Linus Torvalds <torvalds@linux-foundation.org> > [trondmy: cover the cases of invalidate_inode_pages2() and > truncate_inode_pages()] > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> > --- > Documentation/filesystems/Locking | 5 +++++ > Documentation/filesystems/vfs.txt | 5 +++++ > include/linux/fs.h | 1 + > mm/truncate.c | 8 ++++++++ > mm/vmscan.c | 3 +++ > 5 files changed, 22 insertions(+), 0 deletions(-) > > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking > index a91f308..06d6b71 100644 > --- a/Documentation/filesystems/Locking > +++ b/Documentation/filesystems/Locking > @@ -173,6 +173,7 @@ prototypes: > sector_t (*bmap)(struct address_space *, sector_t); > int (*invalidatepage) (struct page *, unsigned long); > int (*releasepage) (struct page *, int); > + void (*freepage)(struct page *); > int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > int (*launder_page) (struct page *); > @@ -193,6 +194,7 @@ perform_write: no n/a yes > bmap: no > invalidatepage: no yes > releasepage: no yes > +freepage: no yes > direct_IO: no > launder_page: no yes > > @@ -288,6 +290,9 @@ buffers from the page in preparation for freeing it. It returns zero to > indicate that the buffers are (or may be) freeable. If ->releasepage is zero, > the kernel assumes that the fs has no private interest in the buffers. > > + ->freepage() is called when the kernel is done dropping the page > +from the page cache. > + > ->launder_page() may be called prior to releasing a page if > it is still found to be dirty. It returns zero if the page was successfully > cleaned, or an error value if not. Note that in order to prevent the page > diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt > index ed7e5ef..76de6fd 100644 > --- a/Documentation/filesystems/vfs.txt > +++ b/Documentation/filesystems/vfs.txt > @@ -534,6 +534,7 @@ struct address_space_operations { > sector_t (*bmap)(struct address_space *, sector_t); > int (*invalidatepage) (struct page *, unsigned long); > int (*releasepage) (struct page *, int); > + void (*freepage)(struct page *); > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > struct page* (*get_xip_page)(struct address_space *, sector_t, > @@ -679,6 +680,10 @@ struct address_space_operations { > need to ensure this. Possibly it can clear the PageUptodate > bit if it cannot free private data yet. > > + freepage: freepage is called once the page is no longer visible in > + the page cache in order to allow the cleanup of any private > + data. > + > direct_IO: called by the generic read/write routines to perform > direct_IO - that is IO requests which bypass the page cache > and transfer data directly between the storage and the > diff --git a/include/linux/fs.h b/include/linux/fs.h > index c9e06cc..090f0ea 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -602,6 +602,7 @@ struct address_space_operations { > sector_t (*bmap)(struct address_space *, sector_t); > void (*invalidatepage) (struct page *, unsigned long); > int (*releasepage) (struct page *, gfp_t); > + void (*freepage)(struct page *); > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > diff --git a/mm/truncate.c b/mm/truncate.c > index ba887bf..76ab2a8 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page) > clear_page_mlock(page); > remove_from_page_cache(page); > ClearPageMappedToDisk(page); > + > + if (mapping->a_ops->freepage) > + mapping->a_ops->freepage(page); > + > page_cache_release(page); /* pagecache ref */ > return 0; > } > @@ -390,6 +394,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) > __remove_from_page_cache(page); > spin_unlock_irq(&mapping->tree_lock); > mem_cgroup_uncharge_cache_page(page); > + > + if (mapping->a_ops->freepage) > + mapping->a_ops->freepage(page); > + > page_cache_release(page); /* pagecache ref */ > return 1; > failed: > diff --git a/mm/vmscan.c b/mm/vmscan.c > index d31d7ce..c6fc55d 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -497,6 +497,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) > __remove_from_page_cache(page); > spin_unlock_irq(&mapping->tree_lock); > mem_cgroup_uncharge_cache_page(page); > + > + if (mapping->a_ops->freepage) > + mapping->a_ops->freepage(page); > } > > return 1; > -- > 1.7.3.2 ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 20:10 ` Trond Myklebust 2010-12-01 20:18 ` Linus Torvalds 2010-12-01 20:31 ` Hugh Dickins @ 2010-12-01 20:33 ` Andrew Morton 2010-12-01 21:02 ` Hugh Dickins 2 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2010-12-01 20:33 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Hugh Dickins, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 01 Dec 2010 15:10:50 -0500 Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -602,6 +602,7 @@ struct address_space_operations { > sector_t (*bmap)(struct address_space *, sector_t); > void (*invalidatepage) (struct page *, unsigned long); > int (*releasepage) (struct page *, gfp_t); > + void (*freepage)(struct page *); > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > int (*get_xip_mem)(struct address_space *, pgoff_t, int, It would be good to think about and then clearly spell out exactly what state the page is in here. It is locked, and I assume clean and not under writeback. What about its refcount, freezedness status and eligibility for lookups? And as Hugh pointed out, some callees might needs the address_space* although we can perhaps defer that until such a callee turns up. If/when that happens we might have a problem though: if this locked page is no longer attached to the address_space then what now pins the address_space, protecting it from inode reclaim? ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 20:33 ` Andrew Morton @ 2010-12-01 21:02 ` Hugh Dickins 2010-12-01 21:15 ` Hugh Dickins 0 siblings, 1 reply; 70+ messages in thread From: Hugh Dickins @ 2010-12-01 21:02 UTC (permalink / raw) To: Andrew Morton Cc: Trond Myklebust, Linus Torvalds, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 1 Dec 2010, Andrew Morton wrote: > On Wed, 01 Dec 2010 15:10:50 -0500 > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -602,6 +602,7 @@ struct address_space_operations { > > sector_t (*bmap)(struct address_space *, sector_t); > > void (*invalidatepage) (struct page *, unsigned long); > > int (*releasepage) (struct page *, gfp_t); > > + void (*freepage)(struct page *); > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > > loff_t offset, unsigned long nr_segs); > > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > > It would be good to think about and then clearly spell out exactly what > state the page is in here. It is locked, and I assume clean and not > under writeback. What about its refcount, freezedness status and > eligibility for lookups? > > And as Hugh pointed out, some callees might needs the address_space* > although we can perhaps defer that until such a callee turns up. > If/when that happens we might have a problem though: if this locked > page is no longer attached to the address_space then what now pins the > address_space, protecting it from inode reclaim? That's an excellent point and trumps mine: it would be actively wrong to provide the struct address_space *mapping arg I was asking for. (Bet someone then tries stashing it away via page->private though.) Hugh ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 21:02 ` Hugh Dickins @ 2010-12-01 21:15 ` Hugh Dickins 2010-12-01 21:38 ` Andrew Morton 0 siblings, 1 reply; 70+ messages in thread From: Hugh Dickins @ 2010-12-01 21:15 UTC (permalink / raw) To: Andrew Morton Cc: Trond Myklebust, Linus Torvalds, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 1 Dec 2010, Hugh Dickins wrote: > On Wed, 1 Dec 2010, Andrew Morton wrote: > > On Wed, 01 Dec 2010 15:10:50 -0500 > > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > > > > --- a/include/linux/fs.h > > > +++ b/include/linux/fs.h > > > @@ -602,6 +602,7 @@ struct address_space_operations { > > > sector_t (*bmap)(struct address_space *, sector_t); > > > void (*invalidatepage) (struct page *, unsigned long); > > > int (*releasepage) (struct page *, gfp_t); > > > + void (*freepage)(struct page *); > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > > > loff_t offset, unsigned long nr_segs); > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > > > > It would be good to think about and then clearly spell out exactly what > > state the page is in here. It is locked, and I assume clean and not > > under writeback. What about its refcount, freezedness status and > > eligibility for lookups? > > > > And as Hugh pointed out, some callees might needs the address_space* > > although we can perhaps defer that until such a callee turns up. > > If/when that happens we might have a problem though: if this locked > > page is no longer attached to the address_space then what now pins the > > address_space, protecting it from inode reclaim? > > That's an excellent point and trumps mine: it would be actively wrong > to provide the struct address_space *mapping arg I was asking for. > (Bet someone then tries stashing it away via page->private though.) Hmm, thinking further along the same lines: can we even guarantee that the filesystem module is still loaded at that point? i.e. might mapping->freepage now be pointing off into the garbage heap? Hugh ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 21:15 ` Hugh Dickins @ 2010-12-01 21:38 ` Andrew Morton 2010-12-01 21:51 ` Trond Myklebust 0 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2010-12-01 21:38 UTC (permalink / raw) To: Hugh Dickins Cc: Trond Myklebust, Linus Torvalds, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 1 Dec 2010 13:15:07 -0800 (PST) Hugh Dickins <hughd@google.com> wrote: > On Wed, 1 Dec 2010, Hugh Dickins wrote: > > On Wed, 1 Dec 2010, Andrew Morton wrote: > > > On Wed, 01 Dec 2010 15:10:50 -0500 > > > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > > > > > > --- a/include/linux/fs.h > > > > +++ b/include/linux/fs.h > > > > @@ -602,6 +602,7 @@ struct address_space_operations { > > > > sector_t (*bmap)(struct address_space *, sector_t); > > > > void (*invalidatepage) (struct page *, unsigned long); > > > > int (*releasepage) (struct page *, gfp_t); > > > > + void (*freepage)(struct page *); > > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > > > > loff_t offset, unsigned long nr_segs); > > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > > > > > > It would be good to think about and then clearly spell out exactly what > > > state the page is in here. It is locked, and I assume clean and not > > > under writeback. What about its refcount, freezedness status and > > > eligibility for lookups? > > > > > > And as Hugh pointed out, some callees might needs the address_space* > > > although we can perhaps defer that until such a callee turns up. > > > If/when that happens we might have a problem though: if this locked > > > page is no longer attached to the address_space then what now pins the > > > address_space, protecting it from inode reclaim? > > > > That's an excellent point and trumps mine: it would be actively wrong > > to provide the struct address_space *mapping arg I was asking for. > > (Bet someone then tries stashing it away via page->private though.) > > Hmm, thinking further along the same lines: can we even guarantee that > the filesystem module is still loaded at that point? i.e. might > mapping->freepage now be pointing off into the garbage heap? I don't see anything on the VFS side which would prevent a module unload. Or, more realistically, a concurrent unmount, freeing of the superblock and everything associated with it. All we have here is a page*. Probably on most call paths we'll be OK - if a process is in the middle of a file truncate, holdin a file* ref which holds an inode ref then nobody will be unmounting that fs and hence nobody will be unloading that module. However on the random_code->alloc_page->vmscan->releasepage path, none of that applies. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 21:38 ` Andrew Morton @ 2010-12-01 21:51 ` Trond Myklebust 2010-12-01 22:13 ` Andrew Morton 0 siblings, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 21:51 UTC (permalink / raw) To: Andrew Morton Cc: Hugh Dickins, Linus Torvalds, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote: > On Wed, 1 Dec 2010 13:15:07 -0800 (PST) > Hugh Dickins <hughd@google.com> wrote: > > > On Wed, 1 Dec 2010, Hugh Dickins wrote: > > > On Wed, 1 Dec 2010, Andrew Morton wrote: > > > > On Wed, 01 Dec 2010 15:10:50 -0500 > > > > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > > > > > > > > --- a/include/linux/fs.h > > > > > +++ b/include/linux/fs.h > > > > > @@ -602,6 +602,7 @@ struct address_space_operations { > > > > > sector_t (*bmap)(struct address_space *, sector_t); > > > > > void (*invalidatepage) (struct page *, unsigned long); > > > > > int (*releasepage) (struct page *, gfp_t); > > > > > + void (*freepage)(struct page *); > > > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > > > > > loff_t offset, unsigned long nr_segs); > > > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > > > > > > > > It would be good to think about and then clearly spell out exactly what > > > > state the page is in here. It is locked, and I assume clean and not > > > > under writeback. What about its refcount, freezedness status and > > > > eligibility for lookups? > > > > > > > > And as Hugh pointed out, some callees might needs the address_space* > > > > although we can perhaps defer that until such a callee turns up. > > > > If/when that happens we might have a problem though: if this locked > > > > page is no longer attached to the address_space then what now pins the > > > > address_space, protecting it from inode reclaim? > > > > > > That's an excellent point and trumps mine: it would be actively wrong > > > to provide the struct address_space *mapping arg I was asking for. > > > (Bet someone then tries stashing it away via page->private though.) > > > > Hmm, thinking further along the same lines: can we even guarantee that > > the filesystem module is still loaded at that point? i.e. might > > mapping->freepage now be pointing off into the garbage heap? > > I don't see anything on the VFS side which would prevent a module > unload. Or, more realistically, a concurrent unmount, freeing of the > superblock and everything associated with it. All we have here is a > page*. > > Probably on most call paths we'll be OK - if a process is in the middle > of a file truncate, holdin a file* ref which holds an inode ref then > nobody will be unmounting that fs and hence nobody will be unloading > that module. > > However on the random_code->alloc_page->vmscan->releasepage path, none > of that applies. Just out of interest, what ensures that the mapping is still around for the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()? I'm clearly missing whatever mechanism prevents iput_final() from racing with vmscan if the latter clears out the last page from the mapping. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 21:51 ` Trond Myklebust @ 2010-12-01 22:13 ` Andrew Morton 2010-12-01 22:24 ` Linus Torvalds 0 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2010-12-01 22:13 UTC (permalink / raw) To: Trond Myklebust Cc: Hugh Dickins, Linus Torvalds, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 01 Dec 2010 16:51:12 -0500 Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote: > > On Wed, 1 Dec 2010 13:15:07 -0800 (PST) > > Hugh Dickins <hughd@google.com> wrote: > > > > > On Wed, 1 Dec 2010, Hugh Dickins wrote: > > > > On Wed, 1 Dec 2010, Andrew Morton wrote: > > > > > On Wed, 01 Dec 2010 15:10:50 -0500 > > > > > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > > > > > > > > > > --- a/include/linux/fs.h > > > > > > +++ b/include/linux/fs.h > > > > > > @@ -602,6 +602,7 @@ struct address_space_operations { > > > > > > sector_t (*bmap)(struct address_space *, sector_t); > > > > > > void (*invalidatepage) (struct page *, unsigned long); > > > > > > int (*releasepage) (struct page *, gfp_t); > > > > > > + void (*freepage)(struct page *); > > > > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > > > > > > loff_t offset, unsigned long nr_segs); > > > > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > > > > > > > > > > It would be good to think about and then clearly spell out exactly what > > > > > state the page is in here. It is locked, and I assume clean and not > > > > > under writeback. What about its refcount, freezedness status and > > > > > eligibility for lookups? > > > > > > > > > > And as Hugh pointed out, some callees might needs the address_space* > > > > > although we can perhaps defer that until such a callee turns up. > > > > > If/when that happens we might have a problem though: if this locked > > > > > page is no longer attached to the address_space then what now pins the > > > > > address_space, protecting it from inode reclaim? > > > > > > > > That's an excellent point and trumps mine: it would be actively wrong > > > > to provide the struct address_space *mapping arg I was asking for. > > > > (Bet someone then tries stashing it away via page->private though.) > > > > > > Hmm, thinking further along the same lines: can we even guarantee that > > > the filesystem module is still loaded at that point? i.e. might > > > mapping->freepage now be pointing off into the garbage heap? > > > > I don't see anything on the VFS side which would prevent a module > > unload. Or, more realistically, a concurrent unmount, freeing of the > > superblock and everything associated with it. All we have here is a > > page*. > > > > Probably on most call paths we'll be OK - if a process is in the middle > > of a file truncate, holdin a file* ref which holds an inode ref then > > nobody will be unmounting that fs and hence nobody will be unloading > > that module. > > > > However on the random_code->alloc_page->vmscan->releasepage path, none > > of that applies. > > Just out of interest, what ensures that the mapping is still around for > the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()? Nothing, afacit. I think this was the race which I taunted the mm developers about a couple of months back (can't find the email) and nobody contradicted me at that time. > I'm clearly missing whatever mechanism prevents iput_final() from racing > with vmscan if the latter clears out the last page from the mapping. The mechanism is called "luck". Way back in the 2.5.late days there was such a bug in the kernel (inode was reclaimed while vmscan was playing with the address_space) and I was able to trigger oopses from it. It required really massive, withering amounts of memory pressure. Stupid amounts. I should dig out those tools and remember how to operate them... ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 22:13 ` Andrew Morton @ 2010-12-01 22:24 ` Linus Torvalds 2010-12-01 22:38 ` Andrew Morton 2010-12-01 22:43 ` Hugh Dickins 0 siblings, 2 replies; 70+ messages in thread From: Linus Torvalds @ 2010-12-01 22:24 UTC (permalink / raw) To: Andrew Morton Cc: Trond Myklebust, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, Dec 1, 2010 at 2:13 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 01 Dec 2010 16:51:12 -0500 > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > >> On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote: >> > Probably on most call paths we'll be OK - if a process is in the middle >> > of a file truncate, holdin a file* ref which holds an inode ref then >> > nobody will be unmounting that fs and hence nobody will be unloading >> > that module. >> > >> > However on the random_code->alloc_page->vmscan->releasepage path, none >> > of that applies. >> >> Just out of interest, what ensures that the mapping is still around for >> the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()? > > Nothing, afacit. No, we're good. Module unload has to go through a "stop_machine()" cycle, and that in turn requires an idle period for everything. And just a preemption reschedule isn't enough for that. So what is sufficient is that - we had the page locked and on the mapping This implies that we had an inode reference to the module, and the page lock means that the inode reference cannot go away (because it will involve invalidate-pages etc) - we're not sleeping after __remove_mapping, so unload can't happen afterwards. A _lot_ of the module races depend on that latter thing. We have almost no cases that are strictly about actual reference counts etc. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 22:24 ` Linus Torvalds @ 2010-12-01 22:38 ` Andrew Morton 2010-12-01 22:47 ` Trond Myklebust 2010-12-01 23:31 ` Linus Torvalds 2010-12-01 22:43 ` Hugh Dickins 1 sibling, 2 replies; 70+ messages in thread From: Andrew Morton @ 2010-12-01 22:38 UTC (permalink / raw) To: Linus Torvalds Cc: Trond Myklebust, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 1 Dec 2010 14:24:42 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Dec 1, 2010 at 2:13 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > > On Wed, 01 Dec 2010 16:51:12 -0500 > > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > > >> On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote: > >> > Probably on most call paths we'll be OK - if a process is in the middle > >> > of a file truncate, holdin a file* ref which holds an inode ref then > >> > nobody will be unmounting that fs and hence nobody will be unloading > >> > that module. > >> > > >> > However on the random_code->alloc_page->vmscan->releasepage path, none > >> > of that applies. > >> > >> Just out of interest, what ensures that the mapping is still around for > >> the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()? > > > > Nothing, afacit. > > No, we're good. > > Module unload has to go through a "stop_machine()" cycle, and that in > turn requires an idle period for everything. And just a preemption > reschedule isn't enough for that. > > So what is sufficient is that > > - we had the page locked and on the mapping > > This implies that we had an inode reference to the module, and the > page lock means that the inode reference cannot go away (because it > will involve invalidate-pages etc) > > - we're not sleeping after __remove_mapping, so unload can't happen afterwards. > > A _lot_ of the module races depend on that latter thing. We have > almost no cases that are strictly about actual reference counts etc. > OK, the stop_machine() plugs a lot of potential race-vs-module-unload things. But Trond is referring to races against vmscan inode reclaim, unmount, etc. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 22:38 ` Andrew Morton @ 2010-12-01 22:47 ` Trond Myklebust 2010-12-01 23:21 ` Trond Myklebust 2010-12-01 23:31 ` Linus Torvalds 1 sibling, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 22:47 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 2010-12-01 at 14:38 -0800, Andrew Morton wrote: > On Wed, 1 Dec 2010 14:24:42 -0800 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Wed, Dec 1, 2010 at 2:13 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Wed, 01 Dec 2010 16:51:12 -0500 > > > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > > > > >> On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote: > > >> > Probably on most call paths we'll be OK - if a process is in the middle > > >> > of a file truncate, holdin a file* ref which holds an inode ref then > > >> > nobody will be unmounting that fs and hence nobody will be unloading > > >> > that module. > > >> > > > >> > However on the random_code->alloc_page->vmscan->releasepage path, none > > >> > of that applies. > > >> > > >> Just out of interest, what ensures that the mapping is still around for > > >> the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()? > > > > > > Nothing, afacit. > > > > No, we're good. > > > > Module unload has to go through a "stop_machine()" cycle, and that in > > turn requires an idle period for everything. And just a preemption > > reschedule isn't enough for that. > > > > So what is sufficient is that > > > > - we had the page locked and on the mapping > > > > This implies that we had an inode reference to the module, and the > > page lock means that the inode reference cannot go away (because it > > will involve invalidate-pages etc) > > > > - we're not sleeping after __remove_mapping, so unload can't happen afterwards. > > > > A _lot_ of the module races depend on that latter thing. We have > > almost no cases that are strictly about actual reference counts etc. > > > > OK, the stop_machine() plugs a lot of potential race-vs-module-unload > things. But Trond is referring to races against vmscan inode reclaim, > unmount, etc. > Yes, that was my question. However, Linus' explanation appears to answer one of Hugh's objections: we can apparently use a preempt-disable() in order to ensure that the module containing the mapping->a_ops is not unloaded until after the ->freepage() call is complete. That would imply that ->freepage() cannot sleep, but I don't think that is too nasty a restriction. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 22:47 ` Trond Myklebust @ 2010-12-01 23:21 ` Trond Myklebust 2010-12-01 23:46 ` Andrew Morton 0 siblings, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 23:21 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 2010-12-01 at 17:47 -0500, Trond Myklebust wrote: > However, Linus' explanation appears to answer one of Hugh's objections: > we can apparently use a preempt-disable() in order to ensure that the > module containing the mapping->a_ops is not unloaded until after the > ->freepage() call is complete. > > That would imply that ->freepage() cannot sleep, but I don't think that > is too nasty a restriction. Revised patch would be as follows... ---------------------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 23:21 ` Trond Myklebust @ 2010-12-01 23:46 ` Andrew Morton 2010-12-01 23:56 ` Trond Myklebust 0 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2010-12-01 23:46 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 01 Dec 2010 18:21:16 -0500 Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page) > clear_page_mlock(page); > remove_from_page_cache(page); > ClearPageMappedToDisk(page); > + > + if (mapping->a_ops->freepage) > + mapping->a_ops->freepage(page); > + > page_cache_release(page); /* pagecache ref */ > return 0; > } So here we're assuming that `mapping' was pinned by other means. Fair enough, although subtle. Even drop_pagecache_sb() got it right ;) > @@ -390,6 +394,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) > __remove_from_page_cache(page); > spin_unlock_irq(&mapping->tree_lock); > mem_cgroup_uncharge_cache_page(page); > + > + if (mapping->a_ops->freepage) > + mapping->a_ops->freepage(page); > + > page_cache_release(page); /* pagecache ref */ > return 1; > failed: And here. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 23:46 ` Andrew Morton @ 2010-12-01 23:56 ` Trond Myklebust 0 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 23:56 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 2010-12-01 at 15:46 -0800, Andrew Morton wrote: > On Wed, 01 Dec 2010 18:21:16 -0500 > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > > --- a/mm/truncate.c > > +++ b/mm/truncate.c > > @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page) > > clear_page_mlock(page); > > remove_from_page_cache(page); > > ClearPageMappedToDisk(page); > > + > > + if (mapping->a_ops->freepage) > > + mapping->a_ops->freepage(page); > > + > > page_cache_release(page); /* pagecache ref */ > > return 0; > > } > > So here we're assuming that `mapping' was pinned by other means. > > Fair enough, although subtle. Even drop_pagecache_sb() got it right ;) > > > @@ -390,6 +394,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) > > __remove_from_page_cache(page); > > spin_unlock_irq(&mapping->tree_lock); > > mem_cgroup_uncharge_cache_page(page); > > + > > + if (mapping->a_ops->freepage) > > + mapping->a_ops->freepage(page); > > + > > page_cache_release(page); /* pagecache ref */ > > return 1; > > failed: > > And here. Yes. Both these functions are static, and their callers are assuming that something is already pinning the underlying inode, so the above should be quite safe. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 22:38 ` Andrew Morton 2010-12-01 22:47 ` Trond Myklebust @ 2010-12-01 23:31 ` Linus Torvalds 2010-12-01 23:36 ` Andrew Morton 2010-12-01 23:43 ` Trond Myklebust 1 sibling, 2 replies; 70+ messages in thread From: Linus Torvalds @ 2010-12-01 23:31 UTC (permalink / raw) To: Andrew Morton Cc: Trond Myklebust, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, Dec 1, 2010 at 2:38 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > > OK, the stop_machine() plugs a lot of potential race-vs-module-unload > things. But Trond is referring to races against vmscan inode reclaim, > unmount, etc. So? A filesystem module cannot be unloaded while it's still mounted. And unmount doesn't succeed until all inodes are gone. And getting rid of an inode doesn't succeed until all pages associated with it are gone. And getting rid of the pages involves locking them (whether in truncate or vmscan) and removing them from all lists. Ergo: vmscan has a locked page leads to the filesystem being guaranteed to not be unmounted. And that, in turn, guarantees that the module won't be unloaded until the machine has gone through an idle cycle. It really is that simple. There's nothing subtle there. The reason spin_unlock(&mapping->tree_lock) is safe is exactly the above trivial chain of dependencies. And it's also exactly why mapping->a_ops->freepage() would also be safe. This is pretty much how all the module races are handled. Doing module ref-counts per page (or per packet in flight for things like networking) would be prohibitively expensive. There's no way we can ever do that. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 23:31 ` Linus Torvalds @ 2010-12-01 23:36 ` Andrew Morton 2010-12-02 1:05 ` Linus Torvalds 2010-12-01 23:43 ` Trond Myklebust 1 sibling, 1 reply; 70+ messages in thread From: Andrew Morton @ 2010-12-01 23:36 UTC (permalink / raw) To: Linus Torvalds Cc: Trond Myklebust, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 1 Dec 2010 15:31:11 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Dec 1, 2010 at 2:38 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > > > > OK, the stop_machine() plugs a lot of potential race-vs-module-unload > > things. __But Trond is referring to races against vmscan inode reclaim, > > unmount, etc. > > So? > > A filesystem module cannot be unloaded while it's still mounted. > > And unmount doesn't succeed until all inodes are gone. > > And getting rid of an inode doesn't succeed until all pages associated > with it are gone. > > And getting rid of the pages involves locking them (whether in > truncate or vmscan) and removing them from all lists. > > Ergo: vmscan has a locked page leads to the filesystem being > guaranteed to not be unmounted. And that, in turn, guarantees that > the module won't be unloaded until the machine has gone through an > idle cycle. The page isn't attached to the address_space any more: static int __remove_mapping(struct address_space *mapping, struct page *page) { ... __remove_from_page_cache(page); spin_unlock_irq(&mapping->tree_lock); ... } ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 23:36 ` Andrew Morton @ 2010-12-02 1:05 ` Linus Torvalds 2010-12-02 1:22 ` Andrew Morton 0 siblings, 1 reply; 70+ messages in thread From: Linus Torvalds @ 2010-12-02 1:05 UTC (permalink / raw) To: Andrew Morton Cc: Trond Myklebust, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, Dec 1, 2010 at 3:36 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 1 Dec 2010 15:31:11 -0800 > Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >> Ergo: vmscan has a locked page leads to the filesystem being >> guaranteed to not be unmounted. And that, in turn, guarantees that >> the module won't be unloaded until the machine has gone through an >> idle cycle. > > The page isn't attached to the address_space any more: Did you even read the email? Here, let me quote the important parts: "module won't be unloaded until the machine has gone through an idle cycle" "This is pretty much how all the module races are handled. Doing module ref-counts per page (or per packet in flight for things like networking) would be prohibitively expensive." IOW, the whole "stop_machine()" part is fundamental. That whole "module unload won't happen until we've gone through an idle cycle" is EXACTLY why we don't need to have the page attached or ref-counted - we're still safe. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-02 1:05 ` Linus Torvalds @ 2010-12-02 1:22 ` Andrew Morton 2010-12-02 1:42 ` Linus Torvalds 0 siblings, 1 reply; 70+ messages in thread From: Andrew Morton @ 2010-12-02 1:22 UTC (permalink / raw) To: Linus Torvalds Cc: Trond Myklebust, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 1 Dec 2010 17:05:43 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Dec 1, 2010 at 3:36 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > > On Wed, 1 Dec 2010 15:31:11 -0800 > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> > >> Ergo: vmscan has a locked page leads to the filesystem being > >> guaranteed to not be unmounted. __And that, in turn, guarantees that > >> the module won't be unloaded until the machine has gone through an > >> idle cycle. > > > > The page isn't attached to the address_space any more: > > Did you even read the email? We stopped talking about module unload hours ago. That matter is settled. What we're talking about is races against memory reclaim, unmount, etc. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-02 1:22 ` Andrew Morton @ 2010-12-02 1:42 ` Linus Torvalds 2010-12-02 2:05 ` Andrew Morton ` (5 more replies) 0 siblings, 6 replies; 70+ messages in thread From: Linus Torvalds @ 2010-12-02 1:42 UTC (permalink / raw) To: Andrew Morton Cc: Trond Myklebust, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, Dec 1, 2010 at 5:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > > What we're talking about is races against memory reclaim, unmount, etc. Ahh. Those I can believe in. Although I think they'd almost incidentally be fixed by making inode freeing (which is where the 'struct address_space' is embedded) RCU-safe, which we're going to do anyway in 38. Then we could make the vmscan code just be a rcu-read section. Of course, I do think the race is basically impossible to hit in practice regardless. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-02 1:42 ` Linus Torvalds @ 2010-12-02 2:05 ` Andrew Morton 2010-12-02 3:08 ` [PATCH v3 0/3] Fix more NFS readdir regressions Trond Myklebust ` (4 subsequent siblings) 5 siblings, 0 replies; 70+ messages in thread From: Andrew Morton @ 2010-12-02 2:05 UTC (permalink / raw) To: Linus Torvalds Cc: Trond Myklebust, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 1 Dec 2010 17:42:08 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Dec 1, 2010 at 5:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > > > > What we're talking about is races against memory reclaim, unmount, etc. > > Ahh. Those I can believe in. Although I think they'd almost > incidentally be fixed by making inode freeing (which is where the > 'struct address_space' is embedded) RCU-safe, which we're going to do > anyway in 38. Then we could make the vmscan code just be a rcu-read > section. I didn't know that aspect of it. It will be nice to plug this race - it's been there for so long because nobody was able to think of an acceptable way of fixing it by direct means (synchronous locking, refcounting, etc). Taking a ref on the inode doesn't work, because we can't run iput_final() in direct-reclaim contexts (lock ordering snafus). vmscan is the problematic path - I _think_ all other code paths which remove pagecache have an inode ref. But this assumes that inode->i_mapping points at inode->i_data! Need to think about the situation where it points at a different inode's i_data - in that case these callers may have a ref on the wrong inode. > Of course, I do think the race is basically impossible to hit in > practice regardless. Actually I was able to hit the race back in late 2.5 or thereabouts. Really massive memory pressure caused vmscan->icache_shrinker to free the inode/address_space while another CPU in vmscan was playing with the address_space. That was quite a debugging session ;) ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v3 0/3] Fix more NFS readdir regressions 2010-12-02 1:42 ` Linus Torvalds 2010-12-02 2:05 ` Andrew Morton @ 2010-12-02 3:08 ` Trond Myklebust 2010-12-02 3:08 ` [PATCH v3 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust ` (3 subsequent siblings) 5 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-02 3:08 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler Cc: Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro > OK. The first patch in this series fixes the regression reported by > Nick Bowler when I apply it to my setup. > The 2 remaining patches are needed in order to ensure that the > VM doesn't free the readdir page cache page while we're trying to > read from it, and to ensure that we don't leak memory... > Linus, please don't apply these patches quite yet. I'd like to continue > tests for a couple more days before I send you the pull request. v2 fixes the following issues: - Fix up the cookie usage in uncached_readdir() - Changelog fixups for the second patch - .releasepage should use kmap_atomic() so that it can be called from all direct reclaim contexts. - Ensure that .releasepage also clears Pg_uptodate - Set/clear the Pg_private flag to ensure .releasepage gets called when appropriate. - Add a .invalidatepage to ensure truncate_inode_pages() works correctly - Ensure that the anonymous page that is generated by uncached_readdir() doesn't leak memory. v3: - add the freepage() address space operation. - Dump the page locking - rewrite patch 'Fix a memory leak in nfs_readdir' to work with the new ->freepage() callback. Linus Torvalds (1): Call the filesystem back whenever a page is removed from the page cache Trond Myklebust (2): NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler NFS: Fix a memory leak in nfs_readdir Documentation/filesystems/Locking | 7 ++++++- Documentation/filesystems/vfs.txt | 7 +++++++ fs/nfs/dir.c | 28 +++++++++++++++++----------- fs/nfs/inode.c | 1 + include/linux/fs.h | 1 + include/linux/nfs_fs.h | 1 + mm/truncate.c | 8 ++++++++ mm/vmscan.c | 11 +++++++++++ 8 files changed, 52 insertions(+), 12 deletions(-) -- 1.7.3.2 ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v3 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler 2010-12-02 1:42 ` Linus Torvalds 2010-12-02 2:05 ` Andrew Morton 2010-12-02 3:08 ` [PATCH v3 0/3] Fix more NFS readdir regressions Trond Myklebust @ 2010-12-02 3:08 ` Trond Myklebust 2010-12-02 3:08 ` [PATCH v3 2/3] Call the filesystem back whenever a page is removed from the page cache Trond Myklebust ` (2 subsequent siblings) 5 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-02 3:08 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler Cc: Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro We need to use the cookie from the previous array entry, not the actual cookie that we are searching for (except for the case of uncached_readdir). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index f0a384e..e03537f 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -178,6 +178,7 @@ typedef struct { struct page *page; unsigned long page_index; u64 *dir_cookie; + u64 last_cookie; loff_t current_index; decode_dirent_t decode; @@ -344,6 +345,8 @@ int nfs_readdir_search_array(nfs_readdir_descriptor_t *desc) else status = nfs_readdir_search_for_cookie(array, desc); + if (status == -EAGAIN) + desc->last_cookie = array->last_cookie; nfs_readdir_release_array(desc->page); out: return status; @@ -563,7 +566,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, unsigned int array_size = ARRAY_SIZE(pages); entry.prev_cookie = 0; - entry.cookie = *desc->dir_cookie; + entry.cookie = desc->last_cookie; entry.eof = 0; entry.fh = nfs_alloc_fhandle(); entry.fattr = nfs_alloc_fattr(); @@ -672,8 +675,10 @@ int readdir_search_pagecache(nfs_readdir_descriptor_t *desc) { int res; - if (desc->page_index == 0) + if (desc->page_index == 0) { desc->current_index = 0; + desc->last_cookie = 0; + } while (1) { res = find_cache_page(desc); if (res != -EAGAIN) @@ -764,6 +769,7 @@ int uncached_readdir(nfs_readdir_descriptor_t *desc, void *dirent, } desc->page_index = 0; + desc->last_cookie = *desc->dir_cookie; desc->page = page; status = nfs_readdir_xdr_to_array(desc, page, inode); -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* [PATCH v3 2/3] Call the filesystem back whenever a page is removed from the page cache 2010-12-02 1:42 ` Linus Torvalds ` (2 preceding siblings ...) 2010-12-02 3:08 ` [PATCH v3 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust @ 2010-12-02 3:08 ` Trond Myklebust 2010-12-02 3:34 ` Hugh Dickins 2010-12-02 3:08 ` [PATCH v3 " Trond Myklebust 2010-12-03 9:12 ` [PATCH v2 " Nick Piggin 5 siblings, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-12-02 3:08 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler Cc: Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro From: Linus Torvalds <torvalds@linux-foundation.org> NFS needs to be able to release objects that are stored in the page cache once the page itself is no longer visible from the page cache. This patch adds a callback to the address space operations that allows filesystems to perform page cleanups once the page has been removed from the page cache. Original patch by: Linus Torvalds <torvalds@linux-foundation.org> [trondmy: cover the cases of invalidate_inode_pages2() and truncate_inode_pages()] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- Documentation/filesystems/Locking | 7 ++++++- Documentation/filesystems/vfs.txt | 7 +++++++ include/linux/fs.h | 1 + mm/truncate.c | 8 ++++++++ mm/vmscan.c | 11 +++++++++++ 5 files changed, 33 insertions(+), 1 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index a91f308..b6426f1 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -173,12 +173,13 @@ prototypes: sector_t (*bmap)(struct address_space *, sector_t); int (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, int); + void (*freepage)(struct page *); int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long nr_segs); int (*launder_page) (struct page *); locking rules: - All except set_page_dirty may block + All except set_page_dirty and freepage may block BKL PageLocked(page) i_mutex writepage: no yes, unlocks (see below) @@ -193,6 +194,7 @@ perform_write: no n/a yes bmap: no invalidatepage: no yes releasepage: no yes +freepage: no yes direct_IO: no launder_page: no yes @@ -288,6 +290,9 @@ buffers from the page in preparation for freeing it. It returns zero to indicate that the buffers are (or may be) freeable. If ->releasepage is zero, the kernel assumes that the fs has no private interest in the buffers. + ->freepage() is called when the kernel is done dropping the page +from the page cache. + ->launder_page() may be called prior to releasing a page if it is still found to be dirty. It returns zero if the page was successfully cleaned, or an error value if not. Note that in order to prevent the page diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index ed7e5ef..3b14a55 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -534,6 +534,7 @@ struct address_space_operations { sector_t (*bmap)(struct address_space *, sector_t); int (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, int); + void (*freepage)(struct page *); ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long nr_segs); struct page* (*get_xip_page)(struct address_space *, sector_t, @@ -679,6 +680,12 @@ struct address_space_operations { need to ensure this. Possibly it can clear the PageUptodate bit if it cannot free private data yet. + freepage: freepage is called once the page is no longer visible in + the page cache in order to allow the cleanup of any private + data. Since it may be called by the memory reclaimer, it + should not assume that the original address_space mapping still + exists, and it should not block. + direct_IO: called by the generic read/write routines to perform direct_IO - that is IO requests which bypass the page cache and transfer data directly between the storage and the diff --git a/include/linux/fs.h b/include/linux/fs.h index c9e06cc..090f0ea 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -602,6 +602,7 @@ struct address_space_operations { sector_t (*bmap)(struct address_space *, sector_t); void (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, gfp_t); + void (*freepage)(struct page *); ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long nr_segs); int (*get_xip_mem)(struct address_space *, pgoff_t, int, diff --git a/mm/truncate.c b/mm/truncate.c index ba887bf..76ab2a8 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page) clear_page_mlock(page); remove_from_page_cache(page); ClearPageMappedToDisk(page); + + if (mapping->a_ops->freepage) + mapping->a_ops->freepage(page); + page_cache_release(page); /* pagecache ref */ return 0; } @@ -390,6 +394,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) __remove_from_page_cache(page); spin_unlock_irq(&mapping->tree_lock); mem_cgroup_uncharge_cache_page(page); + + if (mapping->a_ops->freepage) + mapping->a_ops->freepage(page); + page_cache_release(page); /* pagecache ref */ return 1; failed: diff --git a/mm/vmscan.c b/mm/vmscan.c index d31d7ce..9e218e7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -454,6 +454,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); + preempt_disable(); spin_lock_irq(&mapping->tree_lock); /* * The non racy check for a busy page. @@ -492,10 +493,19 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) swp_entry_t swap = { .val = page_private(page) }; __delete_from_swap_cache(page); spin_unlock_irq(&mapping->tree_lock); + preempt_enable(); swapcache_free(swap, page); } else { + void (*freepage)(struct page *); + + freepage = mapping->a_ops->freepage; + __remove_from_page_cache(page); spin_unlock_irq(&mapping->tree_lock); + if (freepage != NULL) + freepage(page); + preempt_enable(); + mem_cgroup_uncharge_cache_page(page); } @@ -503,6 +513,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) cannot_free: spin_unlock_irq(&mapping->tree_lock); + preempt_enable(); return 0; } -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* Re: [PATCH v3 2/3] Call the filesystem back whenever a page is removed from the page cache 2010-12-02 3:08 ` [PATCH v3 2/3] Call the filesystem back whenever a page is removed from the page cache Trond Myklebust @ 2010-12-02 3:34 ` Hugh Dickins 2010-12-02 3:53 ` Trond Myklebust 0 siblings, 1 reply; 70+ messages in thread From: Hugh Dickins @ 2010-12-02 3:34 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 1 Dec 2010, Trond Myklebust wrote: > > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page) > clear_page_mlock(page); > remove_from_page_cache(page); > ClearPageMappedToDisk(page); > + > + if (mapping->a_ops->freepage) > + mapping->a_ops->freepage(page); > + > page_cache_release(page); /* pagecache ref */ > return 0; > } I think Linus recommended that one be done in remove_from_page_cache() to catch all instances: did that get overruled later for some reason? > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -454,6 +454,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) > BUG_ON(!PageLocked(page)); > BUG_ON(mapping != page_mapping(page)); > > + preempt_disable(); > spin_lock_irq(&mapping->tree_lock); > /* > * The non racy check for a busy page. > @@ -492,10 +493,19 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) > swp_entry_t swap = { .val = page_private(page) }; > __delete_from_swap_cache(page); > spin_unlock_irq(&mapping->tree_lock); > + preempt_enable(); > swapcache_free(swap, page); > } else { > + void (*freepage)(struct page *); > + > + freepage = mapping->a_ops->freepage; > + > __remove_from_page_cache(page); > spin_unlock_irq(&mapping->tree_lock); > + if (freepage != NULL) > + freepage(page); > + preempt_enable(); > + > mem_cgroup_uncharge_cache_page(page); > } > > @@ -503,6 +513,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) > > cannot_free: > spin_unlock_irq(&mapping->tree_lock); > + preempt_enable(); > return 0; > } I took his "stop_machine()" explanation ("an idle period for everything. And just a preemption reschedule isn't enough for that") to imply that there's no need for your preempt_disable/preempt_enable there: they don't add anything to the module unload case, and they don't help the spin_unlock_irq issue (and you're already being rightly careful to note freepage in advance). But maybe I misunderstood. Hugh ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v3 2/3] Call the filesystem back whenever a page is removed from the page cache 2010-12-02 3:34 ` Hugh Dickins @ 2010-12-02 3:53 ` Trond Myklebust 2010-12-02 3:58 ` Linus Torvalds 0 siblings, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-12-02 3:53 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 2010-12-01 at 19:34 -0800, Hugh Dickins wrote: > On Wed, 1 Dec 2010, Trond Myklebust wrote: > > > > --- a/mm/truncate.c > > +++ b/mm/truncate.c > > @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page) > > clear_page_mlock(page); > > remove_from_page_cache(page); > > ClearPageMappedToDisk(page); > > + > > + if (mapping->a_ops->freepage) > > + mapping->a_ops->freepage(page); > > + > > page_cache_release(page); /* pagecache ref */ > > return 0; > > } > > I think Linus recommended that one be done in remove_from_page_cache() > to catch all instances: did that get overruled later for some reason? I'm fine with doing that as long as everyone is happy that there is no chance of races. I was a bit nervous given the discussion that ensued from the vmscan case. > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -454,6 +454,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) > > BUG_ON(!PageLocked(page)); > > BUG_ON(mapping != page_mapping(page)); > > > > + preempt_disable(); > > spin_lock_irq(&mapping->tree_lock); > > /* > > * The non racy check for a busy page. > > @@ -492,10 +493,19 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) > > swp_entry_t swap = { .val = page_private(page) }; > > __delete_from_swap_cache(page); > > spin_unlock_irq(&mapping->tree_lock); > > + preempt_enable(); > > swapcache_free(swap, page); > > } else { > > + void (*freepage)(struct page *); > > + > > + freepage = mapping->a_ops->freepage; > > + > > __remove_from_page_cache(page); > > spin_unlock_irq(&mapping->tree_lock); > > + if (freepage != NULL) > > + freepage(page); > > + preempt_enable(); > > + > > mem_cgroup_uncharge_cache_page(page); > > } > > > > @@ -503,6 +513,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) > > > > cannot_free: > > spin_unlock_irq(&mapping->tree_lock); > > + preempt_enable(); > > return 0; > > } > > I took his "stop_machine()" explanation ("an idle period for everything. > And just a preemption reschedule isn't enough for that") to imply that > there's no need for your preempt_disable/preempt_enable there: they > don't add anything to the module unload case, and they don't help the > spin_unlock_irq issue (and you're already being rightly careful to note > freepage in advance). > > But maybe I misunderstood. Again, I was being cautious (I freely admit to not having studied stop_machine()). If nobody disagrees with your interpretation, then I'm very happy to drop the preempt disable/enable crud. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v3 2/3] Call the filesystem back whenever a page is removed from the page cache 2010-12-02 3:53 ` Trond Myklebust @ 2010-12-02 3:58 ` Linus Torvalds 2010-12-06 16:59 ` [PATCH v4 0/3] Fix more NFS readdir regressions Trond Myklebust ` (3 more replies) 0 siblings, 4 replies; 70+ messages in thread From: Linus Torvalds @ 2010-12-02 3:58 UTC (permalink / raw) To: Trond Myklebust Cc: Hugh Dickins, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro On Wed, Dec 1, 2010 at 7:53 PM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > Again, I was being cautious (I freely admit to not having studied > stop_machine()). If nobody disagrees with your interpretation, then I'm > very happy to drop the preempt disable/enable crud. Just remove it. It won't matter for the stop_machine case, and the other case that Andrew pointed out (just unmap with memory pressure) is an independent thing that doesn't do stop_machine anyway. In the next merge window we'll see the whole RCU name lookup (let's see if people can agree on it, or whether I'll just have to do an executive decision and push it through even without consensus), which then rcu-delays inode freeing, and that will fix it for real. When that happens, doing a "rcu_read_lock()" in vmscan.c before the removal of the page from the mapping, and then unlocking after the callback will be a real fix, but it will require the rcu release to be that real fix anyway, so doing something else now will just confuse things. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v4 0/3] Fix more NFS readdir regressions 2010-12-02 3:58 ` Linus Torvalds @ 2010-12-06 16:59 ` Trond Myklebust 2010-12-06 16:59 ` [PATCH v4 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust ` (2 subsequent siblings) 3 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-06 16:59 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler Cc: Simon Kirby, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro > OK. The first patch in this series fixes the regression reported by > Nick Bowler when I apply it to my setup. > The 2 remaining patches are needed in order to ensure that the > VM doesn't free the readdir page cache page while we're trying to > read from it, and to ensure that we don't leak memory... > Linus, please don't apply these patches quite yet. I'd like to continue > tests for a couple more days before I send you the pull request. v2 fixes the following issues: - Fix up the cookie usage in uncached_readdir() - Changelog fixups for the second patch - .releasepage should use kmap_atomic() so that it can be called from all direct reclaim contexts. - Ensure that .releasepage also clears Pg_uptodate - Set/clear the Pg_private flag to ensure .releasepage gets called when appropriate. - Add a .invalidatepage to ensure truncate_inode_pages() works correctly - Ensure that the anonymous page that is generated by uncached_readdir() doesn't leak memory. v3: - add the freepage() address space operation. - Dump the page locking - rewrite patch 'Fix a memory leak in nfs_readdir' to work with the new ->freepage() callback. v4: - Remove the preempt disable/enable protection in __remove_mapping - Add a freepage() call to remove_from_page_cache() instead of open-coding it in truncate_complete_page(). Cheers Trond Linus Torvalds (1): Call the filesystem back whenever a page is removed from the page cache Trond Myklebust (2): NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler NFS: Fix a memory leak in nfs_readdir Documentation/filesystems/Locking | 7 ++++++- Documentation/filesystems/vfs.txt | 7 +++++++ fs/nfs/dir.c | 28 +++++++++++++++++----------- fs/nfs/inode.c | 1 + include/linux/fs.h | 1 + include/linux/nfs_fs.h | 1 + mm/filemap.c | 5 +++++ mm/truncate.c | 4 ++++ mm/vmscan.c | 7 +++++++ 9 files changed, 49 insertions(+), 12 deletions(-) -- 1.7.3.2 ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v4 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler 2010-12-02 3:58 ` Linus Torvalds 2010-12-06 16:59 ` [PATCH v4 0/3] Fix more NFS readdir regressions Trond Myklebust @ 2010-12-06 16:59 ` Trond Myklebust 2010-12-06 16:59 ` [PATCH v4 2/3] Call the filesystem back whenever a page is removed from the page cache Trond Myklebust 2010-12-06 16:59 ` [PATCH v4 3/3] NFS: Fix a memory leak in nfs_readdir Trond Myklebust 3 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-06 16:59 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler Cc: Simon Kirby, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro We need to use the cookie from the previous array entry, not the actual cookie that we are searching for (except for the case of uncached_readdir). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index f0a384e..e03537f 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -178,6 +178,7 @@ typedef struct { struct page *page; unsigned long page_index; u64 *dir_cookie; + u64 last_cookie; loff_t current_index; decode_dirent_t decode; @@ -344,6 +345,8 @@ int nfs_readdir_search_array(nfs_readdir_descriptor_t *desc) else status = nfs_readdir_search_for_cookie(array, desc); + if (status == -EAGAIN) + desc->last_cookie = array->last_cookie; nfs_readdir_release_array(desc->page); out: return status; @@ -563,7 +566,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, unsigned int array_size = ARRAY_SIZE(pages); entry.prev_cookie = 0; - entry.cookie = *desc->dir_cookie; + entry.cookie = desc->last_cookie; entry.eof = 0; entry.fh = nfs_alloc_fhandle(); entry.fattr = nfs_alloc_fattr(); @@ -672,8 +675,10 @@ int readdir_search_pagecache(nfs_readdir_descriptor_t *desc) { int res; - if (desc->page_index == 0) + if (desc->page_index == 0) { desc->current_index = 0; + desc->last_cookie = 0; + } while (1) { res = find_cache_page(desc); if (res != -EAGAIN) @@ -764,6 +769,7 @@ int uncached_readdir(nfs_readdir_descriptor_t *desc, void *dirent, } desc->page_index = 0; + desc->last_cookie = *desc->dir_cookie; desc->page = page; status = nfs_readdir_xdr_to_array(desc, page, inode); -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* [PATCH v4 2/3] Call the filesystem back whenever a page is removed from the page cache 2010-12-02 3:58 ` Linus Torvalds 2010-12-06 16:59 ` [PATCH v4 0/3] Fix more NFS readdir regressions Trond Myklebust 2010-12-06 16:59 ` [PATCH v4 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust @ 2010-12-06 16:59 ` Trond Myklebust 2010-12-07 7:08 ` Nick Piggin 2010-12-06 16:59 ` [PATCH v4 3/3] NFS: Fix a memory leak in nfs_readdir Trond Myklebust 3 siblings, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-12-06 16:59 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler Cc: Simon Kirby, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro From: Linus Torvalds <torvalds@linux-foundation.org> NFS needs to be able to release objects that are stored in the page cache once the page itself is no longer visible from the page cache. This patch adds a callback to the address space operations that allows filesystems to perform page cleanups once the page has been removed from the page cache. Original patch by: Linus Torvalds <torvalds@linux-foundation.org> [trondmy: cover the cases of invalidate_inode_pages2() and truncate_inode_pages()] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- Documentation/filesystems/Locking | 7 ++++++- Documentation/filesystems/vfs.txt | 7 +++++++ include/linux/fs.h | 1 + mm/filemap.c | 5 +++++ mm/truncate.c | 4 ++++ mm/vmscan.c | 7 +++++++ 6 files changed, 30 insertions(+), 1 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index a91f308..b6426f1 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -173,12 +173,13 @@ prototypes: sector_t (*bmap)(struct address_space *, sector_t); int (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, int); + void (*freepage)(struct page *); int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long nr_segs); int (*launder_page) (struct page *); locking rules: - All except set_page_dirty may block + All except set_page_dirty and freepage may block BKL PageLocked(page) i_mutex writepage: no yes, unlocks (see below) @@ -193,6 +194,7 @@ perform_write: no n/a yes bmap: no invalidatepage: no yes releasepage: no yes +freepage: no yes direct_IO: no launder_page: no yes @@ -288,6 +290,9 @@ buffers from the page in preparation for freeing it. It returns zero to indicate that the buffers are (or may be) freeable. If ->releasepage is zero, the kernel assumes that the fs has no private interest in the buffers. + ->freepage() is called when the kernel is done dropping the page +from the page cache. + ->launder_page() may be called prior to releasing a page if it is still found to be dirty. It returns zero if the page was successfully cleaned, or an error value if not. Note that in order to prevent the page diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index ed7e5ef..3b14a55 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -534,6 +534,7 @@ struct address_space_operations { sector_t (*bmap)(struct address_space *, sector_t); int (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, int); + void (*freepage)(struct page *); ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long nr_segs); struct page* (*get_xip_page)(struct address_space *, sector_t, @@ -679,6 +680,12 @@ struct address_space_operations { need to ensure this. Possibly it can clear the PageUptodate bit if it cannot free private data yet. + freepage: freepage is called once the page is no longer visible in + the page cache in order to allow the cleanup of any private + data. Since it may be called by the memory reclaimer, it + should not assume that the original address_space mapping still + exists, and it should not block. + direct_IO: called by the generic read/write routines to perform direct_IO - that is IO requests which bypass the page cache and transfer data directly between the storage and the diff --git a/include/linux/fs.h b/include/linux/fs.h index c9e06cc..090f0ea 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -602,6 +602,7 @@ struct address_space_operations { sector_t (*bmap)(struct address_space *, sector_t); void (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, gfp_t); + void (*freepage)(struct page *); ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long nr_segs); int (*get_xip_mem)(struct address_space *, pgoff_t, int, diff --git a/mm/filemap.c b/mm/filemap.c index ea89840..6b9aee2 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -143,13 +143,18 @@ void __remove_from_page_cache(struct page *page) void remove_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; + void (*freepage)(struct page *); BUG_ON(!PageLocked(page)); + freepage = mapping->a_ops->freepage; spin_lock_irq(&mapping->tree_lock); __remove_from_page_cache(page); spin_unlock_irq(&mapping->tree_lock); mem_cgroup_uncharge_cache_page(page); + + if (freepage) + freepage(page); } EXPORT_SYMBOL(remove_from_page_cache); diff --git a/mm/truncate.c b/mm/truncate.c index ba887bf..3c2d5dd 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -390,6 +390,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) __remove_from_page_cache(page); spin_unlock_irq(&mapping->tree_lock); mem_cgroup_uncharge_cache_page(page); + + if (mapping->a_ops->freepage) + mapping->a_ops->freepage(page); + page_cache_release(page); /* pagecache ref */ return 1; failed: diff --git a/mm/vmscan.c b/mm/vmscan.c index d31d7ce..9ca587c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -494,9 +494,16 @@ static int __remove_mapping(struct address_space *mapping, struct page *page) spin_unlock_irq(&mapping->tree_lock); swapcache_free(swap, page); } else { + void (*freepage)(struct page *); + + freepage = mapping->a_ops->freepage; + __remove_from_page_cache(page); spin_unlock_irq(&mapping->tree_lock); mem_cgroup_uncharge_cache_page(page); + + if (freepage != NULL) + freepage(page); } return 1; -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* Re: [PATCH v4 2/3] Call the filesystem back whenever a page is removed from the page cache 2010-12-06 16:59 ` [PATCH v4 2/3] Call the filesystem back whenever a page is removed from the page cache Trond Myklebust @ 2010-12-07 7:08 ` Nick Piggin 0 siblings, 0 replies; 70+ messages in thread From: Nick Piggin @ 2010-12-07 7:08 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Nick Bowler, Simon Kirby, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro On Mon, Dec 06, 2010 at 11:59:07AM -0500, Trond Myklebust wrote: > From: Linus Torvalds <torvalds@linux-foundation.org> > > NFS needs to be able to release objects that are stored in the page > cache once the page itself is no longer visible from the page cache. > > This patch adds a callback to the address space operations that allows > filesystems to perform page cleanups once the page has been removed > from the page cache. > > Original patch by: Linus Torvalds <torvalds@linux-foundation.org> > [trondmy: cover the cases of invalidate_inode_pages2() and > truncate_inode_pages()] > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> > --- > Documentation/filesystems/Locking | 7 ++++++- > Documentation/filesystems/vfs.txt | 7 +++++++ > include/linux/fs.h | 1 + > mm/filemap.c | 5 +++++ > mm/truncate.c | 4 ++++ > mm/vmscan.c | 7 +++++++ > 6 files changed, 30 insertions(+), 1 deletions(-) > > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking > index a91f308..b6426f1 100644 > --- a/Documentation/filesystems/Locking > +++ b/Documentation/filesystems/Locking > @@ -173,12 +173,13 @@ prototypes: > sector_t (*bmap)(struct address_space *, sector_t); > int (*invalidatepage) (struct page *, unsigned long); > int (*releasepage) (struct page *, int); > + void (*freepage)(struct page *); > int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > int (*launder_page) (struct page *); > > locking rules: > - All except set_page_dirty may block > + All except set_page_dirty and freepage may block > > BKL PageLocked(page) i_mutex > writepage: no yes, unlocks (see below) > @@ -193,6 +194,7 @@ perform_write: no n/a yes > bmap: no > invalidatepage: no yes > releasepage: no yes > +freepage: no yes > direct_IO: no > launder_page: no yes > > @@ -288,6 +290,9 @@ buffers from the page in preparation for freeing it. It returns zero to > indicate that the buffers are (or may be) freeable. If ->releasepage is zero, > the kernel assumes that the fs has no private interest in the buffers. > > + ->freepage() is called when the kernel is done dropping the page > +from the page cache. > + > ->launder_page() may be called prior to releasing a page if > it is still found to be dirty. It returns zero if the page was successfully > cleaned, or an error value if not. Note that in order to prevent the page > diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt > index ed7e5ef..3b14a55 100644 > --- a/Documentation/filesystems/vfs.txt > +++ b/Documentation/filesystems/vfs.txt > @@ -534,6 +534,7 @@ struct address_space_operations { > sector_t (*bmap)(struct address_space *, sector_t); > int (*invalidatepage) (struct page *, unsigned long); > int (*releasepage) (struct page *, int); > + void (*freepage)(struct page *); > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > struct page* (*get_xip_page)(struct address_space *, sector_t, > @@ -679,6 +680,12 @@ struct address_space_operations { > need to ensure this. Possibly it can clear the PageUptodate > bit if it cannot free private data yet. > > + freepage: freepage is called once the page is no longer visible in > + the page cache in order to allow the cleanup of any private > + data. Since it may be called by the memory reclaimer, it > + should not assume that the original address_space mapping still > + exists, and it should not block. Of course we still have bugs in this regard, without inode RCU and filesystem deregistration RCU, but when those things are implemented for RCU path-walk, this section should be updated somewhat, and we'll have to look at RCU protecting the final mapping manipulations after a page is removed from pagecache. But I'll help work on that after RCU inodes / filesystems is merged. > + > direct_IO: called by the generic read/write routines to perform > direct_IO - that is IO requests which bypass the page cache > and transfer data directly between the storage and the > diff --git a/include/linux/fs.h b/include/linux/fs.h > index c9e06cc..090f0ea 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -602,6 +602,7 @@ struct address_space_operations { > sector_t (*bmap)(struct address_space *, sector_t); > void (*invalidatepage) (struct page *, unsigned long); > int (*releasepage) (struct page *, gfp_t); > + void (*freepage)(struct page *); > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > int (*get_xip_mem)(struct address_space *, pgoff_t, int, > diff --git a/mm/filemap.c b/mm/filemap.c > index ea89840..6b9aee2 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -143,13 +143,18 @@ void __remove_from_page_cache(struct page *page) > void remove_from_page_cache(struct page *page) > { > struct address_space *mapping = page->mapping; > + void (*freepage)(struct page *); > > BUG_ON(!PageLocked(page)); > > + freepage = mapping->a_ops->freepage; > spin_lock_irq(&mapping->tree_lock); > __remove_from_page_cache(page); > spin_unlock_irq(&mapping->tree_lock); > mem_cgroup_uncharge_cache_page(page); > + > + if (freepage) > + freepage(page); > } > EXPORT_SYMBOL(remove_from_page_cache); > > diff --git a/mm/truncate.c b/mm/truncate.c > index ba887bf..3c2d5dd 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -390,6 +390,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) > __remove_from_page_cache(page); > spin_unlock_irq(&mapping->tree_lock); > mem_cgroup_uncharge_cache_page(page); > + > + if (mapping->a_ops->freepage) > + mapping->a_ops->freepage(page); > + > page_cache_release(page); /* pagecache ref */ > return 1; > failed: The generic parts of the code look OK to me, but why is there a difference in your sequences of loading the freepage function pointer here? ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v4 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-02 3:58 ` Linus Torvalds ` (2 preceding siblings ...) 2010-12-06 16:59 ` [PATCH v4 2/3] Call the filesystem back whenever a page is removed from the page cache Trond Myklebust @ 2010-12-06 16:59 ` Trond Myklebust 3 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-06 16:59 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler Cc: Simon Kirby, Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro We need to ensure that the entries in the nfs_cache_array get cleared when the page is removed from the page cache. To do so, we use the freepage address_space operation. Change nfs_readdir_clear_array to use kmap_atomic(), so that the function can be safely called from all contexts. Finally, modify the cache_page_release helper to call nfs_readdir_clear_array directly, when dealing with an anonymous page from 'uncached_readdir'. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 18 +++++++++--------- fs/nfs/inode.c | 1 + include/linux/nfs_fs.h | 1 + 3 files changed, 11 insertions(+), 9 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index e03537f..d529e0e 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -57,7 +57,7 @@ static int nfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); static int nfs_fsync_dir(struct file *, int); static loff_t nfs_llseek_dir(struct file *, loff_t, int); -static int nfs_readdir_clear_array(struct page*, gfp_t); +static void nfs_readdir_clear_array(struct page*); const struct file_operations nfs_dir_operations = { .llseek = nfs_llseek_dir, @@ -83,8 +83,8 @@ const struct inode_operations nfs_dir_inode_operations = { .setattr = nfs_setattr, }; -const struct address_space_operations nfs_dir_addr_space_ops = { - .releasepage = nfs_readdir_clear_array, +const struct address_space_operations nfs_dir_aops = { + .freepage = nfs_readdir_clear_array, }; #ifdef CONFIG_NFS_V3 @@ -214,17 +214,15 @@ void nfs_readdir_release_array(struct page *page) * we are freeing strings created by nfs_add_to_readdir_array() */ static -int nfs_readdir_clear_array(struct page *page, gfp_t mask) +void nfs_readdir_clear_array(struct page *page) { - struct nfs_cache_array *array = nfs_readdir_get_array(page); + struct nfs_cache_array *array; int i; - if (IS_ERR(array)) - return PTR_ERR(array); + array = kmap_atomic(page, KM_USER0); for (i = 0; i < array->size; i++) kfree(array->array[i].string.name); - nfs_readdir_release_array(page); - return 0; + kunmap_atomic(array, KM_USER0); } /* @@ -639,6 +637,8 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page) static void cache_page_release(nfs_readdir_descriptor_t *desc) { + if (!desc->page->mapping) + nfs_readdir_clear_array(desc->page); page_cache_release(desc->page); desc->page = NULL; } diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 314f571..e67e31c 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -289,6 +289,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr) } else if (S_ISDIR(inode->i_mode)) { inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops; inode->i_fop = &nfs_dir_operations; + inode->i_data.a_ops = &nfs_dir_aops; if (nfs_server_capable(inode, NFS_CAP_READDIRPLUS)) set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags); /* Deal with crossing mountpoints */ diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h index c66fdb7..29d504d 100644 --- a/include/linux/nfs_fs.h +++ b/include/linux/nfs_fs.h @@ -401,6 +401,7 @@ extern const struct inode_operations nfs3_file_inode_operations; #endif /* CONFIG_NFS_V3 */ extern const struct file_operations nfs_file_operations; extern const struct address_space_operations nfs_file_aops; +extern const struct address_space_operations nfs_dir_aops; static inline struct nfs_open_context *nfs_file_open_context(struct file *filp) { -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* [PATCH v3 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-02 1:42 ` Linus Torvalds ` (3 preceding siblings ...) 2010-12-02 3:08 ` [PATCH v3 2/3] Call the filesystem back whenever a page is removed from the page cache Trond Myklebust @ 2010-12-02 3:08 ` Trond Myklebust 2010-12-03 9:12 ` [PATCH v2 " Nick Piggin 5 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-02 3:08 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler Cc: Linux Kernel Mailing List, linux-nfs, Andrew Morton, Hugh Dickins, Nick Piggin, Rik van Riel, Christoph Hellwig, Al Viro We need to ensure that the entries in the nfs_cache_array get cleared when the page is removed from the page cache. To do so, we use the freepage address_space operation. Change nfs_readdir_clear_array to use kmap_atomic(), so that the function can be safely called from all contexts. Finally, modify the cache_page_release helper to call nfs_readdir_clear_array directly, when dealing with an anonymous page from 'uncached_readdir'. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 18 +++++++++--------- fs/nfs/inode.c | 1 + include/linux/nfs_fs.h | 1 + 3 files changed, 11 insertions(+), 9 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index e03537f..d529e0e 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -57,7 +57,7 @@ static int nfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); static int nfs_fsync_dir(struct file *, int); static loff_t nfs_llseek_dir(struct file *, loff_t, int); -static int nfs_readdir_clear_array(struct page*, gfp_t); +static void nfs_readdir_clear_array(struct page*); const struct file_operations nfs_dir_operations = { .llseek = nfs_llseek_dir, @@ -83,8 +83,8 @@ const struct inode_operations nfs_dir_inode_operations = { .setattr = nfs_setattr, }; -const struct address_space_operations nfs_dir_addr_space_ops = { - .releasepage = nfs_readdir_clear_array, +const struct address_space_operations nfs_dir_aops = { + .freepage = nfs_readdir_clear_array, }; #ifdef CONFIG_NFS_V3 @@ -214,17 +214,15 @@ void nfs_readdir_release_array(struct page *page) * we are freeing strings created by nfs_add_to_readdir_array() */ static -int nfs_readdir_clear_array(struct page *page, gfp_t mask) +void nfs_readdir_clear_array(struct page *page) { - struct nfs_cache_array *array = nfs_readdir_get_array(page); + struct nfs_cache_array *array; int i; - if (IS_ERR(array)) - return PTR_ERR(array); + array = kmap_atomic(page, KM_USER0); for (i = 0; i < array->size; i++) kfree(array->array[i].string.name); - nfs_readdir_release_array(page); - return 0; + kunmap_atomic(array, KM_USER0); } /* @@ -639,6 +637,8 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page) static void cache_page_release(nfs_readdir_descriptor_t *desc) { + if (!desc->page->mapping) + nfs_readdir_clear_array(desc->page); page_cache_release(desc->page); desc->page = NULL; } diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 314f571..e67e31c 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -289,6 +289,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr) } else if (S_ISDIR(inode->i_mode)) { inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops; inode->i_fop = &nfs_dir_operations; + inode->i_data.a_ops = &nfs_dir_aops; if (nfs_server_capable(inode, NFS_CAP_READDIRPLUS)) set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags); /* Deal with crossing mountpoints */ diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h index c66fdb7..29d504d 100644 --- a/include/linux/nfs_fs.h +++ b/include/linux/nfs_fs.h @@ -401,6 +401,7 @@ extern const struct inode_operations nfs3_file_inode_operations; #endif /* CONFIG_NFS_V3 */ extern const struct file_operations nfs_file_operations; extern const struct address_space_operations nfs_file_aops; +extern const struct address_space_operations nfs_dir_aops; static inline struct nfs_open_context *nfs_file_open_context(struct file *filp) { -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-02 1:42 ` Linus Torvalds ` (4 preceding siblings ...) 2010-12-02 3:08 ` [PATCH v3 " Trond Myklebust @ 2010-12-03 9:12 ` Nick Piggin 5 siblings, 0 replies; 70+ messages in thread From: Nick Piggin @ 2010-12-03 9:12 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Trond Myklebust, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, Dec 01, 2010 at 05:42:08PM -0800, Linus Torvalds wrote: > On Wed, Dec 1, 2010 at 5:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > > > > What we're talking about is races against memory reclaim, unmount, etc. > > Ahh. Those I can believe in. Yeah, it's a tricky question. It would be solved if the inode reclaim code didn't have the nasty shortcuts for nr_pages == 0 sitting outisde the tree_lock... any time we have these kinds of optimisations checking things outside locks, we invaraibly find they have races or data races :( So doing those checks under lock would be a reasonable way to fix it if anyone cares for 2.6.37 or earlier (eg. distros). But it is another lock in inode freeing path which is nice to avoid, so let's just get it fixed with RCU in 2.6.38. > Although I think they'd almost > incidentally be fixed by making inode freeing (which is where the > 'struct address_space' is embedded) RCU-safe, which we're going to do > anyway in 38. Then we could make the vmscan code just be a rcu-read > section. Yep, good observation. Thanks, Nick ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 23:31 ` Linus Torvalds 2010-12-01 23:36 ` Andrew Morton @ 2010-12-01 23:43 ` Trond Myklebust 1 sibling, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 23:43 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Hugh Dickins, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 2010-12-01 at 15:31 -0800, Linus Torvalds wrote: > On Wed, Dec 1, 2010 at 2:38 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > > > > OK, the stop_machine() plugs a lot of potential race-vs-module-unload > > things. But Trond is referring to races against vmscan inode reclaim, > > unmount, etc. > > So? > > A filesystem module cannot be unloaded while it's still mounted. > > And unmount doesn't succeed until all inodes are gone. > > And getting rid of an inode doesn't succeed until all pages associated > with it are gone. > > And getting rid of the pages involves locking them (whether in > truncate or vmscan) and removing them from all lists. > > Ergo: vmscan has a locked page leads to the filesystem being > guaranteed to not be unmounted. And that, in turn, guarantees that > the module won't be unloaded until the machine has gone through an > idle cycle. > > It really is that simple. There's nothing subtle there. The reason > spin_unlock(&mapping->tree_lock) is safe is exactly the above trivial > chain of dependencies. And it's also exactly why > mapping->a_ops->freepage() would also be safe. > > This is pretty much how all the module races are handled. Doing module > ref-counts per page (or per packet in flight for things like > networking) would be prohibitively expensive. There's no way we can > ever do that. Although the page is locked, it may no longer be visible to the lockless page lookup once the radix_tree_delete() completes in __remove_from_page_cache. Furthermore, if the same routine causes mapping->nr_pages to go to zero before iput_final() hits truncate_inode_pages(), then the latter exits immediately. Both these cases would appear to allow iput_final() to release the inode before vmscan gets round to unlocking the mapping->tree_lock since truncate_inode_pages() no longer thinks it has any work to do. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir 2010-12-01 22:24 ` Linus Torvalds 2010-12-01 22:38 ` Andrew Morton @ 2010-12-01 22:43 ` Hugh Dickins 1 sibling, 0 replies; 70+ messages in thread From: Hugh Dickins @ 2010-12-01 22:43 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Trond Myklebust, Nick Piggin, Nick Bowler, Linux Kernel Mailing List, linux-nfs, Rik van Riel, Christoph Hellwig, Al Viro On Wed, 1 Dec 2010, Linus Torvalds wrote: > On Wed, Dec 1, 2010 at 2:13 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > > On Wed, 01 Dec 2010 16:51:12 -0500 > > Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > > >> On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote: > >> > Probably on most call paths we'll be OK - if a process is in the middle > >> > of a file truncate, holdin a file* ref which holds an inode ref then > >> > nobody will be unmounting that fs and hence nobody will be unloading > >> > that module. > >> > > >> > However on the random_code->alloc_page->vmscan->releasepage path, none > >> > of that applies. > >> > >> Just out of interest, what ensures that the mapping is still around for > >> the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()? > > > > Nothing, afacit. > > No, we're good. > > Module unload has to go through a "stop_machine()" cycle, and that in > turn requires an idle period for everything. And just a preemption > reschedule isn't enough for that. > > So what is sufficient is that > > - we had the page locked and on the mapping > > This implies that we had an inode reference to the module, and the > page lock means that the inode reference cannot go away (because it > will involve invalidate-pages etc) I'm not so sure of that: doesn't it test inode->i_data.nrpages in various places, and skip ahead if that is already 0? I don't see the necessary serialization when nrpages comes down to 0. > > - we're not sleeping after __remove_mapping, so unload can't happen afterwards. > > A _lot_ of the module races depend on that latter thing. We have > almost no cases that are strictly about actual reference counts etc. Okay, I'm reassured on my module unload point; but not on the question of safety of spin_unlock_irq(&mapping->tree_lock) which Trond lobbed back in return. Hugh ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH 2/3] NFS: lock the readdir page while it is in use 2010-11-30 22:10 ` Linus Torvalds ` (2 preceding siblings ...) 2010-12-01 3:47 ` [PATCH 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust @ 2010-12-01 3:47 ` Trond Myklebust 2010-12-01 4:10 ` Linus Torvalds 2010-12-01 3:47 ` [PATCH 3/3] NFS: Fix a memory leak in nfs_readdir Trond Myklebust 4 siblings, 1 reply; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 3:47 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler; +Cc: Linux Kernel Mailing List, linux-nfs Otherwise, the VM may end up removing it while we're reading from it. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 16 +++++++++++++++- 1 files changed, 15 insertions(+), 1 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 4a78d2b..6edef12 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -639,6 +639,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page) static void cache_page_release(nfs_readdir_descriptor_t *desc) { + unlock_page(desc->page); page_cache_release(desc->page); desc->page = NULL; } @@ -646,8 +647,20 @@ void cache_page_release(nfs_readdir_descriptor_t *desc) static struct page *get_cache_page(nfs_readdir_descriptor_t *desc) { - return read_cache_page(desc->file->f_path.dentry->d_inode->i_mapping, + struct page *page; + + for (;;) { + page = read_cache_page(desc->file->f_path.dentry->d_inode->i_mapping, desc->page_index, (filler_t *)nfs_readdir_filler, desc); + if (IS_ERR(page)) + break; + lock_page(page); + if (page->mapping != NULL && PageUptodate(page)) + break; + unlock_page(page); + page_cache_release(page); + } + return page; } /* @@ -770,6 +783,7 @@ int uncached_readdir(nfs_readdir_descriptor_t *desc, void *dirent, desc->page_index = 0; desc->page = page; + lock_page(page); status = nfs_readdir_xdr_to_array(desc, page, inode); if (status < 0) -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
* Re: [PATCH 2/3] NFS: lock the readdir page while it is in use 2010-12-01 3:47 ` [PATCH 2/3] NFS: lock the readdir page while it is in use Trond Myklebust @ 2010-12-01 4:10 ` Linus Torvalds 2010-12-01 4:29 ` Trond Myklebust 0 siblings, 1 reply; 70+ messages in thread From: Linus Torvalds @ 2010-12-01 4:10 UTC (permalink / raw) To: Trond Myklebust; +Cc: Nick Bowler, Linux Kernel Mailing List, linux-nfs On Tue, Nov 30, 2010 at 7:47 PM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > Otherwise, the VM may end up removing it while we're reading from it. I don't think this is valid. Maybe it fixes a bug, but the commit description is misleading at best. Since you have a reference count to the page, the page is not going away. Locking may hide some other bug (due to serializing with other code you care about), but it is _not_ about the "VM may end up removing it". Even from a serialization angle, I think this patch is a bit suspect, since readdir() will always be called under the inode semaphore, so I think you'll always be serialized wrt other readdir users. Of course, you may have invalidation events etc that are outside of readdir, so ... Anyway if this patch matters, there's something else going on, and you need to describe that. Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH 2/3] NFS: lock the readdir page while it is in use 2010-12-01 4:10 ` Linus Torvalds @ 2010-12-01 4:29 ` Trond Myklebust 2010-12-01 5:06 ` Linus Torvalds 2010-12-01 13:14 ` Rik van Riel 0 siblings, 2 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 4:29 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Bowler, Linux Kernel Mailing List, linux-nfs On Tue, 2010-11-30 at 20:10 -0800, Linus Torvalds wrote: > On Tue, Nov 30, 2010 at 7:47 PM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > Otherwise, the VM may end up removing it while we're reading from it. > > I don't think this is valid. > > Maybe it fixes a bug, but the commit description is misleading at > best. Since you have a reference count to the page, the page is not > going away. Locking may hide some other bug (due to serializing with > other code you care about), but it is _not_ about the "VM may end up > removing it". > > Even from a serialization angle, I think this patch is a bit suspect, > since readdir() will always be called under the inode semaphore, so I > think you'll always be serialized wrt other readdir users. Of course, > you may have invalidation events etc that are outside of readdir, so > ... I'm not worried about other readdir calls invalidating the page. My concern is rather about the VM memory reclaimers ejecting the page from the page cache, and calling nfs_readdir_clear_array while we're referencing the page. This wasn't a problem with the previous readdir code, but it will be with the new incarnation because the actual filenames are stored outside the page itself. As far as I can see, the only way to protect against that is to lock the page, perform the usual tests and then release the page lock when we're done... > Anyway if this patch matters, there's something else going on, and you > need to describe that. No problem. I just wanted to get the patches out so that the people who are reporting regressions can start testing. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH 2/3] NFS: lock the readdir page while it is in use 2010-12-01 4:29 ` Trond Myklebust @ 2010-12-01 5:06 ` Linus Torvalds 2010-12-01 14:49 ` Trond Myklebust 2010-12-01 13:14 ` Rik van Riel 1 sibling, 1 reply; 70+ messages in thread From: Linus Torvalds @ 2010-12-01 5:06 UTC (permalink / raw) To: Trond Myklebust; +Cc: Nick Bowler, Linux Kernel Mailing List, linux-nfs On Tue, Nov 30, 2010 at 8:29 PM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > I'm not worried about other readdir calls invalidating the page. My > concern is rather about the VM memory reclaimers ejecting the page from > the page cache, and calling nfs_readdir_clear_array while we're > referencing the page. I think you're making a fundamental mistake here, and you're confused by a much deeper problem. The thing is, the ".releasepage" callback gets called _before_ the page actually gets removed from the page cache, and there is no guarantee that it will always be removed at all! In fact, anybody holding a reference to it will result in page_freeze_refs() not successfully clearing all the refs, and that in turn will abort the actual freeing of the page. So while you hold the page count, your page will NOT be freed. Guaranteed. But it is true that the ".releasepage()" function may be called. So if your NFS release callback ends up invalidating the data on that page, that page lock thing will make a difference, yes. But at the same time, are you sure that you are able to then handle the case of that page still existing in the page cache and being used afterwards? Looking at the code, it doesn't look that way to me. So I think you're confused, and the NFS code totally incorrectly thinks that ".releasepage" is something that happens at the last use of the page. It simply is not so. In fact, you seem to return 0, which I think means "failure to release", so the VM will just mark it busy again afterwards. Now, I think you do have a few options: - keep the current model. BUT! In the page cache release function (nfs_readdir_clear_array), make sure that you also clear the up-to-date bit, so that the page gets read back in since it no longer contains any valid information. And return success for the "releasepage" operatioin. Alternatively: - introduce a callback for the case of the page actually being gone from the page cache, which gets called _after_ the removal. which seems to be what you really want, since for you the releasepage thing is about releasing the data structures associated with that cache. So you don't want to worry about the page lock, and you don't want to worry about the case of "maybe it won't get released at all after this because somebody still holds a ref-count". > As far as I can see, the only way to protect against that is to lock the > page, perform the usual tests and then release the page lock when we're > done... I think one of us is confused. And it's possible that it is me. It's been a _long_ time since I looked at that code, and I may well be missing something. But I get the strong feeling you're mis-using '.releasepage' and confused about the semantics. Linus "maybe confused myself" Torvalds ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH 2/3] NFS: lock the readdir page while it is in use 2010-12-01 5:06 ` Linus Torvalds @ 2010-12-01 14:49 ` Trond Myklebust 0 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 14:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Bowler, Linux Kernel Mailing List, linux-nfs On Tue, 2010-11-30 at 21:06 -0800, Linus Torvalds wrote: > On Tue, Nov 30, 2010 at 8:29 PM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > I'm not worried about other readdir calls invalidating the page. My > > concern is rather about the VM memory reclaimers ejecting the page from > > the page cache, and calling nfs_readdir_clear_array while we're > > referencing the page. > > I think you're making a fundamental mistake here, and you're confused > by a much deeper problem. > > The thing is, the ".releasepage" callback gets called _before_ the > page actually gets removed from the page cache, and there is no > guarantee that it will always be removed at all! > > In fact, anybody holding a reference to it will result in > page_freeze_refs() not successfully clearing all the refs, and that in > turn will abort the actual freeing of the page. So while you hold the > page count, your page will NOT be freed. Guaranteed. > > But it is true that the ".releasepage()" function may be called. So if > your NFS release callback ends up invalidating the data on that page, > that page lock thing will make a difference, yes. > > But at the same time, are you sure that you are able to then handle > the case of that page still existing in the page cache and being used > afterwards? Looking at the code, it doesn't look that way to me. > > So I think you're confused, and the NFS code totally incorrectly > thinks that ".releasepage" is something that happens at the last use > of the page. It simply is not so. In fact, you seem to return 0, which > I think means "failure to release", so the VM will just mark it busy > again afterwards. > > Now, I think you do have a few options: > > - keep the current model. BUT! In the page cache release function > (nfs_readdir_clear_array), make sure that you also clear the > up-to-date bit, so that the page gets read back in since it no longer > contains any valid information. And return success for the > "releasepage" operatioin. This is the option I'm attempting for now. I'm quite fine with the page only being readable while under the page lock since this is readdir, and so we don't have to worry about mmap() and its ilk. My latest incarnation of the patches has added in all the PagePrivate() foo to make try_to_release_page() work, and also adds in an invalidatepage() address space operation (I rediscovered that truncate_inode_pages() will otherwise call block_invalidatepage, and subsequently Oops). I'm still in the process of testing, but the latest patchset appears to be doing all the right things for now. I'll post an update later today so that Nick and others can test too. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH 2/3] NFS: lock the readdir page while it is in use 2010-12-01 4:29 ` Trond Myklebust 2010-12-01 5:06 ` Linus Torvalds @ 2010-12-01 13:14 ` Rik van Riel 2010-12-01 14:55 ` Trond Myklebust 1 sibling, 1 reply; 70+ messages in thread From: Rik van Riel @ 2010-12-01 13:14 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs On 11/30/2010 11:29 PM, Trond Myklebust wrote: > I'm not worried about other readdir calls invalidating the page. My > concern is rather about the VM memory reclaimers ejecting the page from > the page cache, and calling nfs_readdir_clear_array while we're > referencing the page. Wait, if nfs_readdir_clear_array can clear the page while somebody else has a refcount to it, what good is the refcount? Why is your .releasepage modifying the data on the page, instead of just the metadata in the struct page? -- All rights reversed ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH 2/3] NFS: lock the readdir page while it is in use 2010-12-01 13:14 ` Rik van Riel @ 2010-12-01 14:55 ` Trond Myklebust 0 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 14:55 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Nick Bowler, Linux Kernel Mailing List, linux-nfs On Wed, 2010-12-01 at 08:14 -0500, Rik van Riel wrote: > On 11/30/2010 11:29 PM, Trond Myklebust wrote: > > > I'm not worried about other readdir calls invalidating the page. My > > concern is rather about the VM memory reclaimers ejecting the page from > > the page cache, and calling nfs_readdir_clear_array while we're > > referencing the page. > > Wait, if nfs_readdir_clear_array can clear the page while > somebody else has a refcount to it, what good is the > refcount? Hi Rik This is readdir, so there should normally be only one process accessing the page. If that process locks it (we don't have mmap() to worry about, so page locking is reasonable here) then the scheme is safe. Note that we do clear Pg_uptodate under the page lock in the latest version of the releasepage method (patches to be published later today after I finish testing). > Why is your .releasepage modifying the data on the page, > instead of just the metadata in the struct page? We want to be able to free up the data that is referenced by the array on the page before the page itself is freed. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH 3/3] NFS: Fix a memory leak in nfs_readdir 2010-11-30 22:10 ` Linus Torvalds ` (3 preceding siblings ...) 2010-12-01 3:47 ` [PATCH 2/3] NFS: lock the readdir page while it is in use Trond Myklebust @ 2010-12-01 3:47 ` Trond Myklebust 4 siblings, 0 replies; 70+ messages in thread From: Trond Myklebust @ 2010-12-01 3:47 UTC (permalink / raw) To: Linus Torvalds, Nick Bowler; +Cc: Linux Kernel Mailing List, linux-nfs Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/inode.c | 1 + include/linux/nfs_fs.h | 1 + 2 files changed, 2 insertions(+), 0 deletions(-) diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 314f571..0018e07 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -289,6 +289,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr) } else if (S_ISDIR(inode->i_mode)) { inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops; inode->i_fop = &nfs_dir_operations; + inode->i_data.a_ops = &nfs_dir_addr_space_ops; if (nfs_server_capable(inode, NFS_CAP_READDIRPLUS)) set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags); /* Deal with crossing mountpoints */ diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h index c66fdb7..b5d3ab0 100644 --- a/include/linux/nfs_fs.h +++ b/include/linux/nfs_fs.h @@ -401,6 +401,7 @@ extern const struct inode_operations nfs3_file_inode_operations; #endif /* CONFIG_NFS_V3 */ extern const struct file_operations nfs_file_operations; extern const struct address_space_operations nfs_file_aops; +extern const struct address_space_operations nfs_dir_addr_space_ops; static inline struct nfs_open_context *nfs_file_open_context(struct file *filp) { -- 1.7.3.2 ^ permalink raw reply related [flat|nested] 70+ messages in thread
end of thread, other threads:[~2010-12-07 7:08 UTC | newest] Thread overview: 70+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-11-30 17:42 [PATCH] NFS: Fix a readdirplus bug Trond Myklebust 2010-11-30 22:10 ` Linus Torvalds 2010-11-30 22:13 ` Trond Myklebust 2010-12-01 3:47 ` [PATCH 0/3] Fix more NFS readdir regressions Trond Myklebust 2010-12-01 3:47 ` [PATCH 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust 2010-12-01 15:04 ` Nick Bowler 2010-12-01 15:36 ` [PATCH v2 0/3] Fix more NFS readdir regressions Trond Myklebust 2010-12-01 15:36 ` [PATCH v2 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust 2010-12-01 15:36 ` [PATCH v2 2/3] NFS: lock the readdir page while it is in use Trond Myklebust 2010-12-01 15:36 ` [PATCH v2 3/3] NFS: Fix a memory leak in nfs_readdir Trond Myklebust 2010-12-01 16:17 ` Linus Torvalds 2010-12-01 16:35 ` Rik van Riel 2010-12-01 16:45 ` Benny Halevy 2010-12-01 16:47 ` Linus Torvalds 2010-12-01 17:02 ` Rik van Riel 2010-12-01 17:58 ` Trond Myklebust 2010-12-01 18:29 ` Miklos Szeredi 2010-12-01 18:54 ` Trond Myklebust 2010-12-01 19:23 ` Hugh Dickins 2010-12-01 19:52 ` Linus Torvalds 2010-12-01 20:05 ` Trond Myklebust 2010-12-01 20:39 ` Andrew Morton 2010-12-01 21:29 ` Neil Brown 2010-12-01 22:43 ` Andrew Morton 2010-12-01 23:01 ` Neil Brown 2010-12-01 19:47 ` Linus Torvalds 2010-12-01 20:10 ` Trond Myklebust 2010-12-01 20:18 ` Linus Torvalds 2010-12-01 20:31 ` Hugh Dickins 2010-12-01 20:33 ` Andrew Morton 2010-12-01 21:02 ` Hugh Dickins 2010-12-01 21:15 ` Hugh Dickins 2010-12-01 21:38 ` Andrew Morton 2010-12-01 21:51 ` Trond Myklebust 2010-12-01 22:13 ` Andrew Morton 2010-12-01 22:24 ` Linus Torvalds 2010-12-01 22:38 ` Andrew Morton 2010-12-01 22:47 ` Trond Myklebust 2010-12-01 23:21 ` Trond Myklebust 2010-12-01 23:46 ` Andrew Morton 2010-12-01 23:56 ` Trond Myklebust 2010-12-01 23:31 ` Linus Torvalds 2010-12-01 23:36 ` Andrew Morton 2010-12-02 1:05 ` Linus Torvalds 2010-12-02 1:22 ` Andrew Morton 2010-12-02 1:42 ` Linus Torvalds 2010-12-02 2:05 ` Andrew Morton 2010-12-02 3:08 ` [PATCH v3 0/3] Fix more NFS readdir regressions Trond Myklebust 2010-12-02 3:08 ` [PATCH v3 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust 2010-12-02 3:08 ` [PATCH v3 2/3] Call the filesystem back whenever a page is removed from the page cache Trond Myklebust 2010-12-02 3:34 ` Hugh Dickins 2010-12-02 3:53 ` Trond Myklebust 2010-12-02 3:58 ` Linus Torvalds 2010-12-06 16:59 ` [PATCH v4 0/3] Fix more NFS readdir regressions Trond Myklebust 2010-12-06 16:59 ` [PATCH v4 1/3] NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler Trond Myklebust 2010-12-06 16:59 ` [PATCH v4 2/3] Call the filesystem back whenever a page is removed from the page cache Trond Myklebust 2010-12-07 7:08 ` Nick Piggin 2010-12-06 16:59 ` [PATCH v4 3/3] NFS: Fix a memory leak in nfs_readdir Trond Myklebust 2010-12-02 3:08 ` [PATCH v3 " Trond Myklebust 2010-12-03 9:12 ` [PATCH v2 " Nick Piggin 2010-12-01 23:43 ` Trond Myklebust 2010-12-01 22:43 ` Hugh Dickins 2010-12-01 3:47 ` [PATCH 2/3] NFS: lock the readdir page while it is in use Trond Myklebust 2010-12-01 4:10 ` Linus Torvalds 2010-12-01 4:29 ` Trond Myklebust 2010-12-01 5:06 ` Linus Torvalds 2010-12-01 14:49 ` Trond Myklebust 2010-12-01 13:14 ` Rik van Riel 2010-12-01 14:55 ` Trond Myklebust 2010-12-01 3:47 ` [PATCH 3/3] NFS: Fix a memory leak in nfs_readdir Trond Myklebust
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).