From: Scott Mayhew <smayhew@redhat.com>
To: Jeff Layton <jlayton@kernel.org>
Cc: Trond Myklebust <trondmy@hammerspace.com>,
"anna@kernel.org" <anna@kernel.org>,
"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 1/1] NFS: Fix potential oops in nfs_inode_remove_request()
Date: Wed, 26 Jul 2023 09:43:23 -0400
Message-ID: <ZMEi+/3cmz3Vl7lq@aion>
In-Reply-To: <7b8e81b3ab44b5bc788a024dec6465adcc01d7a3.camel@kernel.org>
On Wed, 26 Jul 2023, Jeff Layton wrote:
> On Tue, 2023-07-25 at 17:41 +0000, Trond Myklebust wrote:
> > On Tue, 2023-07-25 at 12:24 -0400, Scott Mayhew wrote:
> > > On Tue, 25 Jul 2023, Trond Myklebust wrote:
> > >
> > > > On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > > > > Once a folio's private data has been cleared, it's possible for
> > > > > another
> > > > > process to clear the folio->mapping (e.g. via
> > > > > invalidate_complete_folio2
> > > > > or evict_mapping_folio), so it wouldn't be safe to call
> > > > > nfs_page_to_inode() after that.
> > > > >
> > > > > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use
> > > > > folios")
> > > > > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > > > > ---
> > > > > fs/nfs/write.c | 4 +++-
> > > > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > index f4cca8f00c0c..489c3f9dae23 100644
> > > > > --- a/fs/nfs/write.c
> > > > > +++ b/fs/nfs/write.c
> > > > > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
> > > > >   */
> > > > >  static void nfs_inode_remove_request(struct nfs_page *req)
> > > > >  {
> > > > > +	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > > > > +
> > > > >  	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > > > >  		struct folio *folio = nfs_page_to_folio(req->wb_head);
> > > > >  		struct address_space *mapping = folio_file_mapping(folio);
> > > > > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> > > > > 
> > > > >  	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> > > > >  		nfs_release_request(req);
> > > > > -		atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
> > > > > +		atomic_long_dec(&nfsi->nrequests);
> > > >
> > > > Why not just invert the order of the atomic_long_dec() and the
> > > > nfs_release_request()? That way you are also ensuring that the
> > > > inode is
> > > > still pinned in memory by the open context.
> > >
> > > I'm not following. How does inverting the order prevent the
> > > folio->mapping from getting clobbered?
> > >
> >
> > The open/lock context is refcounted by the nfs_page until the latter is
> > released. That's why the inode is guaranteed to remain around at least
> > until the call to nfs_release_request().
> >
>
> The problem is not that the inode is going away, but rather that we
> can't guarantee that the page is still part of the mapping at this
> point, and so we can't safely dereference page->mapping there. I do see
> that nfs_release_request releases a reference to the page, but I don't
> think that's sufficient to ensure that it remains part of the mapping.
>
> AFAICT, once we clear page->private, the page is subject to be removed
> from the mapping. So, I *think* it's safe to just move the call to
> nfs_page_to_inode prior to the call to nfs_page_group_sync_on_bit.
Yeah, the inode hasn't gone away. I can pick the nfs_commit_data
address off the stack in nfs_commit_release_pages:
crash> nfs_commit_data.inode c0000006774cae00
inode = 0xc00000006c1b05f8,
The nfs_inode is still allocated:
crash> kmem 0xc00000006c1b05f8
CACHE             OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE  NAME
c000000030332600     1088        128   5959    101    64k  nfs_inode_cache
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  c00c0000001b06c0  c00000006c1b0000     0     59          1    58
  FREE / [ALLOCATED]
  [c00000006c1b0448]
      PAGE          PHYSICAL  MAPPING           INDEX             CNT  FLAGS
c00c0000001b06c0    6c1b0000  c000000030332600  c00000006c1b4480    1  23ffff800000200 slab
The vfs_inode:
crash> px &((struct nfs_inode *)0xc00000006c1b0448)->vfs_inode
$7 = (struct inode *) 0xc00000006c1b05f8
That matches the inode opened by both nfs_flock programs from the test:
crash> foreach nfs_flock files
PID: 4006780 TASK: c00000009d436600 CPU: 43 COMMAND: "nfs_flock"
ROOT: / CWD: /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/
FD FILE DENTRY INODE TYPE PATH
0 c000000196e9a000 c000000004090840 c00000000ae13bf0 CHR /dev/null
1 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG /opt/ltp/output/nfslock01.sh_20230610112802
2 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG /opt/ltp/output/nfslock01.sh_20230610112802
3 c000000196e97700 c000000419ccb040 c00000006c1b05f8 REG /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/flock_idata
^^^^^^^^^^^^^^^^
PID: 4006781 TASK: c00000009d42d500 CPU: 42 COMMAND: "nfs_flock"
ROOT: / CWD: /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/
FD FILE DENTRY INODE TYPE PATH
0 c0000000f0812200 c000000004090840 c00000000ae13bf0 CHR /dev/null
1 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG /opt/ltp/output/nfslock01.sh_20230610112802
2 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG /opt/ltp/output/nfslock01.sh_20230610112802
3 c0000000f0813c00 c000000419ccb040 c00000006c1b05f8 REG /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/flock_idata
^^^^^^^^^^^^^^^^
The file->f_mapping for both struct files matches the inode->i_data:
crash> file.f_mapping c000000196e97700
f_mapping = 0xc00000006c1b0770,
crash> file.f_mapping c0000000f0813c00
f_mapping = 0xc00000006c1b0770,
crash> px &((struct inode *)0xc00000006c1b05f8)->i_data
$8 = (struct address_space *) 0xc00000006c1b0770
and if I look at one of those nfs_flock tasks, the folio
passed in to nfs_read_folio has the same mapping:
crash> bt 4006781
PID: 4006781 TASK: c00000009d42d500 CPU: 42 COMMAND: "nfs_flock"
#0 [c000000177053710] __schedule at c000000000f61d9c
#1 [c0000001770537d0] schedule at c000000000f621f4
#2 [c000000177053840] io_schedule at c000000000f62354
#3 [c000000177053870] folio_wait_bit_common at c00000000042dc60
#4 [c000000177053970] nfs_read_folio at c0080000050108a8 [nfs]
#5 [c000000177053a60] nfs_write_begin at c008000004fff06c [nfs]
#6 [c000000177053b10] generic_perform_write at c00000000042b044
#7 [c000000177053bc0] nfs_file_write at c008000004ffda08 [nfs]
#8 [c000000177053c60] new_sync_write at c00000000057fdd8
#9 [c000000177053d10] vfs_write at c000000000582fd4
#10 [c000000177053d60] ksys_write at c0000000005833a4
#11 [c000000177053db0] system_call_exception at c00000000002f434
#12 [c000000177053e10] system_call_vectored_common at c00000000000bfe8
crash> folio.mapping c00c000000564400
mapping = 0xc00000006c1b0770,
It's just that if we go back to the nfs_page being released by our panic
task, the folio->mapping has been cleared, so we panic when we try to
dereference folio->mapping->host:
crash> nfs_page c00000016fb2a600
struct nfs_page {
  wb_list = {
    next = 0xc00000016fb2a600,
    prev = 0xc00000016fb2a600
  },
  {
    wb_page = 0xc00c000001d49580,
    wb_folio = 0xc00c000001d49580
  },
  wb_lock_context = 0xc00000010518b2c0,
  wb_index = 0x1,
  wb_offset = 0x6940,
  wb_pgbase = 0x6940,
  wb_bytes = 0x40,
  wb_kref = {
    refcount = {
      refs = {
        counter = 0x1
      }
    }
  },
  wb_flags = 0x5,
  wb_verf = {
    data = "\214\205_d\214\210W\036"
  },
  wb_this_page = 0xc00000016fb2a600,
  wb_head = 0xc00000016fb2a600,
  wb_nio = 0x0
}
crash> folio.mapping 0xc00c000001d49580
mapping = 0x0,
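For completeness, the dereference that blows up is in nfs_page_to_inode().
Paraphrasing the helper in fs/nfs/internal.h at this point in the tree (so
treat the exact body below as a sketch rather than verbatim), it is roughly:

/* Sketch only - paraphrased from fs/nfs/internal.h, not copied verbatim. */
static inline struct inode *nfs_page_to_inode(const struct nfs_page *req)
{
	struct folio *folio = nfs_page_to_folio(req);

	if (folio == NULL)
		return page_file_mapping(req->wb_page)->host;
	/*
	 * folio_file_mapping() boils down to folio->mapping for a
	 * non-swapcache folio, so once the folio has been dropped from
	 * the page cache and ->mapping is NULL, the ->host dereference
	 * below is the oops.
	 */
	return folio_file_mapping(folio)->host;
}

That's why hoisting the nfs_page_to_inode() call to the top of
nfs_inode_remove_request(), before the PG_REMOVE handling clears the
folio's private data, closes the window.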
-Scott
> --
> Jeff Layton <jlayton@kernel.org>
>