* [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
@ 2011-03-04 16:44 Neil Horman
2011-03-04 16:58 ` Christoph Hellwig
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Neil Horman @ 2011-03-04 16:44 UTC (permalink / raw)
To: linux-nfs; +Cc: Neil Horman, Trond Myklebust, security, Jeff Layton
This oops was reported recently:
<IRQ> [<ffffffff800ca161>] bad_page+0x69/0x91
[<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144
[<ffffffff8022f28a>] skb_release_data+0x5f/0x98
[<ffffffff80028b88>] __kfree_skb+0x11/0x1a
[<ffffffff800243b8>] tcp_ack+0x6a3/0x1868
[<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9
[<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa
[<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6
[<ffffffff80099b45>] do_timer+0x2df/0x52c
[<ffffffff8003495c>] ip_local_deliver+0x19d/0x263
[<ffffffff80035ab8>] ip_rcv+0x539/0x57c
[<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f
[<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5
[<ffffffff8000c979>] net_rx_action+0xac/0x1b3
[<ffffffff80012464>] __do_softirq+0x89/0x133
[<ffffffff8005e2fc>] call_softirq+0x1c/0x28
[<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
[<ffffffff8006d485>] do_IRQ+0xec/0xf5
[<ffffffff8006bdb5>] default_idle+0x0/0x50
[<ffffffff8005d615>] ret_from_intr+0x0/0xa
<EOI> [<ffffffff8006bdde>] default_idle+0x29/0x50
[<ffffffff800492fd>] cpu_idle+0x95/0xb8
[<ffffffff80461807>] start_kernel+0x220/0x225
[<ffffffff8046122f>] _sinittext+0x22f/0x236
It occurs, because an skb with a fraglist was freed from the tcp retransmit
queue when it was acked, but a page on that fraglist had PG_Slab set (indicating
it was allocated from the Slab allocator (which means the free path above can't
safely free it via put_page)
We tracked this back to an nfsv4 setacl operation, in which the nfs code
attempted to fill convert the passed in buffer to an array of pages in
__nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages.
__nfs4_proc_set_acl just converts each page in the buffer to a page struct via
virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab
bit is set. We can't create a buffer with kmalloc and free it later in the tcp
ack path with put_page, so we need to either:
1) ensure that when we create the list of pages, no page struct has PG_Slab set
or
2) not use a page list to send this data
Given that these buffers can be multiple pages and arbitrarily sized, I think
(1) is the right way to go. I've written the below patch to allocate a page
from the buddy allocator directly and copy the data over to it. This ensures
that we have a put_page free-able page for every entry that winds up on an skb
frag list, so it can be safely freed when the frame is acked. We do a put page
on each entry after the rpc_call_sync call so as to drop our own reference count
to the page, leaving only the ref count taken by tcp_sendpages. This way the
data will be properly freed when the ack comes in
Successfully tested by myself to solve the above oops.
Note, as this is the result of a setacl operation that exceeded a page of data,
I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm
CCing security on this as well.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: security@kernel.org
CC: Jeff Layton <jlayton@redhat.com>
---
fs/nfs/nfs4proc.c | 42 ++++++++++++++++++++++++++++++++++++++++--
1 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 78936a8..e4bd9b5 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -51,6 +51,7 @@
#include <linux/sunrpc/bc_xprt.h>
#include <linux/xattr.h>
#include <linux/utsname.h>
+#include <linux/mm.h>
#include "nfs4_fs.h"
#include "delegation.h"
@@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen,
}
}
+static int buf_to_pages_noslab(const void *buf, size_t buflen,
+ struct page **pages, unsigned int *pgbase)
+{
+ const void *p = buf;
+ struct page *page, *newpage, **spages;
+ int rc = -ENOMEM;
+
+ spages = pages;
+ *pgbase = offset_in_page(buf);
+ p -= *pgbase;
+ while (p < buf + buflen) {
+ page = virt_to_page(p);
+ newpage = alloc_page(GFP_KERNEL);
+ if (!newpage)
+ goto unwind;
+ memcpy(page_address(newpage), page_address(page),
+ PAGE_CACHE_SIZE);
+ *(pages++) = newpage;
+ p += PAGE_CACHE_SIZE;
+ }
+
+ rc = 0;
+ return rc;
+
+unwind:
+ while (*spages != NULL)
+ __free_page(*(spages++));
+ return rc;
+}
+
struct nfs4_cached_acl {
int cached;
size_t len;
@@ -3420,13 +3451,20 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl
.rpc_argp = &arg,
.rpc_resp = &res,
};
- int ret;
+ int ret, i;
if (!nfs4_server_supports_acls(server))
return -EOPNOTSUPP;
+ memset(pages, 0, sizeof(struct page *) * NFS4ACL_MAXPAGES);
+ ret = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
+ if (ret)
+ return ret;
nfs_inode_return_delegation(inode);
- buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
ret = nfs4_call_sync(server, &msg, &arg, &res, 1);
+
+ for (i = 0; pages[i] != NULL; i++)
+ put_page(pages[i]);
+
/*
* Acl update can result in inode attribute update.
* so mark the attribute cache invalid.
--
1.7.4
^ permalink raw reply related [flat|nested] 17+ messages in thread* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 16:44 [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab Neil Horman @ 2011-03-04 16:58 ` Christoph Hellwig 2011-03-04 17:13 ` J. Bruce Fields 2011-03-04 19:01 ` Trond Myklebust 2 siblings, 0 replies; 17+ messages in thread From: Christoph Hellwig @ 2011-03-04 16:58 UTC (permalink / raw) To: Neil Horman Cc: linux-nfs, Trond Myklebust, security, Jeff Layton, linux-fsdevel > It occurs, because an skb with a fraglist was freed from the tcp retransmit > queue when it was acked, but a page on that fraglist had PG_Slab set (indicating > it was allocated from the Slab allocator (which means the free path above can't > safely free it via put_page) > > We tracked this back to an nfsv4 setacl operation, in which the nfs code > attempted to fill convert the passed in buffer to an array of pages in > __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages. > __nfs4_proc_set_acl just converts each page in the buffer to a page struct via > virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab > bit is set. We can't create a buffer with kmalloc and free it later in the tcp > ack path with put_page, so we need to either: Note that the same thing can also happen with ext4 sending kmalloc'ed buffers through network based block storage. Long time ago we had the same issue with XFS and went to great length to avoid sending down kmalloc'ed pages, but recently we (fs developers) were told that's just fine. Consequently ext4 now actually uses slab pages for I/O, and we are planning to make use of that fact soon again in XFS, too. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 16:44 [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab Neil Horman 2011-03-04 16:58 ` Christoph Hellwig @ 2011-03-04 17:13 ` J. Bruce Fields 2011-03-04 18:45 ` Neil Horman 2011-03-04 19:01 ` Trond Myklebust 2 siblings, 1 reply; 17+ messages in thread From: J. Bruce Fields @ 2011-03-04 17:13 UTC (permalink / raw) To: Neil Horman; +Cc: linux-nfs, Trond Myklebust, security, Jeff Layton On Fri, Mar 04, 2011 at 11:44:13AM -0500, Neil Horman wrote: > This oops was reported recently: > > <IRQ> [<ffffffff800ca161>] bad_page+0x69/0x91 > [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144 > [<ffffffff8022f28a>] skb_release_data+0x5f/0x98 > [<ffffffff80028b88>] __kfree_skb+0x11/0x1a > [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868 > [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9 > [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa > [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6 > [<ffffffff80099b45>] do_timer+0x2df/0x52c > [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263 > [<ffffffff80035ab8>] ip_rcv+0x539/0x57c > [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f > [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5 > [<ffffffff8000c979>] net_rx_action+0xac/0x1b3 > [<ffffffff80012464>] __do_softirq+0x89/0x133 > [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 > [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d > [<ffffffff8006d485>] do_IRQ+0xec/0xf5 > [<ffffffff8006bdb5>] default_idle+0x0/0x50 > [<ffffffff8005d615>] ret_from_intr+0x0/0xa > <EOI> [<ffffffff8006bdde>] default_idle+0x29/0x50 > [<ffffffff800492fd>] cpu_idle+0x95/0xb8 > [<ffffffff80461807>] start_kernel+0x220/0x225 > [<ffffffff8046122f>] _sinittext+0x22f/0x236 > > It occurs, because an skb with a fraglist was freed from the tcp retransmit > queue when it was acked, but a page on that fraglist had PG_Slab set (indicating > it was allocated from the Slab allocator (which means the free path above can't > safely free it via put_page) > > We tracked this back to an nfsv4 setacl operation, in which the nfs code > attempted to fill convert the passed in buffer to an array of pages in > __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages. > __nfs4_proc_set_acl just converts each page in the buffer to a page struct via > virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab > bit is set. My code originally, I think, so apologies, and thanks for the fix! > We can't create a buffer with kmalloc and free it later in the tcp > ack path with put_page, so we need to either: But out of curiosity: why is there this rule? > 1) ensure that when we create the list of pages, no page struct has PG_Slab set > or > 2) not use a page list to send this data > > Given that these buffers can be multiple pages and arbitrarily sized, I think > (1) is the right way to go. I've written the below patch to allocate a page > from the buddy allocator directly and copy the data over to it. This ensures > that we have a put_page free-able page for every entry that winds up on an skb > frag list, so it can be safely freed when the frame is acked. We do a put page > on each entry after the rpc_call_sync call so as to drop our own reference count > to the page, leaving only the ref count taken by tcp_sendpages. This way the > data will be properly freed when the ack comes in Seems fine to me. There's one thing I still don't understand: why was this only triggered for large ACLs? The setacl doesn't doesn't treat the first 4k specially that I can see: even a 1-byte acl would get translated into a page. Is it that the VFS creates the buffer it hands down differently in that case? --b. > Successfully tested by myself to solve the above oops. > > Note, as this is the result of a setacl operation that exceeded a page of data, > I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm > CCing security on this as well. > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > CC: Trond Myklebust <Trond.Myklebust@netapp.com> > CC: security@kernel.org > CC: Jeff Layton <jlayton@redhat.com> > --- > fs/nfs/nfs4proc.c | 42 ++++++++++++++++++++++++++++++++++++++++-- > 1 files changed, 40 insertions(+), 2 deletions(-) > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c > index 78936a8..e4bd9b5 100644 > --- a/fs/nfs/nfs4proc.c > +++ b/fs/nfs/nfs4proc.c > @@ -51,6 +51,7 @@ > #include <linux/sunrpc/bc_xprt.h> > #include <linux/xattr.h> > #include <linux/utsname.h> > +#include <linux/mm.h> > > #include "nfs4_fs.h" > #include "delegation.h" > @@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen, > } > } > > +static int buf_to_pages_noslab(const void *buf, size_t buflen, > + struct page **pages, unsigned int *pgbase) > +{ > + const void *p = buf; > + struct page *page, *newpage, **spages; > + int rc = -ENOMEM; > + > + spages = pages; > + *pgbase = offset_in_page(buf); > + p -= *pgbase; > + while (p < buf + buflen) { > + page = virt_to_page(p); > + newpage = alloc_page(GFP_KERNEL); > + if (!newpage) > + goto unwind; > + memcpy(page_address(newpage), page_address(page), > + PAGE_CACHE_SIZE); > + *(pages++) = newpage; > + p += PAGE_CACHE_SIZE; > + } > + > + rc = 0; > + return rc; > + > +unwind: > + while (*spages != NULL) > + __free_page(*(spages++)); > + return rc; > +} > + > struct nfs4_cached_acl { > int cached; > size_t len; > @@ -3420,13 +3451,20 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl > .rpc_argp = &arg, > .rpc_resp = &res, > }; > - int ret; > + int ret, i; > > if (!nfs4_server_supports_acls(server)) > return -EOPNOTSUPP; > + memset(pages, 0, sizeof(struct page *) * NFS4ACL_MAXPAGES); > + ret = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase); > + if (ret) > + return ret; > nfs_inode_return_delegation(inode); > - buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase); > ret = nfs4_call_sync(server, &msg, &arg, &res, 1); > + > + for (i = 0; pages[i] != NULL; i++) > + put_page(pages[i]); > + > /* > * Acl update can result in inode attribute update. > * so mark the attribute cache invalid. > -- > 1.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 17:13 ` J. Bruce Fields @ 2011-03-04 18:45 ` Neil Horman 2011-03-04 19:33 ` J. Bruce Fields 0 siblings, 1 reply; 17+ messages in thread From: Neil Horman @ 2011-03-04 18:45 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs, Trond Myklebust, security, Jeff Layton On Fri, Mar 04, 2011 at 12:13:21PM -0500, J. Bruce Fields wrote: > On Fri, Mar 04, 2011 at 11:44:13AM -0500, Neil Horman wrote: > > This oops was reported recently: > > > > <IRQ> [<ffffffff800ca161>] bad_page+0x69/0x91 > > [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144 > > [<ffffffff8022f28a>] skb_release_data+0x5f/0x98 > > [<ffffffff80028b88>] __kfree_skb+0x11/0x1a > > [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868 > > [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9 > > [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa > > [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6 > > [<ffffffff80099b45>] do_timer+0x2df/0x52c > > [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263 > > [<ffffffff80035ab8>] ip_rcv+0x539/0x57c > > [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f > > [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5 > > [<ffffffff8000c979>] net_rx_action+0xac/0x1b3 > > [<ffffffff80012464>] __do_softirq+0x89/0x133 > > [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 > > [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d > > [<ffffffff8006d485>] do_IRQ+0xec/0xf5 > > [<ffffffff8006bdb5>] default_idle+0x0/0x50 > > [<ffffffff8005d615>] ret_from_intr+0x0/0xa > > <EOI> [<ffffffff8006bdde>] default_idle+0x29/0x50 > > [<ffffffff800492fd>] cpu_idle+0x95/0xb8 > > [<ffffffff80461807>] start_kernel+0x220/0x225 > > [<ffffffff8046122f>] _sinittext+0x22f/0x236 > > > > It occurs, because an skb with a fraglist was freed from the tcp retransmit > > queue when it was acked, but a page on that fraglist had PG_Slab set (indicating > > it was allocated from the Slab allocator (which means the free path above can't > > safely free it via put_page) > > > > We tracked this back to an nfsv4 setacl operation, in which the nfs code > > attempted to fill convert the passed in buffer to an array of pages in > > __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages. > > __nfs4_proc_set_acl just converts each page in the buffer to a page struct via > > virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab > > bit is set. > > My code originally, I think, so apologies, and thanks for the fix! > No problem :) > > We can't create a buffer with kmalloc and free it later in the tcp > > ack path with put_page, so we need to either: > > But out of curiosity: why is there this rule? > Its an artifact the results from needing to free memory using a method in keeping with the way in which it was allocated. To use this bug as an example, the acl data was allocated by the VFS using kmalloc, which gets data from the slab. Even though this data is a size that is a multiple of a page, slab objects can be less than a page, and multiple objects can be stored in a single page. As such, anything allocated from the slab allocator needs to be freed by the slab allocator, so that object reference counts internally maintained by the slab can be kept accurate. Couple that with the fact that, once you pass this buffer to the network stack, its the network stack thats responsible for it, particularly for freeing it once its been transmitted. Given that the network stack wasn't responsible for allocating it, but still needs to know how to free it, there must be some convention used to help the stack know how to return it to the heap. In this case the convention is that 'pages placed in skb->frags must have been allocated by the buddy allocator'. In guaranteeing that, the network stack can safely use put_page to free the pages later on when the data has finished transmitting. Thats why we have the PG_Slab check in the free path on the stack trace above. If we just called put_page on all these pages, and one or more of the pages were from the slab allocator, we would get all sorts of corruption, because we might return a page that contains multiple slab objects to the buddy allocator while the slab was still in use, and the slab allocator would loose the tracking data on those objects, leading to further corruption. so we do this PG_Slab check in free_pages_check to ensure that we're not freeing something using a method other than the one used to allocate it. > > 1) ensure that when we create the list of pages, no page struct has PG_Slab set > > or > > 2) not use a page list to send this data > > > > Given that these buffers can be multiple pages and arbitrarily sized, I think > > (1) is the right way to go. I've written the below patch to allocate a page > > from the buddy allocator directly and copy the data over to it. This ensures > > that we have a put_page free-able page for every entry that winds up on an skb > > frag list, so it can be safely freed when the frame is acked. We do a put page > > on each entry after the rpc_call_sync call so as to drop our own reference count > > to the page, leaving only the ref count taken by tcp_sendpages. This way the > > data will be properly freed when the ack comes in > > Seems fine to me. > > There's one thing I still don't understand: why was this only triggered > for large ACLs? The setacl doesn't doesn't treat the first 4k specially > that I can see: even a 1-byte acl would get translated into a page. Is > it that the VFS creates the buffer it hands down differently in that > case? > Thats an artifact of tcp_sendpages (the proto specific function called by kernel_sendpages). If you send less then a page of data via that function, it doesn't use the fraglist to store data, it just inlines the page into the skb. Neil > --b. > > > Successfully tested by myself to solve the above oops. > > > > Note, as this is the result of a setacl operation that exceeded a page of data, > > I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm > > CCing security on this as well. > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > CC: Trond Myklebust <Trond.Myklebust@netapp.com> > > CC: security@kernel.org > > CC: Jeff Layton <jlayton@redhat.com> > > --- > > fs/nfs/nfs4proc.c | 42 ++++++++++++++++++++++++++++++++++++++++-- > > 1 files changed, 40 insertions(+), 2 deletions(-) > > > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c > > index 78936a8..e4bd9b5 100644 > > --- a/fs/nfs/nfs4proc.c > > +++ b/fs/nfs/nfs4proc.c > > @@ -51,6 +51,7 @@ > > #include <linux/sunrpc/bc_xprt.h> > > #include <linux/xattr.h> > > #include <linux/utsname.h> > > +#include <linux/mm.h> > > > > #include "nfs4_fs.h" > > #include "delegation.h" > > @@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen, > > } > > } > > > > +static int buf_to_pages_noslab(const void *buf, size_t buflen, > > + struct page **pages, unsigned int *pgbase) > > +{ > > + const void *p = buf; > > + struct page *page, *newpage, **spages; > > + int rc = -ENOMEM; > > + > > + spages = pages; > > + *pgbase = offset_in_page(buf); > > + p -= *pgbase; > > + while (p < buf + buflen) { > > + page = virt_to_page(p); > > + newpage = alloc_page(GFP_KERNEL); > > + if (!newpage) > > + goto unwind; > > + memcpy(page_address(newpage), page_address(page), > > + PAGE_CACHE_SIZE); > > + *(pages++) = newpage; > > + p += PAGE_CACHE_SIZE; > > + } > > + > > + rc = 0; > > + return rc; > > + > > +unwind: > > + while (*spages != NULL) > > + __free_page(*(spages++)); > > + return rc; > > +} > > + > > struct nfs4_cached_acl { > > int cached; > > size_t len; > > @@ -3420,13 +3451,20 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl > > .rpc_argp = &arg, > > .rpc_resp = &res, > > }; > > - int ret; > > + int ret, i; > > > > if (!nfs4_server_supports_acls(server)) > > return -EOPNOTSUPP; > > + memset(pages, 0, sizeof(struct page *) * NFS4ACL_MAXPAGES); > > + ret = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase); > > + if (ret) > > + return ret; > > nfs_inode_return_delegation(inode); > > - buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase); > > ret = nfs4_call_sync(server, &msg, &arg, &res, 1); > > + > > + for (i = 0; pages[i] != NULL; i++) > > + put_page(pages[i]); > > + > > /* > > * Acl update can result in inode attribute update. > > * so mark the attribute cache invalid. > > -- > > 1.7.4 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 18:45 ` Neil Horman @ 2011-03-04 19:33 ` J. Bruce Fields 2011-03-04 19:48 ` [Security] " Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: J. Bruce Fields @ 2011-03-04 19:33 UTC (permalink / raw) To: Neil Horman; +Cc: linux-nfs, Trond Myklebust, security, Jeff Layton On Fri, Mar 04, 2011 at 01:45:06PM -0500, Neil Horman wrote: > On Fri, Mar 04, 2011 at 12:13:21PM -0500, J. Bruce Fields wrote: > > On Fri, Mar 04, 2011 at 11:44:13AM -0500, Neil Horman wrote: > > > We can't create a buffer with kmalloc and free it later in the tcp > > > ack path with put_page, so we need to either: > > > > But out of curiosity: why is there this rule? > > > Its an artifact the results from needing to free memory using a method > in keeping with the way in which it was allocated. To use this bug as > an example, the acl data was allocated by the VFS using kmalloc, which > gets data from the slab. Even though this data is a size that is a > multiple of a page, slab objects can be less than a page, and multiple > objects can be stored in a single page. As such, anything allocated > from the slab allocator needs to be freed by the slab allocator, so > that object reference counts internally maintained by the slab can be > kept accurate. OK. I guess my naive mental model was that the slab allocator was layered on top of the page allocator--so it got pages with alloc_pages() or equivalent, then handed out pieces of them to people using the slab allocator. So I assumed the slab allocator would hold a reference to the page like any other user would, in which case the tcp code could take a second reference of its own. --b. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Security] [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 19:33 ` J. Bruce Fields @ 2011-03-04 19:48 ` Linus Torvalds 2011-03-04 20:07 ` J. Bruce Fields 0 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2011-03-04 19:48 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Horman, linux-nfs, Trond Myklebust, security, Jeff Layton On Fri, Mar 4, 2011 at 11:33 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > > So I assumed the slab allocator would hold a reference to the page like > any other user would, in which case the tcp code could take a second > reference of its own. So the reason that wouldn't work is simple: the reference is obviously at a page level, but slab doles out allocations on its own level. What does that mean? Imagine if the network layer takes a ref on the page, but then the original user does a "kfree()". The _page_ would stay around (we have a ref from it - but so does the slab allocator), but the thing is, the slab allocator will release and then re-use the slab entry. So the "hold a reference to the page" doesn't actually _help_. The problem isn't the page going away, it's the smaller slab-allocation being reused for something else - so the page-level ref would be useless. So page-level references really only do work with page allocators. They don't know about the allocation patterns within a page that slab does. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Security] [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 19:48 ` [Security] " Linus Torvalds @ 2011-03-04 20:07 ` J. Bruce Fields 2011-03-04 20:30 ` Jeff Layton 0 siblings, 1 reply; 17+ messages in thread From: J. Bruce Fields @ 2011-03-04 20:07 UTC (permalink / raw) To: Linus Torvalds Cc: Neil Horman, linux-nfs, Trond Myklebust, security, Jeff Layton On Fri, Mar 04, 2011 at 11:48:23AM -0800, Linus Torvalds wrote: > On Fri, Mar 4, 2011 at 11:33 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > > > > So I assumed the slab allocator would hold a reference to the page like > > any other user would, in which case the tcp code could take a second > > reference of its own. > > So the reason that wouldn't work is simple: the reference is obviously > at a page level, but slab doles out allocations on its own level. > > What does that mean? Imagine if the network layer takes a ref on the > page, but then the original user does a "kfree()". The _page_ would > stay around (we have a ref from it - but so does the slab allocator), > but the thing is, the slab allocator will release and then re-use the > slab entry. > > So the "hold a reference to the page" doesn't actually _help_. The > problem isn't the page going away, it's the smaller slab-allocation > being reused for something else - so the page-level ref would be > useless. > > So page-level references really only do work with page allocators. > They don't know about the allocation patterns within a page that slab > does. Makes sense, thanks for the explanation. Could it still make sense to hand off kmalloc'd memory to tcp_send_pages if we know the kfree won't happen till after the data's sent? OK, maybe in that case someone above the tcp layer just shouldn't be assuming they can know when the tcp layer is done with the data. In this case, we're not kfree()'ing until we've gotten an rpc reply back. But in theory perhaps there could be cases where the server's gotten the data and we've seen the reply but the tcp layer still thinks it needs to retransmit something? I don't think we'd care if the data was still correct in that case, but it could be an information leak if nothing else. --b. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Security] [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 20:07 ` J. Bruce Fields @ 2011-03-04 20:30 ` Jeff Layton 2011-03-04 20:40 ` Trond Myklebust [not found] ` <20110304153059.79374df7-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 0 siblings, 2 replies; 17+ messages in thread From: Jeff Layton @ 2011-03-04 20:30 UTC (permalink / raw) To: J. Bruce Fields Cc: Linus Torvalds, Neil Horman, linux-nfs, Trond Myklebust, security On Fri, 4 Mar 2011 15:07:03 -0500 "J. Bruce Fields" <bfields@fieldses.org> wrote: > On Fri, Mar 04, 2011 at 11:48:23AM -0800, Linus Torvalds wrote: > > On Fri, Mar 4, 2011 at 11:33 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > > > > > > So I assumed the slab allocator would hold a reference to the page like > > > any other user would, in which case the tcp code could take a second > > > reference of its own. > > > > So the reason that wouldn't work is simple: the reference is obviously > > at a page level, but slab doles out allocations on its own level. > > > > What does that mean? Imagine if the network layer takes a ref on the > > page, but then the original user does a "kfree()". The _page_ would > > stay around (we have a ref from it - but so does the slab allocator), > > but the thing is, the slab allocator will release and then re-use the > > slab entry. > > > > So the "hold a reference to the page" doesn't actually _help_. The > > problem isn't the page going away, it's the smaller slab-allocation > > being reused for something else - so the page-level ref would be > > useless. > > > > So page-level references really only do work with page allocators. > > They don't know about the allocation patterns within a page that slab > > does. > > Makes sense, thanks for the explanation. > > Could it still make sense to hand off kmalloc'd memory to tcp_send_pages > if we know the kfree won't happen till after the data's sent? > > OK, maybe in that case someone above the tcp layer just shouldn't be > assuming they can know when the tcp layer is done with the data. > Right, I don't think we can know that since we (at the RPC layer) don't get any real notification of when we receive a TCP ACK. > In this case, we're not kfree()'ing until we've gotten an rpc reply > back. But in theory perhaps there could be cases where the server's > gotten the data and we've seen the reply but the tcp layer still thinks > it needs to retransmit something? I don't think we'd care if the data > was still correct in that case, but it could be an information leak if > nothing else. There's also timeouts + soft mounts to consider. We may send the data on the socket, which gets buffered up and then the caller goes to sleep waiting for a reply. If that never comes (server crashed or something), then we can return an error back up to the VFS layer if it's a soft mount. Meanwhile, the kernel is still trying to send the data on the socket... -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Security] [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 20:30 ` Jeff Layton @ 2011-03-04 20:40 ` Trond Myklebust [not found] ` <20110304153059.79374df7-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 1 sibling, 0 replies; 17+ messages in thread From: Trond Myklebust @ 2011-03-04 20:40 UTC (permalink / raw) To: Jeff Layton Cc: J. Bruce Fields, Linus Torvalds, Neil Horman, linux-nfs, security On Fri, 2011-03-04 at 15:30 -0500, Jeff Layton wrote: > Right, I don't think we can know that since we (at the RPC layer) don't > get any real notification of when we receive a TCP ACK. > > > In this case, we're not kfree()'ing until we've gotten an rpc reply > > back. But in theory perhaps there could be cases where the server's > > gotten the data and we've seen the reply but the tcp layer still thinks > > it needs to retransmit something? I don't think we'd care if the data > > was still correct in that case, but it could be an information leak if > > nothing else. > > There's also timeouts + soft mounts to consider. We may send the data > on the socket, which gets buffered up and then the caller goes to sleep > waiting for a reply. If that never comes (server crashed or something), > then we can return an error back up to the VFS layer if it's a soft > mount. Meanwhile, the kernel is still trying to send the data on the > socket... O_DIRECT writes have a similar problem, BTW. Occasionally we do return control to userland while the pages are still buffered in the socket (either due to the above conditions, or due to the fact that we were retransmitting an RPC request, and we suddenly get a reply to the previous transmission). If the application then scribbles over the page data, the socket may end up transmitting a quite different WRITE payload than what it was supposed to... For that reason, it would be nice to have some way to know when the socket is done referencing the data. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <20110304153059.79374df7-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]
* Re: [Security] [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab [not found] ` <20110304153059.79374df7-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org> @ 2011-03-04 21:04 ` J. Bruce Fields 0 siblings, 0 replies; 17+ messages in thread From: J. Bruce Fields @ 2011-03-04 21:04 UTC (permalink / raw) To: Jeff Layton Cc: Linus Torvalds, Neil Horman, linux-nfs, Trond Myklebust, security On Fri, Mar 04, 2011 at 03:30:59PM -0500, Jeff Layton wrote: > On Fri, 4 Mar 2011 15:07:03 -0500 > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > In this case, we're not kfree()'ing until we've gotten an rpc reply > > back. But in theory perhaps there could be cases where the server's > > gotten the data and we've seen the reply but the tcp layer still thinks > > it needs to retransmit something? I don't think we'd care if the data > > was still correct in that case, but it could be an information leak if > > nothing else. > > There's also timeouts + soft mounts to consider. We may send the data > on the socket, which gets buffered up and then the caller goes to sleep > waiting for a reply. If that never comes (server crashed or something), > then we can return an error back up to the VFS layer if it's a soft > mount. Meanwhile, the kernel is still trying to send the data on the > socket... Good point, I forgot about that. Thanks to everyone for setting me straight! --b. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 16:44 [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab Neil Horman 2011-03-04 16:58 ` Christoph Hellwig 2011-03-04 17:13 ` J. Bruce Fields @ 2011-03-04 19:01 ` Trond Myklebust 2011-03-04 19:17 ` Neil Horman 2 siblings, 1 reply; 17+ messages in thread From: Trond Myklebust @ 2011-03-04 19:01 UTC (permalink / raw) To: Neil Horman; +Cc: linux-nfs, security, Jeff Layton On Fri, 2011-03-04 at 11:44 -0500, Neil Horman wrote: > This oops was reported recently: > > <IRQ> [<ffffffff800ca161>] bad_page+0x69/0x91 > [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144 > [<ffffffff8022f28a>] skb_release_data+0x5f/0x98 > [<ffffffff80028b88>] __kfree_skb+0x11/0x1a > [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868 > [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9 > [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa > [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6 > [<ffffffff80099b45>] do_timer+0x2df/0x52c > [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263 > [<ffffffff80035ab8>] ip_rcv+0x539/0x57c > [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f > [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5 > [<ffffffff8000c979>] net_rx_action+0xac/0x1b3 > [<ffffffff80012464>] __do_softirq+0x89/0x133 > [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 > [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d > [<ffffffff8006d485>] do_IRQ+0xec/0xf5 > [<ffffffff8006bdb5>] default_idle+0x0/0x50 > [<ffffffff8005d615>] ret_from_intr+0x0/0xa > <EOI> [<ffffffff8006bdde>] default_idle+0x29/0x50 > [<ffffffff800492fd>] cpu_idle+0x95/0xb8 > [<ffffffff80461807>] start_kernel+0x220/0x225 > [<ffffffff8046122f>] _sinittext+0x22f/0x236 > > It occurs, because an skb with a fraglist was freed from the tcp retransmit > queue when it was acked, but a page on that fraglist had PG_Slab set (indicating > it was allocated from the Slab allocator (which means the free path above can't > safely free it via put_page) > > We tracked this back to an nfsv4 setacl operation, in which the nfs code > attempted to fill convert the passed in buffer to an array of pages in > __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages. > __nfs4_proc_set_acl just converts each page in the buffer to a page struct via > virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab > bit is set. We can't create a buffer with kmalloc and free it later in the tcp > ack path with put_page, so we need to either: > > 1) ensure that when we create the list of pages, no page struct has PG_Slab set > or > 2) not use a page list to send this data > > Given that these buffers can be multiple pages and arbitrarily sized, I think > (1) is the right way to go. I've written the below patch to allocate a page > from the buddy allocator directly and copy the data over to it. This ensures > that we have a put_page free-able page for every entry that winds up on an skb > frag list, so it can be safely freed when the frame is acked. We do a put page > on each entry after the rpc_call_sync call so as to drop our own reference count > to the page, leaving only the ref count taken by tcp_sendpages. This way the > data will be properly freed when the ack comes in > > Successfully tested by myself to solve the above oops. > > Note, as this is the result of a setacl operation that exceeded a page of data, > I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm > CCing security on this as well. > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > CC: Trond Myklebust <Trond.Myklebust@netapp.com> > CC: security@kernel.org > CC: Jeff Layton <jlayton@redhat.com> > --- > fs/nfs/nfs4proc.c | 42 ++++++++++++++++++++++++++++++++++++++++-- > 1 files changed, 40 insertions(+), 2 deletions(-) > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c > index 78936a8..e4bd9b5 100644 > --- a/fs/nfs/nfs4proc.c > +++ b/fs/nfs/nfs4proc.c > @@ -51,6 +51,7 @@ > #include <linux/sunrpc/bc_xprt.h> > #include <linux/xattr.h> > #include <linux/utsname.h> > +#include <linux/mm.h> > > #include "nfs4_fs.h" > #include "delegation.h" > @@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen, > } > } > > +static int buf_to_pages_noslab(const void *buf, size_t buflen, > + struct page **pages, unsigned int *pgbase) > +{ > + const void *p = buf; > + struct page *page, *newpage, **spages; > + int rc = -ENOMEM; > + > + spages = pages; > + *pgbase = offset_in_page(buf); > + p -= *pgbase; > + while (p < buf + buflen) { > + page = virt_to_page(p); > + newpage = alloc_page(GFP_KERNEL); > + if (!newpage) > + goto unwind; > + memcpy(page_address(newpage), page_address(page), > + PAGE_CACHE_SIZE); Why do we need to keep this byzantian offset_in_page() and virt_to_page() logic in order to copying data from a linear buffer into a set of pages? > + *(pages++) = newpage; > + p += PAGE_CACHE_SIZE; > + } > + > + rc = 0; > + return rc; > + > +unwind: > + while (*spages != NULL) > + __free_page(*(spages++)); > + return rc; > +} > + > struct nfs4_cached_acl { > int cached; > size_t len; > @@ -3420,13 +3451,20 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl > .rpc_argp = &arg, > .rpc_resp = &res, > }; > - int ret; > + int ret, i; > > if (!nfs4_server_supports_acls(server)) > return -EOPNOTSUPP; > + memset(pages, 0, sizeof(struct page *) * NFS4ACL_MAXPAGES); Why is this initialisation needed? Just return the number of pages that were actually allocated in the call below. > + ret = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase); > + if (ret) > + return ret; > nfs_inode_return_delegation(inode); > - buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase); > ret = nfs4_call_sync(server, &msg, &arg, &res, 1); > + > + for (i = 0; pages[i] != NULL; i++) > + put_page(pages[i]); > + > /* > * Acl update can result in inode attribute update. > * so mark the attribute cache invalid. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 19:01 ` Trond Myklebust @ 2011-03-04 19:17 ` Neil Horman 2011-03-04 19:25 ` Trond Myklebust 0 siblings, 1 reply; 17+ messages in thread From: Neil Horman @ 2011-03-04 19:17 UTC (permalink / raw) To: Trond Myklebust; +Cc: linux-nfs, security, Jeff Layton On Fri, Mar 04, 2011 at 02:01:55PM -0500, Trond Myklebust wrote: > On Fri, 2011-03-04 at 11:44 -0500, Neil Horman wrote: > > This oops was reported recently: > > > > <IRQ> [<ffffffff800ca161>] bad_page+0x69/0x91 > > [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144 > > [<ffffffff8022f28a>] skb_release_data+0x5f/0x98 > > [<ffffffff80028b88>] __kfree_skb+0x11/0x1a > > [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868 > > [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9 > > [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa > > [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6 > > [<ffffffff80099b45>] do_timer+0x2df/0x52c > > [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263 > > [<ffffffff80035ab8>] ip_rcv+0x539/0x57c > > [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f > > [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5 > > [<ffffffff8000c979>] net_rx_action+0xac/0x1b3 > > [<ffffffff80012464>] __do_softirq+0x89/0x133 > > [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 > > [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d > > [<ffffffff8006d485>] do_IRQ+0xec/0xf5 > > [<ffffffff8006bdb5>] default_idle+0x0/0x50 > > [<ffffffff8005d615>] ret_from_intr+0x0/0xa > > <EOI> [<ffffffff8006bdde>] default_idle+0x29/0x50 > > [<ffffffff800492fd>] cpu_idle+0x95/0xb8 > > [<ffffffff80461807>] start_kernel+0x220/0x225 > > [<ffffffff8046122f>] _sinittext+0x22f/0x236 > > > > It occurs, because an skb with a fraglist was freed from the tcp retransmit > > queue when it was acked, but a page on that fraglist had PG_Slab set (indicating > > it was allocated from the Slab allocator (which means the free path above can't > > safely free it via put_page) > > > > We tracked this back to an nfsv4 setacl operation, in which the nfs code > > attempted to fill convert the passed in buffer to an array of pages in > > __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages. > > __nfs4_proc_set_acl just converts each page in the buffer to a page struct via > > virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab > > bit is set. We can't create a buffer with kmalloc and free it later in the tcp > > ack path with put_page, so we need to either: > > > > 1) ensure that when we create the list of pages, no page struct has PG_Slab set > > or > > 2) not use a page list to send this data > > > > Given that these buffers can be multiple pages and arbitrarily sized, I think > > (1) is the right way to go. I've written the below patch to allocate a page > > from the buddy allocator directly and copy the data over to it. This ensures > > that we have a put_page free-able page for every entry that winds up on an skb > > frag list, so it can be safely freed when the frame is acked. We do a put page > > on each entry after the rpc_call_sync call so as to drop our own reference count > > to the page, leaving only the ref count taken by tcp_sendpages. This way the > > data will be properly freed when the ack comes in > > > > Successfully tested by myself to solve the above oops. > > > > Note, as this is the result of a setacl operation that exceeded a page of data, > > I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm > > CCing security on this as well. > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > CC: Trond Myklebust <Trond.Myklebust@netapp.com> > > CC: security@kernel.org > > CC: Jeff Layton <jlayton@redhat.com> > > --- > > fs/nfs/nfs4proc.c | 42 ++++++++++++++++++++++++++++++++++++++++-- > > 1 files changed, 40 insertions(+), 2 deletions(-) > > > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c > > index 78936a8..e4bd9b5 100644 > > --- a/fs/nfs/nfs4proc.c > > +++ b/fs/nfs/nfs4proc.c > > @@ -51,6 +51,7 @@ > > #include <linux/sunrpc/bc_xprt.h> > > #include <linux/xattr.h> > > #include <linux/utsname.h> > > +#include <linux/mm.h> > > > > #include "nfs4_fs.h" > > #include "delegation.h" > > @@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen, > > } > > } > > > > +static int buf_to_pages_noslab(const void *buf, size_t buflen, > > + struct page **pages, unsigned int *pgbase) > > +{ > > + const void *p = buf; > > + struct page *page, *newpage, **spages; > > + int rc = -ENOMEM; > > + > > + spages = pages; > > + *pgbase = offset_in_page(buf); > > + p -= *pgbase; > > + while (p < buf + buflen) { > > + page = virt_to_page(p); > > + newpage = alloc_page(GFP_KERNEL); > > + if (!newpage) > > + goto unwind; > > + memcpy(page_address(newpage), page_address(page), > > + PAGE_CACHE_SIZE); > > Why do we need to keep this byzantian offset_in_page() and > virt_to_page() logic in order to copying data from a linear buffer into > a set of pages? > We don't I suppose, but I thought it best to follow the byzantine style of the function immediately above it. :) If you have a better suggestion, I'm listening. > > + *(pages++) = newpage; > > + p += PAGE_CACHE_SIZE; > > + } > > + > > + rc = 0; > > + return rc; > > + > > +unwind: > > + while (*spages != NULL) > > + __free_page(*(spages++)); > > + return rc; > > +} > > + > > struct nfs4_cached_acl { > > int cached; > > size_t len; > > @@ -3420,13 +3451,20 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl > > .rpc_argp = &arg, > > .rpc_resp = &res, > > }; > > - int ret; > > + int ret, i; > > > > if (!nfs4_server_supports_acls(server)) > > return -EOPNOTSUPP; > > + memset(pages, 0, sizeof(struct page *) * NFS4ACL_MAXPAGES); > > Why is this initialisation needed? Just return the number of pages that > were actually allocated in the call below. > Yeah, I suppose I can do that, I'll respin here shortly. Neil ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 19:17 ` Neil Horman @ 2011-03-04 19:25 ` Trond Myklebust 2011-03-04 19:59 ` Neil Horman 2011-03-04 21:09 ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2) Neil Horman 0 siblings, 2 replies; 17+ messages in thread From: Trond Myklebust @ 2011-03-04 19:25 UTC (permalink / raw) To: Neil Horman; +Cc: linux-nfs, security, Jeff Layton On Fri, 2011-03-04 at 14:17 -0500, Neil Horman wrote: > On Fri, Mar 04, 2011 at 02:01:55PM -0500, Trond Myklebust wrote: > > On Fri, 2011-03-04 at 11:44 -0500, Neil Horman wrote: > > > > > > +static int buf_to_pages_noslab(const void *buf, size_t buflen, > > > + struct page **pages, unsigned int *pgbase) > > > +{ > > > + const void *p = buf; > > > + struct page *page, *newpage, **spages; > > > + int rc = -ENOMEM; > > > + > > > + spages = pages; > > > + *pgbase = offset_in_page(buf); > > > + p -= *pgbase; > > > + while (p < buf + buflen) { > > > + page = virt_to_page(p); > > > + newpage = alloc_page(GFP_KERNEL); > > > + if (!newpage) > > > + goto unwind; > > > + memcpy(page_address(newpage), page_address(page), > > > + PAGE_CACHE_SIZE); > > > > Why do we need to keep this byzantian offset_in_page() and > > virt_to_page() logic in order to copying data from a linear buffer into > > a set of pages? > > > We don't I suppose, but I thought it best to follow the byzantine style of the > function immediately above it. :) If you have a better suggestion, I'm > listening. Why isn't something like the following good enough? do { size_t len = min(PAGE_CACHE_SIZE, buflen); struct page *newpage = alloc_page(GFP_KERNEL); if (newpage == NULL) goto unwind; memcpy(page_address(newpage), buf, len); copied += len; buf += len; buflen -= len; } while (buflen != 0); -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab 2011-03-04 19:25 ` Trond Myklebust @ 2011-03-04 19:59 ` Neil Horman 2011-03-04 21:09 ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2) Neil Horman 1 sibling, 0 replies; 17+ messages in thread From: Neil Horman @ 2011-03-04 19:59 UTC (permalink / raw) To: Trond Myklebust; +Cc: linux-nfs, security, Jeff Layton On Fri, Mar 04, 2011 at 02:25:30PM -0500, Trond Myklebust wrote: > On Fri, 2011-03-04 at 14:17 -0500, Neil Horman wrote: > > On Fri, Mar 04, 2011 at 02:01:55PM -0500, Trond Myklebust wrote: > > > On Fri, 2011-03-04 at 11:44 -0500, Neil Horman wrote: > > > > > > > > +static int buf_to_pages_noslab(const void *buf, size_t buflen, > > > > + struct page **pages, unsigned int *pgbase) > > > > +{ > > > > + const void *p = buf; > > > > + struct page *page, *newpage, **spages; > > > > + int rc = -ENOMEM; > > > > + > > > > + spages = pages; > > > > + *pgbase = offset_in_page(buf); > > > > + p -= *pgbase; > > > > + while (p < buf + buflen) { > > > > + page = virt_to_page(p); > > > > + newpage = alloc_page(GFP_KERNEL); > > > > + if (!newpage) > > > > + goto unwind; > > > > + memcpy(page_address(newpage), page_address(page), > > > > + PAGE_CACHE_SIZE); > > > > > > Why do we need to keep this byzantian offset_in_page() and > > > virt_to_page() logic in order to copying data from a linear buffer into > > > a set of pages? > > > > > We don't I suppose, but I thought it best to follow the byzantine style of the > > function immediately above it. :) If you have a better suggestion, I'm > > listening. > > Why isn't something like the following good enough? > > do { > size_t len = min(PAGE_CACHE_SIZE, buflen); > struct page *newpage = alloc_page(GFP_KERNEL); > > if (newpage == NULL) > goto unwind; > memcpy(page_address(newpage), buf, len); > copied += len; > buf += len; > buflen -= len; > } while (buflen != 0); > Ah, I see what your saying - you're suggesting that we copy from an offset in the page to the start of the page and always make the resulting pgbase be zero. That makes sense. I'll respin it that way. Neil > > > > -- > Trond Myklebust > Linux NFS client maintainer > > NetApp > Trond.Myklebust@netapp.com > www.netapp.com > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2) 2011-03-04 19:25 ` Trond Myklebust 2011-03-04 19:59 ` Neil Horman @ 2011-03-04 21:09 ` Neil Horman 2011-03-04 21:25 ` Trond Myklebust 1 sibling, 1 reply; 17+ messages in thread From: Neil Horman @ 2011-03-04 21:09 UTC (permalink / raw) To: linux-nfs; +Cc: Neil Horman, Trond Myklebust, security, Jeff Layton This oops was reported recently: <IRQ> [<ffffffff800ca161>] bad_page+0x69/0x91 [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144 [<ffffffff8022f28a>] skb_release_data+0x5f/0x98 [<ffffffff80028b88>] __kfree_skb+0x11/0x1a [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868 [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9 [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6 [<ffffffff80099b45>] do_timer+0x2df/0x52c [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263 [<ffffffff80035ab8>] ip_rcv+0x539/0x57c [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5 [<ffffffff8000c979>] net_rx_action+0xac/0x1b3 [<ffffffff80012464>] __do_softirq+0x89/0x133 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d [<ffffffff8006d485>] do_IRQ+0xec/0xf5 [<ffffffff8006bdb5>] default_idle+0x0/0x50 [<ffffffff8005d615>] ret_from_intr+0x0/0xa <EOI> [<ffffffff8006bdde>] default_idle+0x29/0x50 [<ffffffff800492fd>] cpu_idle+0x95/0xb8 [<ffffffff80461807>] start_kernel+0x220/0x225 [<ffffffff8046122f>] _sinittext+0x22f/0x236 It occurs, because an skb with a fraglist was freed from the tcp retransmit queue when it was acked, but a page on that fraglist had PG_Slab set (indicating it was allocated from the Slab allocator (which means the free path above can't safely free it via put_page) We tracked this back to an nfsv4 setacl operation, in which the nfs code attempted to fill convert the passed in buffer to an array of pages in __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages. __nfs4_proc_set_acl just converts each page in the buffer to a page struct via virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab bit is set. We can't create a buffer with kmalloc and free it later in the tcp ack path with put_page, so we need to either: 1) ensure that when we create the list of pages, no page struct has PG_Slab set or 2) not use a page list to send this data Given that these buffers can be multiple pages and arbitrarily sized, I think (1) is the right way to go. I've written the below patch to allocate a page from the buddy allocator directly and copy the data over to it. This ensures that we have a put_page free-able page for every entry that winds up on an skb frag list, so it can be safely freed when the frame is acked. We do a put page on each entry after the rpc_call_sync call so as to drop our own reference count to the page, leaving only the ref count taken by tcp_sendpages. This way the data will be properly freed when the ack comes in Successfully tested by myself to solve the above oops. Note, as this is the result of a setacl operation that exceeded a page of data, I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm CCing security on this as well. Change notes: Updated to reflect Tronds requests, specifically that buf_to_pages_noslab just return the number of pages copied to avoid the extra memset. Also made the copy operation a bit more efficient and removed one of the virt_to_page calls. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Trond Myklebust <Trond.Myklebust@netapp.com> CC: security@kernel.org CC: Jeff Layton <jlayton@redhat.com> --- fs/nfs/nfs4proc.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- 1 files changed, 43 insertions(+), 2 deletions(-) diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index 78936a8..6e906af 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -51,6 +51,7 @@ #include <linux/sunrpc/bc_xprt.h> #include <linux/xattr.h> #include <linux/utsname.h> +#include <linux/mm.h> #include "nfs4_fs.h" #include "delegation.h" @@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen, } } +static int buf_to_pages_noslab(const void *buf, size_t buflen, + struct page **pages, unsigned int *pgbase) +{ + struct page *newpage, **spages; + int rc = 0; + size_t len; + spages = pages; + + do { + len = min(PAGE_CACHE_SIZE, buflen); + newpage = alloc_page(GFP_KERNEL); + + if (newpage == NULL) + goto unwind; + memcpy(page_address(newpage), buf, len); + buf += len; + buflen -= len; + *(pages++) = newpage; + rc++; + } while (buflen != 0); + + *pgbase = 0; + return rc; + +unwind: + for(; rc > 0; rc--) + __free_page(spages[rc-1]); + return -ENOMEM; +} + struct nfs4_cached_acl { int cached; size_t len; @@ -3420,13 +3451,23 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl .rpc_argp = &arg, .rpc_resp = &res, }; - int ret; + int ret, i; if (!nfs4_server_supports_acls(server)) return -EOPNOTSUPP; + i = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase); + if (i < 0) + return i; nfs_inode_return_delegation(inode); - buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase); ret = nfs4_call_sync(server, &msg, &arg, &res, 1); + + /* + * Free each page after tx, so the only ref left is + * held by the network stack + */ + for (; i > 0; i--) + put_page(pages[i-1]); + /* * Acl update can result in inode attribute update. * so mark the attribute cache invalid. -- 1.7.4 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2) 2011-03-04 21:09 ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2) Neil Horman @ 2011-03-04 21:25 ` Trond Myklebust 2011-03-05 0:26 ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) Neil Horman 0 siblings, 1 reply; 17+ messages in thread From: Trond Myklebust @ 2011-03-04 21:25 UTC (permalink / raw) To: Neil Horman; +Cc: linux-nfs, security, Jeff Layton On Fri, 2011-03-04 at 16:09 -0500, Neil Horman wrote: > +static int buf_to_pages_noslab(const void *buf, size_t buflen, > + struct page **pages, unsigned int *pgbase) > +{ > + struct page *newpage, **spages; > + int rc = 0; > + size_t len; > + spages = pages; > + > + do { > + len = min(PAGE_CACHE_SIZE, buflen); > + newpage = alloc_page(GFP_KERNEL); > + > + if (newpage == NULL) > + goto unwind; > + memcpy(page_address(newpage), buf, len); > + buf += len; > + buflen -= len; > + *(pages++) = newpage; Nit: the brackets aren't needed here. > + rc++; > + } while (buflen != 0); > + > + *pgbase = 0; > + return rc; > + > +unwind: > + for(; rc > 0; rc--) > + __free_page(spages[rc-1]); > + return -ENOMEM; > +} > + > struct nfs4_cached_acl { > int cached; > size_t len; > @@ -3420,13 +3451,23 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl > .rpc_argp = &arg, > .rpc_resp = &res, > }; > - int ret; > + int ret, i; > > if (!nfs4_server_supports_acls(server)) > return -EOPNOTSUPP; > + i = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase); NIT: arg.acl_pgbase is already initialised to 0. > + if (i < 0) > + return i; > nfs_inode_return_delegation(inode); > - buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase); > ret = nfs4_call_sync(server, &msg, &arg, &res, 1); > + > + /* > + * Free each page after tx, so the only ref left is > + * held by the network stack > + */ > + for (; i > 0; i--) > + put_page(pages[i-1]); > + > /* > * Acl update can result in inode attribute update. > * so mark the attribute cache invalid. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) 2011-03-04 21:25 ` Trond Myklebust @ 2011-03-05 0:26 ` Neil Horman 0 siblings, 0 replies; 17+ messages in thread From: Neil Horman @ 2011-03-05 0:26 UTC (permalink / raw) To: linux-nfs; +Cc: Neil Horman, Trond Myklebust, security, Jeff Layton This oops was reported recently: <IRQ> [<ffffffff800ca161>] bad_page+0x69/0x91 [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144 [<ffffffff8022f28a>] skb_release_data+0x5f/0x98 [<ffffffff80028b88>] __kfree_skb+0x11/0x1a [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868 [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9 [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6 [<ffffffff80099b45>] do_timer+0x2df/0x52c [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263 [<ffffffff80035ab8>] ip_rcv+0x539/0x57c [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5 [<ffffffff8000c979>] net_rx_action+0xac/0x1b3 [<ffffffff80012464>] __do_softirq+0x89/0x133 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d [<ffffffff8006d485>] do_IRQ+0xec/0xf5 [<ffffffff8006bdb5>] default_idle+0x0/0x50 [<ffffffff8005d615>] ret_from_intr+0x0/0xa <EOI> [<ffffffff8006bdde>] default_idle+0x29/0x50 [<ffffffff800492fd>] cpu_idle+0x95/0xb8 [<ffffffff80461807>] start_kernel+0x220/0x225 [<ffffffff8046122f>] _sinittext+0x22f/0x236 It occurs, because an skb with a fraglist was freed from the tcp retransmit queue when it was acked, but a page on that fraglist had PG_Slab set (indicating it was allocated from the Slab allocator (which means the free path above can't safely free it via put_page) We tracked this back to an nfsv4 setacl operation, in which the nfs code attempted to fill convert the passed in buffer to an array of pages in __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages. __nfs4_proc_set_acl just converts each page in the buffer to a page struct via virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab bit is set. We can't create a buffer with kmalloc and free it later in the tcp ack path with put_page, so we need to either: 1) ensure that when we create the list of pages, no page struct has PG_Slab set or 2) not use a page list to send this data Given that these buffers can be multiple pages and arbitrarily sized, I think (1) is the right way to go. I've written the below patch to allocate a page from the buddy allocator directly and copy the data over to it. This ensures that we have a put_page free-able page for every entry that winds up on an skb frag list, so it can be safely freed when the frame is acked. We do a put page on each entry after the rpc_call_sync call so as to drop our own reference count to the page, leaving only the ref count taken by tcp_sendpages. This way the data will be properly freed when the ack comes in Successfully tested by myself to solve the above oops. Note, as this is the result of a setacl operation that exceeded a page of data, I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm CCing security on this as well. Change notes: v2: Updated to reflect Tronds requests, specifically that buf_to_pages_noslab just return the number of pages copied to avoid the extra memset. Also made the copy operation a bit more efficient and removed one of the virt_to_page calls. v3: Fixed nits (brackets around *pages++ where none are needed and duplicate initalization of acl_pgbase variable) Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Trond Myklebust <Trond.Myklebust@netapp.com> CC: security@kernel.org CC: Jeff Layton <jlayton@redhat.com> --- fs/nfs/nfs4proc.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- 1 files changed, 42 insertions(+), 2 deletions(-) diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index 78936a8..5368c79 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -51,6 +51,7 @@ #include <linux/sunrpc/bc_xprt.h> #include <linux/xattr.h> #include <linux/utsname.h> +#include <linux/mm.h> #include "nfs4_fs.h" #include "delegation.h" @@ -3252,6 +3253,35 @@ static void buf_to_pages(const void *buf, size_t buflen, } } +static int buf_to_pages_noslab(const void *buf, size_t buflen, + struct page **pages, unsigned int *pgbase) +{ + struct page *newpage, **spages; + int rc = 0; + size_t len; + spages = pages; + + do { + len = min(PAGE_CACHE_SIZE, buflen); + newpage = alloc_page(GFP_KERNEL); + + if (newpage == NULL) + goto unwind; + memcpy(page_address(newpage), buf, len); + buf += len; + buflen -= len; + *pages++ = newpage; + rc++; + } while (buflen != 0); + + return rc; + +unwind: + for(; rc > 0; rc--) + __free_page(spages[rc-1]); + return -ENOMEM; +} + struct nfs4_cached_acl { int cached; size_t len; @@ -3420,13 +3450,23 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl .rpc_argp = &arg, .rpc_resp = &res, }; - int ret; + int ret, i; if (!nfs4_server_supports_acls(server)) return -EOPNOTSUPP; + i = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase); + if (i < 0) + return i; nfs_inode_return_delegation(inode); - buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase); ret = nfs4_call_sync(server, &msg, &arg, &res, 1); + + /* + * Free each page after tx, so the only ref left is + * held by the network stack + */ + for (; i > 0; i--) + put_page(pages[i-1]); + /* * Acl update can result in inode attribute update. * so mark the attribute cache invalid. -- 1.7.4 ^ permalink raw reply related [flat|nested] 17+ messages in thread
end of thread, other threads:[~2011-03-05 0:26 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-04 16:44 [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab Neil Horman
2011-03-04 16:58 ` Christoph Hellwig
2011-03-04 17:13 ` J. Bruce Fields
2011-03-04 18:45 ` Neil Horman
2011-03-04 19:33 ` J. Bruce Fields
2011-03-04 19:48 ` [Security] " Linus Torvalds
2011-03-04 20:07 ` J. Bruce Fields
2011-03-04 20:30 ` Jeff Layton
2011-03-04 20:40 ` Trond Myklebust
[not found] ` <20110304153059.79374df7-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-03-04 21:04 ` J. Bruce Fields
2011-03-04 19:01 ` Trond Myklebust
2011-03-04 19:17 ` Neil Horman
2011-03-04 19:25 ` Trond Myklebust
2011-03-04 19:59 ` Neil Horman
2011-03-04 21:09 ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2) Neil Horman
2011-03-04 21:25 ` Trond Myklebust
2011-03-05 0:26 ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) Neil Horman
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).