[PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab

linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
@ 2011-03-04 16:44 Neil Horman
  2011-03-04 16:58 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Neil Horman @ 2011-03-04 16:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: Neil Horman, Trond Myklebust, security, Jeff Layton

This oops was reported recently:

 <IRQ>  [<ffffffff800ca161>] bad_page+0x69/0x91
  [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144
  [<ffffffff8022f28a>] skb_release_data+0x5f/0x98
  [<ffffffff80028b88>] __kfree_skb+0x11/0x1a
  [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868
  [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9
  [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa
  [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6
  [<ffffffff80099b45>] do_timer+0x2df/0x52c
  [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263
  [<ffffffff80035ab8>] ip_rcv+0x539/0x57c
  [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f
  [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5
  [<ffffffff8000c979>] net_rx_action+0xac/0x1b3
  [<ffffffff80012464>] __do_softirq+0x89/0x133
  [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
  [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
  [<ffffffff8006d485>] do_IRQ+0xec/0xf5
  [<ffffffff8006bdb5>] default_idle+0x0/0x50
  [<ffffffff8005d615>] ret_from_intr+0x0/0xa
  <EOI>  [<ffffffff8006bdde>] default_idle+0x29/0x50
  [<ffffffff800492fd>] cpu_idle+0x95/0xb8
  [<ffffffff80461807>] start_kernel+0x220/0x225
  [<ffffffff8046122f>] _sinittext+0x22f/0x236

It occurs, because an skb with a fraglist was freed from the tcp retransmit
queue when it was acked, but a page on that fraglist had PG_Slab set (indicating
it was allocated from the Slab allocator (which means the free path above can't
safely free it via put_page)

We tracked this back to an nfsv4 setacl operation, in which the nfs code
attempted to fill convert the passed in buffer to an array of pages in
__nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages.
__nfs4_proc_set_acl just converts each page in the buffer to a page struct via
virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab
bit is set.  We can't create a buffer with kmalloc and free it later in the tcp
ack path with put_page, so we need to either:

1) ensure that when we create the list of pages, no page struct has PG_Slab set
or
2) not use a page list to send this data

Given that these buffers can be multiple pages and arbitrarily sized, I think
(1) is the right way to go.  I've written the below patch to allocate a page
from the buddy allocator directly and copy the data over to it.  This ensures
that we have a put_page free-able page for every entry that winds up on an skb
frag list, so it can be safely freed when the frame is acked.  We do a put page
on each entry after the rpc_call_sync call so as to drop our own reference count
to the page, leaving only the ref count taken by tcp_sendpages.  This way the
data will be properly freed when the ack comes in

Successfully tested by myself to solve the above oops.

Note, as this is the result of a setacl operation that exceeded a page of data,
I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm
CCing security on this as well.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: security@kernel.org
CC: Jeff Layton <jlayton@redhat.com>
---
 fs/nfs/nfs4proc.c |   42 ++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 78936a8..e4bd9b5 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -51,6 +51,7 @@
 #include <linux/sunrpc/bc_xprt.h>
 #include <linux/xattr.h>
 #include <linux/utsname.h>
+#include <linux/mm.h>
 
 #include "nfs4_fs.h"
 #include "delegation.h"
@@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen,
 	}
 }
 
+static int buf_to_pages_noslab(const void *buf, size_t buflen,
+		struct page **pages, unsigned int *pgbase)
+{
+	const void *p = buf;
+	struct page *page, *newpage, **spages;
+	int rc = -ENOMEM;
+
+	spages = pages;
+	*pgbase = offset_in_page(buf);
+	p -= *pgbase;
+	while (p < buf + buflen) {
+		page = virt_to_page(p);
+		newpage = alloc_page(GFP_KERNEL);
+		if (!newpage)
+			goto unwind;
+		memcpy(page_address(newpage), page_address(page),
+		       PAGE_CACHE_SIZE);
+		*(pages++) = newpage;
+		p += PAGE_CACHE_SIZE;
+	}
+
+	rc = 0;
+	return rc;
+
+unwind:
+	while (*spages != NULL)
+		__free_page(*(spages++));
+	return rc;
+}
+
 struct nfs4_cached_acl {
 	int cached;
 	size_t len;
@@ -3420,13 +3451,20 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl
 		.rpc_argp	= &arg,
 		.rpc_resp	= &res,
 	};
-	int ret;
+	int ret, i;
 
 	if (!nfs4_server_supports_acls(server))
 		return -EOPNOTSUPP;
+	memset(pages, 0, sizeof(struct page *) * NFS4ACL_MAXPAGES);
+	ret = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
+	if (ret)
+		return ret;
 	nfs_inode_return_delegation(inode);
-	buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
 	ret = nfs4_call_sync(server, &msg, &arg, &res, 1);
+
+	for (i = 0; pages[i] != NULL; i++)
+		put_page(pages[i]);
+
 	/*
 	 * Acl update can result in inode attribute update.
 	 * so mark the attribute cache invalid.
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 16:44 [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab Neil Horman
@ 2011-03-04 16:58 ` Christoph Hellwig
  2011-03-04 17:13 ` J. Bruce Fields
  2011-03-04 19:01 ` Trond Myklebust
  2 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2011-03-04 16:58 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-nfs, Trond Myklebust, security, Jeff Layton, linux-fsdevel

> It occurs, because an skb with a fraglist was freed from the tcp retransmit
> queue when it was acked, but a page on that fraglist had PG_Slab set (indicating
> it was allocated from the Slab allocator (which means the free path above can't
> safely free it via put_page)
> 
> We tracked this back to an nfsv4 setacl operation, in which the nfs code
> attempted to fill convert the passed in buffer to an array of pages in
> __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages.
> __nfs4_proc_set_acl just converts each page in the buffer to a page struct via
> virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab
> bit is set.  We can't create a buffer with kmalloc and free it later in the tcp
> ack path with put_page, so we need to either:

Note that the same thing can also happen with ext4 sending kmalloc'ed
buffers through network based block storage.  Long time ago we had the
same issue with XFS and went to great length to avoid sending down
kmalloc'ed pages, but recently we (fs developers) were told that's just
fine.  Consequently ext4 now actually uses slab pages for I/O, and we
are planning to make use of that fact soon again in XFS, too.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 16:44 [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab Neil Horman
  2011-03-04 16:58 ` Christoph Hellwig
@ 2011-03-04 17:13 ` J. Bruce Fields
  2011-03-04 18:45   ` Neil Horman
  2011-03-04 19:01 ` Trond Myklebust
  2 siblings, 1 reply; 17+ messages in thread
From: J. Bruce Fields @ 2011-03-04 17:13 UTC (permalink / raw)
  To: Neil Horman; +Cc: linux-nfs, Trond Myklebust, security, Jeff Layton

On Fri, Mar 04, 2011 at 11:44:13AM -0500, Neil Horman wrote:
> This oops was reported recently:
> 
>  <IRQ>  [<ffffffff800ca161>] bad_page+0x69/0x91
>   [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144
>   [<ffffffff8022f28a>] skb_release_data+0x5f/0x98
>   [<ffffffff80028b88>] __kfree_skb+0x11/0x1a
>   [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868
>   [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9
>   [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa
>   [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6
>   [<ffffffff80099b45>] do_timer+0x2df/0x52c
>   [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263
>   [<ffffffff80035ab8>] ip_rcv+0x539/0x57c
>   [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f
>   [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5
>   [<ffffffff8000c979>] net_rx_action+0xac/0x1b3
>   [<ffffffff80012464>] __do_softirq+0x89/0x133
>   [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
>   [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
>   [<ffffffff8006d485>] do_IRQ+0xec/0xf5
>   [<ffffffff8006bdb5>] default_idle+0x0/0x50
>   [<ffffffff8005d615>] ret_from_intr+0x0/0xa
>   <EOI>  [<ffffffff8006bdde>] default_idle+0x29/0x50
>   [<ffffffff800492fd>] cpu_idle+0x95/0xb8
>   [<ffffffff80461807>] start_kernel+0x220/0x225
>   [<ffffffff8046122f>] _sinittext+0x22f/0x236
> 
> It occurs, because an skb with a fraglist was freed from the tcp retransmit
> queue when it was acked, but a page on that fraglist had PG_Slab set (indicating
> it was allocated from the Slab allocator (which means the free path above can't
> safely free it via put_page)
> 
> We tracked this back to an nfsv4 setacl operation, in which the nfs code
> attempted to fill convert the passed in buffer to an array of pages in
> __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages.
> __nfs4_proc_set_acl just converts each page in the buffer to a page struct via
> virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab
> bit is set.

My code originally, I think, so apologies, and thanks for the fix!

> We can't create a buffer with kmalloc and free it later in the tcp
> ack path with put_page, so we need to either:

But out of curiosity: why is there this rule?

> 1) ensure that when we create the list of pages, no page struct has PG_Slab set
> or
> 2) not use a page list to send this data
> 
> Given that these buffers can be multiple pages and arbitrarily sized, I think
> (1) is the right way to go.  I've written the below patch to allocate a page
> from the buddy allocator directly and copy the data over to it.  This ensures
> that we have a put_page free-able page for every entry that winds up on an skb
> frag list, so it can be safely freed when the frame is acked.  We do a put page
> on each entry after the rpc_call_sync call so as to drop our own reference count
> to the page, leaving only the ref count taken by tcp_sendpages.  This way the
> data will be properly freed when the ack comes in

Seems fine to me.

There's one thing I still don't understand: why was this only triggered
for large ACLs?  The setacl doesn't doesn't treat the first 4k specially
that I can see: even a 1-byte acl would get translated into a page.  Is
it that the VFS creates the buffer it hands down differently in that
case?

--b.

> Successfully tested by myself to solve the above oops.
> 
> Note, as this is the result of a setacl operation that exceeded a page of data,
> I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm
> CCing security on this as well.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> CC: security@kernel.org
> CC: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/nfs/nfs4proc.c |   42 ++++++++++++++++++++++++++++++++++++++++--
>  1 files changed, 40 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 78936a8..e4bd9b5 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -51,6 +51,7 @@
>  #include <linux/sunrpc/bc_xprt.h>
>  #include <linux/xattr.h>
>  #include <linux/utsname.h>
> +#include <linux/mm.h>
>  
>  #include "nfs4_fs.h"
>  #include "delegation.h"
> @@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen,
>  	}
>  }
>  
> +static int buf_to_pages_noslab(const void *buf, size_t buflen,
> +		struct page **pages, unsigned int *pgbase)
> +{
> +	const void *p = buf;
> +	struct page *page, *newpage, **spages;
> +	int rc = -ENOMEM;
> +
> +	spages = pages;
> +	*pgbase = offset_in_page(buf);
> +	p -= *pgbase;
> +	while (p < buf + buflen) {
> +		page = virt_to_page(p);
> +		newpage = alloc_page(GFP_KERNEL);
> +		if (!newpage)
> +			goto unwind;
> +		memcpy(page_address(newpage), page_address(page),
> +		       PAGE_CACHE_SIZE);
> +		*(pages++) = newpage;
> +		p += PAGE_CACHE_SIZE;
> +	}
> +
> +	rc = 0;
> +	return rc;
> +
> +unwind:
> +	while (*spages != NULL)
> +		__free_page(*(spages++));
> +	return rc;
> +}
> +
>  struct nfs4_cached_acl {
>  	int cached;
>  	size_t len;
> @@ -3420,13 +3451,20 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl
>  		.rpc_argp	= &arg,
>  		.rpc_resp	= &res,
>  	};
> -	int ret;
> +	int ret, i;
>  
>  	if (!nfs4_server_supports_acls(server))
>  		return -EOPNOTSUPP;
> +	memset(pages, 0, sizeof(struct page *) * NFS4ACL_MAXPAGES);
> +	ret = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
> +	if (ret)
> +		return ret;
>  	nfs_inode_return_delegation(inode);
> -	buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
>  	ret = nfs4_call_sync(server, &msg, &arg, &res, 1);
> +
> +	for (i = 0; pages[i] != NULL; i++)
> +		put_page(pages[i]);
> +
>  	/*
>  	 * Acl update can result in inode attribute update.
>  	 * so mark the attribute cache invalid.
> -- 
> 1.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 17:13 ` J. Bruce Fields
@ 2011-03-04 18:45   ` Neil Horman
  2011-03-04 19:33     ` J. Bruce Fields
  0 siblings, 1 reply; 17+ messages in thread
From: Neil Horman @ 2011-03-04 18:45 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs, Trond Myklebust, security, Jeff Layton

On Fri, Mar 04, 2011 at 12:13:21PM -0500, J. Bruce Fields wrote:
> On Fri, Mar 04, 2011 at 11:44:13AM -0500, Neil Horman wrote:
> > This oops was reported recently:
> > 
> >  <IRQ>  [<ffffffff800ca161>] bad_page+0x69/0x91
> >   [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144
> >   [<ffffffff8022f28a>] skb_release_data+0x5f/0x98
> >   [<ffffffff80028b88>] __kfree_skb+0x11/0x1a
> >   [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868
> >   [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9
> >   [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa
> >   [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6
> >   [<ffffffff80099b45>] do_timer+0x2df/0x52c
> >   [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263
> >   [<ffffffff80035ab8>] ip_rcv+0x539/0x57c
> >   [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f
> >   [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5
> >   [<ffffffff8000c979>] net_rx_action+0xac/0x1b3
> >   [<ffffffff80012464>] __do_softirq+0x89/0x133
> >   [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
> >   [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
> >   [<ffffffff8006d485>] do_IRQ+0xec/0xf5
> >   [<ffffffff8006bdb5>] default_idle+0x0/0x50
> >   [<ffffffff8005d615>] ret_from_intr+0x0/0xa
> >   <EOI>  [<ffffffff8006bdde>] default_idle+0x29/0x50
> >   [<ffffffff800492fd>] cpu_idle+0x95/0xb8
> >   [<ffffffff80461807>] start_kernel+0x220/0x225
> >   [<ffffffff8046122f>] _sinittext+0x22f/0x236
> > 
> > It occurs, because an skb with a fraglist was freed from the tcp retransmit
> > queue when it was acked, but a page on that fraglist had PG_Slab set (indicating
> > it was allocated from the Slab allocator (which means the free path above can't
> > safely free it via put_page)
> > 
> > We tracked this back to an nfsv4 setacl operation, in which the nfs code
> > attempted to fill convert the passed in buffer to an array of pages in
> > __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages.
> > __nfs4_proc_set_acl just converts each page in the buffer to a page struct via
> > virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab
> > bit is set.
> 
> My code originally, I think, so apologies, and thanks for the fix!
> 

No problem :)

> > We can't create a buffer with kmalloc and free it later in the tcp
> > ack path with put_page, so we need to either:
> 
> But out of curiosity: why is there this rule?
> 
Its an artifact the results from needing to free memory using a method in
keeping with the way in which it was allocated.  To use this bug as an example,
the acl data was allocated by the VFS using kmalloc, which gets data from the
slab.  Even though this data is a size that is a multiple of a page, slab
objects can be less than a page, and multiple objects can be stored in a single
page.  As such, anything allocated from the slab allocator needs to be freed by
the slab allocator, so that object reference counts internally maintained by the
slab can be kept accurate.  

Couple that with the fact that, once you pass this buffer to the network stack,
its the network stack thats responsible for it, particularly for freeing it once
its been transmitted.  Given that the network stack wasn't responsible for
allocating it, but still needs to know how to free it, there must be some
convention used to help the stack know how to return it to the heap.  In this
case the convention is that 'pages placed in skb->frags must have been allocated
by the buddy allocator'.  In guaranteeing that, the network stack can safely use
put_page to free the pages later on when the data has finished transmitting.
Thats why we have the PG_Slab check in the free path on the stack trace above.
If we just called put_page on all these pages, and one or more of the pages were
from the slab allocator, we would get all sorts of corruption, because we might
return a page that contains multiple slab objects to the buddy allocator while
the slab was still in use, and the slab allocator would loose the tracking data
on those objects, leading to further corruption.  so we do this PG_Slab check in
free_pages_check to ensure that we're not freeing something using a method other
than the one used to allocate it.

> > 1) ensure that when we create the list of pages, no page struct has PG_Slab set
> > or
> > 2) not use a page list to send this data
> > 
> > Given that these buffers can be multiple pages and arbitrarily sized, I think
> > (1) is the right way to go.  I've written the below patch to allocate a page
> > from the buddy allocator directly and copy the data over to it.  This ensures
> > that we have a put_page free-able page for every entry that winds up on an skb
> > frag list, so it can be safely freed when the frame is acked.  We do a put page
> > on each entry after the rpc_call_sync call so as to drop our own reference count
> > to the page, leaving only the ref count taken by tcp_sendpages.  This way the
> > data will be properly freed when the ack comes in
> 
> Seems fine to me.
> 
> There's one thing I still don't understand: why was this only triggered
> for large ACLs?  The setacl doesn't doesn't treat the first 4k specially
> that I can see: even a 1-byte acl would get translated into a page.  Is
> it that the VFS creates the buffer it hands down differently in that
> case?
> 
Thats an artifact of tcp_sendpages (the proto specific function called by
kernel_sendpages).  If you send less then a page of data via that function, it
doesn't use the fraglist to store data, it just inlines the page into the skb.
Neil

> --b.
> 
> > Successfully tested by myself to solve the above oops.
> > 
> > Note, as this is the result of a setacl operation that exceeded a page of data,
> > I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm
> > CCing security on this as well.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> > CC: security@kernel.org
> > CC: Jeff Layton <jlayton@redhat.com>
> > ---
> >  fs/nfs/nfs4proc.c |   42 ++++++++++++++++++++++++++++++++++++++++--
> >  1 files changed, 40 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> > index 78936a8..e4bd9b5 100644
> > --- a/fs/nfs/nfs4proc.c
> > +++ b/fs/nfs/nfs4proc.c
> > @@ -51,6 +51,7 @@
> >  #include <linux/sunrpc/bc_xprt.h>
> >  #include <linux/xattr.h>
> >  #include <linux/utsname.h>
> > +#include <linux/mm.h>
> >  
> >  #include "nfs4_fs.h"
> >  #include "delegation.h"
> > @@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen,
> >  	}
> >  }
> >  
> > +static int buf_to_pages_noslab(const void *buf, size_t buflen,
> > +		struct page **pages, unsigned int *pgbase)
> > +{
> > +	const void *p = buf;
> > +	struct page *page, *newpage, **spages;
> > +	int rc = -ENOMEM;
> > +
> > +	spages = pages;
> > +	*pgbase = offset_in_page(buf);
> > +	p -= *pgbase;
> > +	while (p < buf + buflen) {
> > +		page = virt_to_page(p);
> > +		newpage = alloc_page(GFP_KERNEL);
> > +		if (!newpage)
> > +			goto unwind;
> > +		memcpy(page_address(newpage), page_address(page),
> > +		       PAGE_CACHE_SIZE);
> > +		*(pages++) = newpage;
> > +		p += PAGE_CACHE_SIZE;
> > +	}
> > +
> > +	rc = 0;
> > +	return rc;
> > +
> > +unwind:
> > +	while (*spages != NULL)
> > +		__free_page(*(spages++));
> > +	return rc;
> > +}
> > +
> >  struct nfs4_cached_acl {
> >  	int cached;
> >  	size_t len;
> > @@ -3420,13 +3451,20 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl
> >  		.rpc_argp	= &arg,
> >  		.rpc_resp	= &res,
> >  	};
> > -	int ret;
> > +	int ret, i;
> >  
> >  	if (!nfs4_server_supports_acls(server))
> >  		return -EOPNOTSUPP;
> > +	memset(pages, 0, sizeof(struct page *) * NFS4ACL_MAXPAGES);
> > +	ret = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
> > +	if (ret)
> > +		return ret;
> >  	nfs_inode_return_delegation(inode);
> > -	buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
> >  	ret = nfs4_call_sync(server, &msg, &arg, &res, 1);
> > +
> > +	for (i = 0; pages[i] != NULL; i++)
> > +		put_page(pages[i]);
> > +
> >  	/*
> >  	 * Acl update can result in inode attribute update.
> >  	 * so mark the attribute cache invalid.
> > -- 
> > 1.7.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 18:45   ` Neil Horman
@ 2011-03-04 19:33     ` J. Bruce Fields
  2011-03-04 19:48       ` [Security] " Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: J. Bruce Fields @ 2011-03-04 19:33 UTC (permalink / raw)
  To: Neil Horman; +Cc: linux-nfs, Trond Myklebust, security, Jeff Layton

On Fri, Mar 04, 2011 at 01:45:06PM -0500, Neil Horman wrote:
> On Fri, Mar 04, 2011 at 12:13:21PM -0500, J. Bruce Fields wrote:
> > On Fri, Mar 04, 2011 at 11:44:13AM -0500, Neil Horman wrote:
> > > We can't create a buffer with kmalloc and free it later in the tcp
> > > ack path with put_page, so we need to either:
> > 
> > But out of curiosity: why is there this rule?
> > 
> Its an artifact the results from needing to free memory using a method
> in keeping with the way in which it was allocated.  To use this bug as
> an example, the acl data was allocated by the VFS using kmalloc, which
> gets data from the slab.  Even though this data is a size that is a
> multiple of a page, slab objects can be less than a page, and multiple
> objects can be stored in a single page.  As such, anything allocated
> from the slab allocator needs to be freed by the slab allocator, so
> that object reference counts internally maintained by the slab can be
> kept accurate.  

OK.  I guess my naive mental model was that the slab allocator was
layered on top of the page allocator--so it got pages with alloc_pages()
or equivalent, then handed out pieces of them to people using the slab
allocator.

So I assumed the slab allocator would hold a reference to the page like
any other user would, in which case the tcp code could take a second
reference of its own.

--b.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Security] [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 19:33     ` J. Bruce Fields
@ 2011-03-04 19:48       ` Linus Torvalds
  2011-03-04 20:07         ` J. Bruce Fields
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2011-03-04 19:48 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Neil Horman, linux-nfs, Trond Myklebust, security, Jeff Layton

On Fri, Mar 4, 2011 at 11:33 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>
> So I assumed the slab allocator would hold a reference to the page like
> any other user would, in which case the tcp code could take a second
> reference of its own.

So the reason that wouldn't work is simple: the reference is obviously
at a page level, but slab doles out allocations on its own level.

What does that mean? Imagine if the network layer takes a ref on the
page, but then the original user does a "kfree()". The _page_ would
stay around (we have a ref from it - but so does the slab allocator),
but the thing is, the slab allocator will release and then re-use the
slab entry.

So the "hold a reference to the page" doesn't actually _help_. The
problem isn't the page going away, it's the smaller slab-allocation
being reused for something else - so the page-level ref would be
useless.

So page-level references really only do work with page allocators.
They don't know about the allocation patterns within a page that slab
does.

                     Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Security] [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 19:48       ` [Security] " Linus Torvalds
@ 2011-03-04 20:07         ` J. Bruce Fields
  2011-03-04 20:30           ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: J. Bruce Fields @ 2011-03-04 20:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Neil Horman, linux-nfs, Trond Myklebust, security, Jeff Layton

On Fri, Mar 04, 2011 at 11:48:23AM -0800, Linus Torvalds wrote:
> On Fri, Mar 4, 2011 at 11:33 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >
> > So I assumed the slab allocator would hold a reference to the page like
> > any other user would, in which case the tcp code could take a second
> > reference of its own.
> 
> So the reason that wouldn't work is simple: the reference is obviously
> at a page level, but slab doles out allocations on its own level.
> 
> What does that mean? Imagine if the network layer takes a ref on the
> page, but then the original user does a "kfree()". The _page_ would
> stay around (we have a ref from it - but so does the slab allocator),
> but the thing is, the slab allocator will release and then re-use the
> slab entry.
>
> So the "hold a reference to the page" doesn't actually _help_. The
> problem isn't the page going away, it's the smaller slab-allocation
> being reused for something else - so the page-level ref would be
> useless.
> 
> So page-level references really only do work with page allocators.
> They don't know about the allocation patterns within a page that slab
> does.

Makes sense, thanks for the explanation.

Could it still make sense to hand off kmalloc'd memory to tcp_send_pages
if we know the kfree won't happen till after the data's sent?

OK, maybe in that case someone above the tcp layer just shouldn't be
assuming they can know when the tcp layer is done with the data.

In this case, we're not kfree()'ing until we've gotten an rpc reply
back.  But in theory perhaps there could be cases where the server's
gotten the data and we've seen the reply but the tcp layer still thinks
it needs to retransmit something?  I don't think we'd care if the data
was still correct in that case, but it could be an information leak if
nothing else.

--b.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Security] [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 20:07         ` J. Bruce Fields
@ 2011-03-04 20:30           ` Jeff Layton
  2011-03-04 20:40             ` Trond Myklebust
       [not found]             ` <20110304153059.79374df7-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 2 replies; 17+ messages in thread
From: Jeff Layton @ 2011-03-04 20:30 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Linus Torvalds, Neil Horman, linux-nfs, Trond Myklebust, security

On Fri, 4 Mar 2011 15:07:03 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Fri, Mar 04, 2011 at 11:48:23AM -0800, Linus Torvalds wrote:
> > On Fri, Mar 4, 2011 at 11:33 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > >
> > > So I assumed the slab allocator would hold a reference to the page like
> > > any other user would, in which case the tcp code could take a second
> > > reference of its own.
> > 
> > So the reason that wouldn't work is simple: the reference is obviously
> > at a page level, but slab doles out allocations on its own level.
> > 
> > What does that mean? Imagine if the network layer takes a ref on the
> > page, but then the original user does a "kfree()". The _page_ would
> > stay around (we have a ref from it - but so does the slab allocator),
> > but the thing is, the slab allocator will release and then re-use the
> > slab entry.
> >
> > So the "hold a reference to the page" doesn't actually _help_. The
> > problem isn't the page going away, it's the smaller slab-allocation
> > being reused for something else - so the page-level ref would be
> > useless.
> > 
> > So page-level references really only do work with page allocators.
> > They don't know about the allocation patterns within a page that slab
> > does.
> 
> Makes sense, thanks for the explanation.
> 
> Could it still make sense to hand off kmalloc'd memory to tcp_send_pages
> if we know the kfree won't happen till after the data's sent?
> 
> OK, maybe in that case someone above the tcp layer just shouldn't be
> assuming they can know when the tcp layer is done with the data.
> 

Right, I don't think we can know that since we (at the RPC layer) don't
get any real notification of when we receive a TCP ACK.

> In this case, we're not kfree()'ing until we've gotten an rpc reply
> back.  But in theory perhaps there could be cases where the server's
> gotten the data and we've seen the reply but the tcp layer still thinks
> it needs to retransmit something?  I don't think we'd care if the data
> was still correct in that case, but it could be an information leak if
> nothing else.

There's also timeouts + soft mounts to consider. We may send the data
on the socket, which gets buffered up and then the caller goes to sleep
waiting for a reply. If that never comes (server crashed or something),
then we can return an error back up to the VFS layer if it's a soft
mount. Meanwhile, the kernel is still trying to send the data on the
socket...

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Security] [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 20:30           ` Jeff Layton
@ 2011-03-04 20:40             ` Trond Myklebust
       [not found]             ` <20110304153059.79374df7-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  1 sibling, 0 replies; 17+ messages in thread
From: Trond Myklebust @ 2011-03-04 20:40 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, Linus Torvalds, Neil Horman, linux-nfs, security

On Fri, 2011-03-04 at 15:30 -0500, Jeff Layton wrote: 
> Right, I don't think we can know that since we (at the RPC layer) don't
> get any real notification of when we receive a TCP ACK.
> 
> > In this case, we're not kfree()'ing until we've gotten an rpc reply
> > back.  But in theory perhaps there could be cases where the server's
> > gotten the data and we've seen the reply but the tcp layer still thinks
> > it needs to retransmit something?  I don't think we'd care if the data
> > was still correct in that case, but it could be an information leak if
> > nothing else.
> 
> There's also timeouts + soft mounts to consider. We may send the data
> on the socket, which gets buffered up and then the caller goes to sleep
> waiting for a reply. If that never comes (server crashed or something),
> then we can return an error back up to the VFS layer if it's a soft
> mount. Meanwhile, the kernel is still trying to send the data on the
> socket...

O_DIRECT writes have a similar problem, BTW. Occasionally we do return
control to userland while the pages are still buffered in the socket
(either due to the above conditions, or due to the fact that we were
retransmitting an RPC request, and we suddenly get a reply to the
previous transmission). If the application then scribbles over the page
data, the socket may end up transmitting a quite different WRITE payload
than what it was supposed to...

For that reason, it would be nice to have some way to know when the
socket is done referencing the data.

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <20110304153059.79374df7-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]

* Re: [Security] [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
       [not found]             ` <20110304153059.79374df7-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2011-03-04 21:04               ` J. Bruce Fields
  0 siblings, 0 replies; 17+ messages in thread
From: J. Bruce Fields @ 2011-03-04 21:04 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Linus Torvalds, Neil Horman, linux-nfs, Trond Myklebust, security

On Fri, Mar 04, 2011 at 03:30:59PM -0500, Jeff Layton wrote:
> On Fri, 4 Mar 2011 15:07:03 -0500
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > In this case, we're not kfree()'ing until we've gotten an rpc reply
> > back.  But in theory perhaps there could be cases where the server's
> > gotten the data and we've seen the reply but the tcp layer still thinks
> > it needs to retransmit something?  I don't think we'd care if the data
> > was still correct in that case, but it could be an information leak if
> > nothing else.
> 
> There's also timeouts + soft mounts to consider. We may send the data
> on the socket, which gets buffered up and then the caller goes to sleep
> waiting for a reply. If that never comes (server crashed or something),
> then we can return an error back up to the VFS layer if it's a soft
> mount. Meanwhile, the kernel is still trying to send the data on the
> socket...

Good point, I forgot about that.

Thanks to everyone for setting me straight!

--b.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 16:44 [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab Neil Horman
  2011-03-04 16:58 ` Christoph Hellwig
  2011-03-04 17:13 ` J. Bruce Fields
@ 2011-03-04 19:01 ` Trond Myklebust
  2011-03-04 19:17   ` Neil Horman
  2 siblings, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2011-03-04 19:01 UTC (permalink / raw)
  To: Neil Horman; +Cc: linux-nfs, security, Jeff Layton

On Fri, 2011-03-04 at 11:44 -0500, Neil Horman wrote: 
> This oops was reported recently:
> 
>  <IRQ>  [<ffffffff800ca161>] bad_page+0x69/0x91
>   [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144
>   [<ffffffff8022f28a>] skb_release_data+0x5f/0x98
>   [<ffffffff80028b88>] __kfree_skb+0x11/0x1a
>   [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868
>   [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9
>   [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa
>   [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6
>   [<ffffffff80099b45>] do_timer+0x2df/0x52c
>   [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263
>   [<ffffffff80035ab8>] ip_rcv+0x539/0x57c
>   [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f
>   [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5
>   [<ffffffff8000c979>] net_rx_action+0xac/0x1b3
>   [<ffffffff80012464>] __do_softirq+0x89/0x133
>   [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
>   [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
>   [<ffffffff8006d485>] do_IRQ+0xec/0xf5
>   [<ffffffff8006bdb5>] default_idle+0x0/0x50
>   [<ffffffff8005d615>] ret_from_intr+0x0/0xa
>   <EOI>  [<ffffffff8006bdde>] default_idle+0x29/0x50
>   [<ffffffff800492fd>] cpu_idle+0x95/0xb8
>   [<ffffffff80461807>] start_kernel+0x220/0x225
>   [<ffffffff8046122f>] _sinittext+0x22f/0x236
> 
> It occurs, because an skb with a fraglist was freed from the tcp retransmit
> queue when it was acked, but a page on that fraglist had PG_Slab set (indicating
> it was allocated from the Slab allocator (which means the free path above can't
> safely free it via put_page)
> 
> We tracked this back to an nfsv4 setacl operation, in which the nfs code
> attempted to fill convert the passed in buffer to an array of pages in
> __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages.
> __nfs4_proc_set_acl just converts each page in the buffer to a page struct via
> virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab
> bit is set.  We can't create a buffer with kmalloc and free it later in the tcp
> ack path with put_page, so we need to either:
> 
> 1) ensure that when we create the list of pages, no page struct has PG_Slab set
> or
> 2) not use a page list to send this data
> 
> Given that these buffers can be multiple pages and arbitrarily sized, I think
> (1) is the right way to go.  I've written the below patch to allocate a page
> from the buddy allocator directly and copy the data over to it.  This ensures
> that we have a put_page free-able page for every entry that winds up on an skb
> frag list, so it can be safely freed when the frame is acked.  We do a put page
> on each entry after the rpc_call_sync call so as to drop our own reference count
> to the page, leaving only the ref count taken by tcp_sendpages.  This way the
> data will be properly freed when the ack comes in
> 
> Successfully tested by myself to solve the above oops.
> 
> Note, as this is the result of a setacl operation that exceeded a page of data,
> I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm
> CCing security on this as well.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> CC: security@kernel.org
> CC: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/nfs/nfs4proc.c |   42 ++++++++++++++++++++++++++++++++++++++++--
>  1 files changed, 40 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 78936a8..e4bd9b5 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -51,6 +51,7 @@
>  #include <linux/sunrpc/bc_xprt.h>
>  #include <linux/xattr.h>
>  #include <linux/utsname.h>
> +#include <linux/mm.h>
>  
>  #include "nfs4_fs.h"
>  #include "delegation.h"
> @@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen,
>  	}
>  }
>  
> +static int buf_to_pages_noslab(const void *buf, size_t buflen,
> +		struct page **pages, unsigned int *pgbase)
> +{
> +	const void *p = buf;
> +	struct page *page, *newpage, **spages;
> +	int rc = -ENOMEM;
> +
> +	spages = pages;
> +	*pgbase = offset_in_page(buf);
> +	p -= *pgbase;
> +	while (p < buf + buflen) {
> +		page = virt_to_page(p);
> +		newpage = alloc_page(GFP_KERNEL);
> +		if (!newpage)
> +			goto unwind;
> +		memcpy(page_address(newpage), page_address(page),
> +		       PAGE_CACHE_SIZE);

Why do we need to keep this byzantian offset_in_page() and
virt_to_page() logic in order to copying data from a linear buffer into
a set of pages?

> +		*(pages++) = newpage;
> +		p += PAGE_CACHE_SIZE;
> +	}
> +
> +	rc = 0;
> +	return rc;
> +
> +unwind:
> +	while (*spages != NULL)
> +		__free_page(*(spages++));
> +	return rc;
> +}
> +
>  struct nfs4_cached_acl {
>  	int cached;
>  	size_t len;
> @@ -3420,13 +3451,20 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl
>  		.rpc_argp	= &arg,
>  		.rpc_resp	= &res,
>  	};
> -	int ret;
> +	int ret, i;
>  
>  	if (!nfs4_server_supports_acls(server))
>  		return -EOPNOTSUPP;
> +	memset(pages, 0, sizeof(struct page *) * NFS4ACL_MAXPAGES);

Why is this initialisation needed? Just return the number of pages that
were actually allocated in the call below.

> +	ret = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
> +	if (ret)
> +		return ret;
>  	nfs_inode_return_delegation(inode);
> -	buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
>  	ret = nfs4_call_sync(server, &msg, &arg, &res, 1);
> +
> +	for (i = 0; pages[i] != NULL; i++)
> +		put_page(pages[i]);
> +
>  	/*
>  	 * Acl update can result in inode attribute update.
>  	 * so mark the attribute cache invalid.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 19:01 ` Trond Myklebust
@ 2011-03-04 19:17   ` Neil Horman
  2011-03-04 19:25     ` Trond Myklebust
  0 siblings, 1 reply; 17+ messages in thread
From: Neil Horman @ 2011-03-04 19:17 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, security, Jeff Layton

On Fri, Mar 04, 2011 at 02:01:55PM -0500, Trond Myklebust wrote:
> On Fri, 2011-03-04 at 11:44 -0500, Neil Horman wrote: 
> > This oops was reported recently:
> > 
> >  <IRQ>  [<ffffffff800ca161>] bad_page+0x69/0x91
> >   [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144
> >   [<ffffffff8022f28a>] skb_release_data+0x5f/0x98
> >   [<ffffffff80028b88>] __kfree_skb+0x11/0x1a
> >   [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868
> >   [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9
> >   [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa
> >   [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6
> >   [<ffffffff80099b45>] do_timer+0x2df/0x52c
> >   [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263
> >   [<ffffffff80035ab8>] ip_rcv+0x539/0x57c
> >   [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f
> >   [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5
> >   [<ffffffff8000c979>] net_rx_action+0xac/0x1b3
> >   [<ffffffff80012464>] __do_softirq+0x89/0x133
> >   [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
> >   [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
> >   [<ffffffff8006d485>] do_IRQ+0xec/0xf5
> >   [<ffffffff8006bdb5>] default_idle+0x0/0x50
> >   [<ffffffff8005d615>] ret_from_intr+0x0/0xa
> >   <EOI>  [<ffffffff8006bdde>] default_idle+0x29/0x50
> >   [<ffffffff800492fd>] cpu_idle+0x95/0xb8
> >   [<ffffffff80461807>] start_kernel+0x220/0x225
> >   [<ffffffff8046122f>] _sinittext+0x22f/0x236
> > 
> > It occurs, because an skb with a fraglist was freed from the tcp retransmit
> > queue when it was acked, but a page on that fraglist had PG_Slab set (indicating
> > it was allocated from the Slab allocator (which means the free path above can't
> > safely free it via put_page)
> > 
> > We tracked this back to an nfsv4 setacl operation, in which the nfs code
> > attempted to fill convert the passed in buffer to an array of pages in
> > __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages.
> > __nfs4_proc_set_acl just converts each page in the buffer to a page struct via
> > virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab
> > bit is set.  We can't create a buffer with kmalloc and free it later in the tcp
> > ack path with put_page, so we need to either:
> > 
> > 1) ensure that when we create the list of pages, no page struct has PG_Slab set
> > or
> > 2) not use a page list to send this data
> > 
> > Given that these buffers can be multiple pages and arbitrarily sized, I think
> > (1) is the right way to go.  I've written the below patch to allocate a page
> > from the buddy allocator directly and copy the data over to it.  This ensures
> > that we have a put_page free-able page for every entry that winds up on an skb
> > frag list, so it can be safely freed when the frame is acked.  We do a put page
> > on each entry after the rpc_call_sync call so as to drop our own reference count
> > to the page, leaving only the ref count taken by tcp_sendpages.  This way the
> > data will be properly freed when the ack comes in
> > 
> > Successfully tested by myself to solve the above oops.
> > 
> > Note, as this is the result of a setacl operation that exceeded a page of data,
> > I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm
> > CCing security on this as well.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> > CC: security@kernel.org
> > CC: Jeff Layton <jlayton@redhat.com>
> > ---
> >  fs/nfs/nfs4proc.c |   42 ++++++++++++++++++++++++++++++++++++++++--
> >  1 files changed, 40 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> > index 78936a8..e4bd9b5 100644
> > --- a/fs/nfs/nfs4proc.c
> > +++ b/fs/nfs/nfs4proc.c
> > @@ -51,6 +51,7 @@
> >  #include <linux/sunrpc/bc_xprt.h>
> >  #include <linux/xattr.h>
> >  #include <linux/utsname.h>
> > +#include <linux/mm.h>
> >  
> >  #include "nfs4_fs.h"
> >  #include "delegation.h"
> > @@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen,
> >  	}
> >  }
> >  
> > +static int buf_to_pages_noslab(const void *buf, size_t buflen,
> > +		struct page **pages, unsigned int *pgbase)
> > +{
> > +	const void *p = buf;
> > +	struct page *page, *newpage, **spages;
> > +	int rc = -ENOMEM;
> > +
> > +	spages = pages;
> > +	*pgbase = offset_in_page(buf);
> > +	p -= *pgbase;
> > +	while (p < buf + buflen) {
> > +		page = virt_to_page(p);
> > +		newpage = alloc_page(GFP_KERNEL);
> > +		if (!newpage)
> > +			goto unwind;
> > +		memcpy(page_address(newpage), page_address(page),
> > +		       PAGE_CACHE_SIZE);
> 
> Why do we need to keep this byzantian offset_in_page() and
> virt_to_page() logic in order to copying data from a linear buffer into
> a set of pages?
> 
We don't I suppose, but I thought it best to follow the byzantine style of the
function immediately above it. :)  If you have a better suggestion, I'm
listening.
 

> > +		*(pages++) = newpage;
> > +		p += PAGE_CACHE_SIZE;
> > +	}
> > +
> > +	rc = 0;
> > +	return rc;
> > +
> > +unwind:
> > +	while (*spages != NULL)
> > +		__free_page(*(spages++));
> > +	return rc;
> > +}
> > +
> >  struct nfs4_cached_acl {
> >  	int cached;
> >  	size_t len;
> > @@ -3420,13 +3451,20 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl
> >  		.rpc_argp	= &arg,
> >  		.rpc_resp	= &res,
> >  	};
> > -	int ret;
> > +	int ret, i;
> >  
> >  	if (!nfs4_server_supports_acls(server))
> >  		return -EOPNOTSUPP;
> > +	memset(pages, 0, sizeof(struct page *) * NFS4ACL_MAXPAGES);
> 
> Why is this initialisation needed? Just return the number of pages that
> were actually allocated in the call below.
> 
Yeah, I suppose I can do that, I'll respin here shortly.
Neil

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 19:17   ` Neil Horman
@ 2011-03-04 19:25     ` Trond Myklebust
  2011-03-04 19:59       ` Neil Horman
  2011-03-04 21:09       ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2) Neil Horman
  0 siblings, 2 replies; 17+ messages in thread
From: Trond Myklebust @ 2011-03-04 19:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: linux-nfs, security, Jeff Layton

On Fri, 2011-03-04 at 14:17 -0500, Neil Horman wrote: 
> On Fri, Mar 04, 2011 at 02:01:55PM -0500, Trond Myklebust wrote:
> > On Fri, 2011-03-04 at 11:44 -0500, Neil Horman wrote:
> > >  
> > > +static int buf_to_pages_noslab(const void *buf, size_t buflen,
> > > +		struct page **pages, unsigned int *pgbase)
> > > +{
> > > +	const void *p = buf;
> > > +	struct page *page, *newpage, **spages;
> > > +	int rc = -ENOMEM;
> > > +
> > > +	spages = pages;
> > > +	*pgbase = offset_in_page(buf);
> > > +	p -= *pgbase;
> > > +	while (p < buf + buflen) {
> > > +		page = virt_to_page(p);
> > > +		newpage = alloc_page(GFP_KERNEL);
> > > +		if (!newpage)
> > > +			goto unwind;
> > > +		memcpy(page_address(newpage), page_address(page),
> > > +		       PAGE_CACHE_SIZE);
> > 
> > Why do we need to keep this byzantian offset_in_page() and
> > virt_to_page() logic in order to copying data from a linear buffer into
> > a set of pages?
> > 
> We don't I suppose, but I thought it best to follow the byzantine style of the
> function immediately above it. :)  If you have a better suggestion, I'm
> listening.

Why isn't something like the following good enough?

	do {
		size_t len = min(PAGE_CACHE_SIZE, buflen);
		struct page *newpage = alloc_page(GFP_KERNEL);

		if (newpage == NULL)
			goto unwind;
		memcpy(page_address(newpage), buf, len);
		copied += len;
		buf += len;
		buflen -= len;
	} while (buflen != 0);




-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab
  2011-03-04 19:25     ` Trond Myklebust
@ 2011-03-04 19:59       ` Neil Horman
  2011-03-04 21:09       ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2) Neil Horman
  1 sibling, 0 replies; 17+ messages in thread
From: Neil Horman @ 2011-03-04 19:59 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, security, Jeff Layton

On Fri, Mar 04, 2011 at 02:25:30PM -0500, Trond Myklebust wrote:
> On Fri, 2011-03-04 at 14:17 -0500, Neil Horman wrote: 
> > On Fri, Mar 04, 2011 at 02:01:55PM -0500, Trond Myklebust wrote:
> > > On Fri, 2011-03-04 at 11:44 -0500, Neil Horman wrote:
> > > >  
> > > > +static int buf_to_pages_noslab(const void *buf, size_t buflen,
> > > > +		struct page **pages, unsigned int *pgbase)
> > > > +{
> > > > +	const void *p = buf;
> > > > +	struct page *page, *newpage, **spages;
> > > > +	int rc = -ENOMEM;
> > > > +
> > > > +	spages = pages;
> > > > +	*pgbase = offset_in_page(buf);
> > > > +	p -= *pgbase;
> > > > +	while (p < buf + buflen) {
> > > > +		page = virt_to_page(p);
> > > > +		newpage = alloc_page(GFP_KERNEL);
> > > > +		if (!newpage)
> > > > +			goto unwind;
> > > > +		memcpy(page_address(newpage), page_address(page),
> > > > +		       PAGE_CACHE_SIZE);
> > > 
> > > Why do we need to keep this byzantian offset_in_page() and
> > > virt_to_page() logic in order to copying data from a linear buffer into
> > > a set of pages?
> > > 
> > We don't I suppose, but I thought it best to follow the byzantine style of the
> > function immediately above it. :)  If you have a better suggestion, I'm
> > listening.
> 
> Why isn't something like the following good enough?
> 
> 	do {
> 		size_t len = min(PAGE_CACHE_SIZE, buflen);
> 		struct page *newpage = alloc_page(GFP_KERNEL);
> 
> 		if (newpage == NULL)
> 			goto unwind;
> 		memcpy(page_address(newpage), buf, len);
> 		copied += len;
> 		buf += len;
> 		buflen -= len;
> 	} while (buflen != 0);
> 
Ah, I see what your saying - you're suggesting that we copy from an offset in
the page to the start of the page and always make the resulting pgbase be zero.
That makes sense.  I'll respin it that way.
Neil

> 
> 
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer
> 
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2)
  2011-03-04 19:25     ` Trond Myklebust
  2011-03-04 19:59       ` Neil Horman
@ 2011-03-04 21:09       ` Neil Horman
  2011-03-04 21:25         ` Trond Myklebust
  1 sibling, 1 reply; 17+ messages in thread
From: Neil Horman @ 2011-03-04 21:09 UTC (permalink / raw)
  To: linux-nfs; +Cc: Neil Horman, Trond Myklebust, security, Jeff Layton

This oops was reported recently:

 <IRQ>  [<ffffffff800ca161>] bad_page+0x69/0x91
  [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144
  [<ffffffff8022f28a>] skb_release_data+0x5f/0x98
  [<ffffffff80028b88>] __kfree_skb+0x11/0x1a
  [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868
  [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9
  [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa
  [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6
  [<ffffffff80099b45>] do_timer+0x2df/0x52c
  [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263
  [<ffffffff80035ab8>] ip_rcv+0x539/0x57c
  [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f
  [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5
  [<ffffffff8000c979>] net_rx_action+0xac/0x1b3
  [<ffffffff80012464>] __do_softirq+0x89/0x133
  [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
  [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
  [<ffffffff8006d485>] do_IRQ+0xec/0xf5
  [<ffffffff8006bdb5>] default_idle+0x0/0x50
  [<ffffffff8005d615>] ret_from_intr+0x0/0xa
  <EOI>  [<ffffffff8006bdde>] default_idle+0x29/0x50
  [<ffffffff800492fd>] cpu_idle+0x95/0xb8
  [<ffffffff80461807>] start_kernel+0x220/0x225
  [<ffffffff8046122f>] _sinittext+0x22f/0x236

It occurs, because an skb with a fraglist was freed from the tcp retransmit
queue when it was acked, but a page on that fraglist had PG_Slab set (indicating
it was allocated from the Slab allocator (which means the free path above can't
safely free it via put_page)

We tracked this back to an nfsv4 setacl operation, in which the nfs code
attempted to fill convert the passed in buffer to an array of pages in
__nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages.
__nfs4_proc_set_acl just converts each page in the buffer to a page struct via
virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab
bit is set.  We can't create a buffer with kmalloc and free it later in the tcp
ack path with put_page, so we need to either:

1) ensure that when we create the list of pages, no page struct has PG_Slab set
or
2) not use a page list to send this data

Given that these buffers can be multiple pages and arbitrarily sized, I think
(1) is the right way to go.  I've written the below patch to allocate a page
from the buddy allocator directly and copy the data over to it.  This ensures
that we have a put_page free-able page for every entry that winds up on an skb
frag list, so it can be safely freed when the frame is acked.  We do a put page
on each entry after the rpc_call_sync call so as to drop our own reference count
to the page, leaving only the ref count taken by tcp_sendpages.  This way the
data will be properly freed when the ack comes in

Successfully tested by myself to solve the above oops.

Note, as this is the result of a setacl operation that exceeded a page of data,
I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm
CCing security on this as well.

Change notes:
Updated to reflect Tronds requests, specifically that buf_to_pages_noslab just
return the number of pages copied to avoid the extra memset.  Also made the copy
operation a bit more efficient and removed one of the virt_to_page calls.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: security@kernel.org
CC: Jeff Layton <jlayton@redhat.com>
---
 fs/nfs/nfs4proc.c |   45 +++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 78936a8..6e906af 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -51,6 +51,7 @@
 #include <linux/sunrpc/bc_xprt.h>
 #include <linux/xattr.h>
 #include <linux/utsname.h>
+#include <linux/mm.h>
 
 #include "nfs4_fs.h"
 #include "delegation.h"
@@ -3252,6 +3253,36 @@ static void buf_to_pages(const void *buf, size_t buflen,
 	}
 }
 
+static int buf_to_pages_noslab(const void *buf, size_t buflen,
+		struct page **pages, unsigned int *pgbase)
+{
+	struct page *newpage, **spages;
+	int rc = 0;
+	size_t len;
+	spages = pages;
+
+	do {
+		len = min(PAGE_CACHE_SIZE, buflen);
+		newpage = alloc_page(GFP_KERNEL);
+
+		if (newpage == NULL)
+			goto unwind;
+		memcpy(page_address(newpage), buf, len);
+                buf += len;
+                buflen -= len;
+		*(pages++) = newpage;
+		rc++;
+	} while (buflen != 0);
+
+	*pgbase = 0;
+	return rc;
+
+unwind:
+	for(; rc > 0; rc--)
+		__free_page(spages[rc-1]);
+	return -ENOMEM;
+}
+
 struct nfs4_cached_acl {
 	int cached;
 	size_t len;
@@ -3420,13 +3451,23 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl
 		.rpc_argp	= &arg,
 		.rpc_resp	= &res,
 	};
-	int ret;
+	int ret, i;
 
 	if (!nfs4_server_supports_acls(server))
 		return -EOPNOTSUPP;
+	i = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
+	if (i < 0)
+		return i;
 	nfs_inode_return_delegation(inode);
-	buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
 	ret = nfs4_call_sync(server, &msg, &arg, &res, 1);
+
+	/*
+	 * Free each page after tx, so the only ref left is 
+	 * held by the network stack
+	 */
+	for (; i > 0; i--)
+		put_page(pages[i-1]);
+
 	/*
 	 * Acl update can result in inode attribute update.
 	 * so mark the attribute cache invalid.
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2)
  2011-03-04 21:09       ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2) Neil Horman
@ 2011-03-04 21:25         ` Trond Myklebust
  2011-03-05  0:26           ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) Neil Horman
  0 siblings, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2011-03-04 21:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: linux-nfs, security, Jeff Layton

On Fri, 2011-03-04 at 16:09 -0500, Neil Horman wrote: 
> +static int buf_to_pages_noslab(const void *buf, size_t buflen,
> +		struct page **pages, unsigned int *pgbase)
> +{
> +	struct page *newpage, **spages;
> +	int rc = 0;
> +	size_t len;
> +	spages = pages;
> +
> +	do {
> +		len = min(PAGE_CACHE_SIZE, buflen);
> +		newpage = alloc_page(GFP_KERNEL);
> +
> +		if (newpage == NULL)
> +			goto unwind;
> +		memcpy(page_address(newpage), buf, len);
> +                buf += len;
> +                buflen -= len;
> +		*(pages++) = newpage;

Nit: the brackets aren't needed here.

> +		rc++;
> +	} while (buflen != 0);
> +
> +	*pgbase = 0;
> +	return rc;
> +
> +unwind:
> +	for(; rc > 0; rc--)
> +		__free_page(spages[rc-1]);
> +	return -ENOMEM;
> +}
> +
>  struct nfs4_cached_acl {
>  	int cached;
>  	size_t len;
> @@ -3420,13 +3451,23 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl
>  		.rpc_argp	= &arg,
>  		.rpc_resp	= &res,
>  	};
> -	int ret;
> +	int ret, i;
>  
>  	if (!nfs4_server_supports_acls(server))
>  		return -EOPNOTSUPP;
> +	i = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase);

NIT: arg.acl_pgbase is already initialised to 0.

> +	if (i < 0)
> +		return i;
>  	nfs_inode_return_delegation(inode);
> -	buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
>  	ret = nfs4_call_sync(server, &msg, &arg, &res, 1);
> +
> +	/*
> +	 * Free each page after tx, so the only ref left is 
> +	 * held by the network stack
> +	 */
> +	for (; i > 0; i--)
> +		put_page(pages[i-1]);
> +
>  	/*
>  	 * Acl update can result in inode attribute update.
>  	 * so mark the attribute cache invalid.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3)
  2011-03-04 21:25         ` Trond Myklebust
@ 2011-03-05  0:26           ` Neil Horman
  0 siblings, 0 replies; 17+ messages in thread
From: Neil Horman @ 2011-03-05  0:26 UTC (permalink / raw)
  To: linux-nfs; +Cc: Neil Horman, Trond Myklebust, security, Jeff Layton

This oops was reported recently:

 <IRQ>  [<ffffffff800ca161>] bad_page+0x69/0x91
  [<ffffffff8000b8ae>] free_hot_cold_page+0x81/0x144
  [<ffffffff8022f28a>] skb_release_data+0x5f/0x98
  [<ffffffff80028b88>] __kfree_skb+0x11/0x1a
  [<ffffffff800243b8>] tcp_ack+0x6a3/0x1868
  [<ffffffff8001bff4>] tcp_rcv_established+0x7a6/0x8b9
  [<ffffffff8003b684>] tcp_v4_do_rcv+0x2a/0x2fa
  [<ffffffff8002730e>] tcp_v4_rcv+0x9a2/0x9f6
  [<ffffffff80099b45>] do_timer+0x2df/0x52c
  [<ffffffff8003495c>] ip_local_deliver+0x19d/0x263
  [<ffffffff80035ab8>] ip_rcv+0x539/0x57c
  [<ffffffff80020ab6>] netif_receive_skb+0x470/0x49f
  [<ffffffff883190b5>] :virtio_net:virtnet_poll+0x46b/0x5c5
  [<ffffffff8000c979>] net_rx_action+0xac/0x1b3
  [<ffffffff80012464>] __do_softirq+0x89/0x133
  [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
  [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
  [<ffffffff8006d485>] do_IRQ+0xec/0xf5
  [<ffffffff8006bdb5>] default_idle+0x0/0x50
  [<ffffffff8005d615>] ret_from_intr+0x0/0xa
  <EOI>  [<ffffffff8006bdde>] default_idle+0x29/0x50
  [<ffffffff800492fd>] cpu_idle+0x95/0xb8
  [<ffffffff80461807>] start_kernel+0x220/0x225
  [<ffffffff8046122f>] _sinittext+0x22f/0x236

It occurs, because an skb with a fraglist was freed from the tcp retransmit
queue when it was acked, but a page on that fraglist had PG_Slab set (indicating
it was allocated from the Slab allocator (which means the free path above can't
safely free it via put_page)

We tracked this back to an nfsv4 setacl operation, in which the nfs code
attempted to fill convert the passed in buffer to an array of pages in
__nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages.
__nfs4_proc_set_acl just converts each page in the buffer to a page struct via
virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab
bit is set.  We can't create a buffer with kmalloc and free it later in the tcp
ack path with put_page, so we need to either:

1) ensure that when we create the list of pages, no page struct has PG_Slab set
or
2) not use a page list to send this data

Given that these buffers can be multiple pages and arbitrarily sized, I think
(1) is the right way to go.  I've written the below patch to allocate a page
from the buddy allocator directly and copy the data over to it.  This ensures
that we have a put_page free-able page for every entry that winds up on an skb
frag list, so it can be safely freed when the frame is acked.  We do a put page
on each entry after the rpc_call_sync call so as to drop our own reference count
to the page, leaving only the ref count taken by tcp_sendpages.  This way the
data will be properly freed when the ack comes in

Successfully tested by myself to solve the above oops.

Note, as this is the result of a setacl operation that exceeded a page of data,
I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm
CCing security on this as well.

Change notes:
v2:
Updated to reflect Tronds requests, specifically that buf_to_pages_noslab just
return the number of pages copied to avoid the extra memset.  Also made the copy
operation a bit more efficient and removed one of the virt_to_page calls.

v3:
Fixed nits (brackets around *pages++ where none are needed and duplicate
initalization of acl_pgbase variable)

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: security@kernel.org
CC: Jeff Layton <jlayton@redhat.com>
---
 fs/nfs/nfs4proc.c |   44 ++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 78936a8..5368c79 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -51,6 +51,7 @@
 #include <linux/sunrpc/bc_xprt.h>
 #include <linux/xattr.h>
 #include <linux/utsname.h>
+#include <linux/mm.h>
 
 #include "nfs4_fs.h"
 #include "delegation.h"
@@ -3252,6 +3253,35 @@ static void buf_to_pages(const void *buf, size_t buflen,
 	}
 }
 
+static int buf_to_pages_noslab(const void *buf, size_t buflen,
+		struct page **pages, unsigned int *pgbase)
+{
+	struct page *newpage, **spages;
+	int rc = 0;
+	size_t len;
+	spages = pages;
+
+	do {
+		len = min(PAGE_CACHE_SIZE, buflen);
+		newpage = alloc_page(GFP_KERNEL);
+
+		if (newpage == NULL)
+			goto unwind;
+		memcpy(page_address(newpage), buf, len);
+                buf += len;
+                buflen -= len;
+		*pages++ = newpage;
+		rc++;
+	} while (buflen != 0);
+
+	return rc;
+
+unwind:
+	for(; rc > 0; rc--)
+		__free_page(spages[rc-1]);
+	return -ENOMEM;
+}
+
 struct nfs4_cached_acl {
 	int cached;
 	size_t len;
@@ -3420,13 +3450,23 @@ static int __nfs4_proc_set_acl(struct inode *inode, const void *buf, size_t bufl
 		.rpc_argp	= &arg,
 		.rpc_resp	= &res,
 	};
-	int ret;
+	int ret, i;
 
 	if (!nfs4_server_supports_acls(server))
 		return -EOPNOTSUPP;
+	i = buf_to_pages_noslab(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
+	if (i < 0)
+		return i;
 	nfs_inode_return_delegation(inode);
-	buf_to_pages(buf, buflen, arg.acl_pages, &arg.acl_pgbase);
 	ret = nfs4_call_sync(server, &msg, &arg, &res, 1);
+
+	/*
+	 * Free each page after tx, so the only ref left is 
+	 * held by the network stack
+	 */
+	for (; i > 0; i--)
+		put_page(pages[i-1]);
+
 	/*
 	 * Acl update can result in inode attribute update.
 	 * so mark the attribute cache invalid.
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-03-05  0:26 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-04 16:44 [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab Neil Horman
2011-03-04 16:58 ` Christoph Hellwig
2011-03-04 17:13 ` J. Bruce Fields
2011-03-04 18:45   ` Neil Horman
2011-03-04 19:33     ` J. Bruce Fields
2011-03-04 19:48       ` [Security] " Linus Torvalds
2011-03-04 20:07         ` J. Bruce Fields
2011-03-04 20:30           ` Jeff Layton
2011-03-04 20:40             ` Trond Myklebust
     [not found]             ` <20110304153059.79374df7-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2011-03-04 21:04               ` J. Bruce Fields
2011-03-04 19:01 ` Trond Myklebust
2011-03-04 19:17   ` Neil Horman
2011-03-04 19:25     ` Trond Myklebust
2011-03-04 19:59       ` Neil Horman
2011-03-04 21:09       ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v2) Neil Horman
2011-03-04 21:25         ` Trond Myklebust
2011-03-05  0:26           ` [PATCH] nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) Neil Horman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).