* [PATCH net] netlink, mmap: transform mmap skb into full skb on taps
@ 2015-09-10 18:05 Daniel Borkmann
2015-09-11 5:11 ` David Miller
0 siblings, 1 reply; 7+ messages in thread
From: Daniel Borkmann @ 2015-09-10 18:05 UTC (permalink / raw)
To: davem; +Cc: chamaken, fw, netdev, Daniel Borkmann
Ken-ichirou reported that running netlink in mmap mode for receive in
combination with nlmon will throw a NULL pointer dereference in
__kfree_skb() on nlmon_xmit(); in my case, I can also trigger an "unable
to handle kernel paging request". The problem is the skb_clone() in
__netlink_deliver_tap_skb() for skbs that are mmapped.

That is, the cloned skb doesn't have a destructor, whereas the mmap
netlink skb has its destructor set to netlink_skb_destructor() in
netlink_ring_setup_skb(). There, skb->head is set to NULL, so that in
such cases __kfree_skb() doesn't perform a skb_release_data() via
skb_release_all(), where skb->head would otherwise be freed through
kfree(head) into the slab allocator, even though the netlink mmap
skb->head points into the mmap buffer. The same has to be done for
large netlink skbs whose data area is vmalloced. Therefore, as
discussed, make a copy for these rather rare cases for now. This fixes
the issue on my and Ken-ichirou's test cases.
Reference: http://thread.gmane.org/gmane.linux.network/371129
Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
---
( This was the last one I had in my queue. )
net/netlink/af_netlink.c | 30 +++++++++++++++++++++++-------
net/netlink/af_netlink.h | 9 +++++++++
2 files changed, 32 insertions(+), 7 deletions(-)
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7f86d3b..4cad99d 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -125,6 +125,24 @@ static inline u32 netlink_group_mask(u32 group)
return group ? 1 << (group - 1) : 0;
}
+static struct sk_buff *netlink_to_full_skb(const struct sk_buff *skb,
+ gfp_t gfp_mask)
+{
+ unsigned int len = skb_end_offset(skb);
+ struct sk_buff *new;
+
+ new = alloc_skb(len, gfp_mask);
+ if (new == NULL)
+ return NULL;
+
+ NETLINK_CB(new).portid = NETLINK_CB(skb).portid;
+ NETLINK_CB(new).dst_group = NETLINK_CB(skb).dst_group;
+ NETLINK_CB(new).creds = NETLINK_CB(skb).creds;
+
+ memcpy(skb_put(new, len), skb->data, len);
+ return new;
+}
+
int netlink_add_tap(struct netlink_tap *nt)
{
if (unlikely(nt->dev->type != ARPHRD_NETLINK))
@@ -206,7 +224,11 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
int ret = -ENOMEM;
dev_hold(dev);
- nskb = skb_clone(skb, GFP_ATOMIC);
+
+ if (netlink_skb_is_mmaped(skb) || is_vmalloc_addr(skb->head))
+ nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
+ else
+ nskb = skb_clone(skb, GFP_ATOMIC);
if (nskb) {
nskb->dev = dev;
nskb->protocol = htons((u16) sk->sk_protocol);
@@ -279,11 +301,6 @@ static void netlink_rcv_wake(struct sock *sk)
}
#ifdef CONFIG_NETLINK_MMAP
-static bool netlink_skb_is_mmaped(const struct sk_buff *skb)
-{
- return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
-}
-
static bool netlink_rx_is_mmaped(struct sock *sk)
{
return nlk_sk(sk)->rx_ring.pg_vec != NULL;
@@ -846,7 +863,6 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
}
#else /* CONFIG_NETLINK_MMAP */
-#define netlink_skb_is_mmaped(skb) false
#define netlink_rx_is_mmaped(sk) false
#define netlink_tx_is_mmaped(sk) false
#define netlink_mmap sock_no_mmap
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 8900840..df9a060 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -59,6 +59,15 @@ static inline struct netlink_sock *nlk_sk(struct sock *sk)
return container_of(sk, struct netlink_sock, sk);
}
+static inline bool netlink_skb_is_mmaped(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETLINK_MMAP
+ return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
+#else
+ return false;
+#endif /* CONFIG_NETLINK_MMAP */
+}
+
struct netlink_table {
struct rhashtable hash;
struct hlist_head mc_list;
--
1.9.3
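[Editorial note: the copy-versus-clone decision in __netlink_deliver_tap_skb() from the patch above can be sketched as a small userspace toy. All types and names below are illustrative stand-ins, not kernel code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model of the patched tap-delivery decision: skbs whose head lives
 * in an mmap ring or a vmalloc area get a full copy into ordinary
 * memory; everything else can be cheaply cloned. */
struct toy_skb {
	bool mmaped;           /* stands in for NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED */
	bool head_is_vmalloc;  /* stands in for is_vmalloc_addr(skb->head) */
	bool is_full_copy;     /* result: deep copy vs. shared-head clone */
};

static struct toy_skb *toy_deliver_tap(const struct toy_skb *skb)
{
	struct toy_skb *nskb = malloc(sizeof(*nskb));

	if (!nskb)
		return NULL;
	*nskb = *skb;
	if (skb->mmaped || skb->head_is_vmalloc) {
		/* "netlink_to_full_skb" path: the tap copy owns plain memory */
		nskb->is_full_copy = true;
		nskb->mmaped = false;
		nskb->head_is_vmalloc = false;
	} else {
		/* "skb_clone" path: sharing the data area is fine here */
		nskb->is_full_copy = false;
	}
	return nskb;
}
```

The point is only the branch: ring-backed and vmalloc-backed heads must never be shared with a tap clone, since freeing the clone would hand memory the slab allocator does not own to kfree().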
* Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps
2015-09-10 18:05 [PATCH net] netlink, mmap: transform mmap skb into full skb on taps Daniel Borkmann
@ 2015-09-11 5:11 ` David Miller
2015-09-11 10:25 ` Daniel Borkmann
0 siblings, 1 reply; 7+ messages in thread
From: David Miller @ 2015-09-11 5:11 UTC (permalink / raw)
To: daniel; +Cc: chamaken, fw, netdev
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 10 Sep 2015 20:05:46 +0200
> Ken-ichirou reported that running netlink in mmap mode for receive in
> combination with nlmon will throw a NULL pointer dereference in
> __kfree_skb() on nlmon_xmit(); in my case, I can also trigger an "unable
> to handle kernel paging request". The problem is the skb_clone() in
> __netlink_deliver_tap_skb() for skbs that are mmapped.
>
> That is, the cloned skb doesn't have a destructor, whereas the mmap
> netlink skb has its destructor set to netlink_skb_destructor() in
> netlink_ring_setup_skb(). There, skb->head is set to NULL, so that in
> such cases __kfree_skb() doesn't perform a skb_release_data() via
> skb_release_all(), where skb->head would otherwise be freed through
> kfree(head) into the slab allocator, even though the netlink mmap
> skb->head points into the mmap buffer. The same has to be done for
> large netlink skbs whose data area is vmalloced. Therefore, as
> discussed, make a copy for these rather rare cases for now. This fixes
> the issue on my and Ken-ichirou's test cases.
>
> Reference: http://thread.gmane.org/gmane.linux.network/371129
> Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
> Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Looking more deeply into this, I think we have the exact same problem
with netlink skbs that use vmalloc memory at skb->head.
We have a special hack to preserve the destructor, but only when
netlink_skb_clone() is used. But these skbs can escape anywhere and
eventually be cloned, as we have seen with the mmap stuff.
I'm wondering if we should do something more precise to fix this,
and in a way that handles both the mmap and vmalloc cases.
How about we burn the last flag bit in sk_buff:
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 639e9b8..47d5875 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -97,22 +97,6 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
void netlink_detachskb(struct sock *sk, struct sk_buff *skb);
int netlink_sendskb(struct sock *sk, struct sk_buff *skb);
-static inline struct sk_buff *
-netlink_skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
-{
- struct sk_buff *nskb;
-
- nskb = skb_clone(skb, gfp_mask);
- if (!nskb)
- return NULL;
-
- /* This is a large skb, set destructor callback to release head */
- if (is_vmalloc_addr(skb->head))
- nskb->destructor = skb->destructor;
-
- return nskb;
-}
-
/*
* skb should fit one page. This choice is good for headerless malloc.
* But we should limit to 8K so that userspace does not have to
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2738d35..77b804c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -584,8 +584,9 @@ struct sk_buff {
fclone:2,
peeked:1,
head_frag:1,
- xmit_more:1;
- /* one bit hole */
+ xmit_more:1,
+ clone_preserves_destructor;
+
kmemcheck_bitfield_end(flags1);
/* fields enclosed in headers_start/headers_end are copied
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index dad4dd3..4a7b8e3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -825,7 +825,8 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
n->cloned = 1;
n->nohdr = 0;
- n->destructor = NULL;
+ if (!skb->clone_preserves_destructor)
+ n->destructor = NULL;
C(tail);
C(end);
C(head);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6fcbd21..2ec5425 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1075,7 +1075,7 @@ static void nl_fib_input(struct sk_buff *skb)
nlmsg_len(nlh) < sizeof(*frn))
return;
- skb = netlink_skb_clone(skb, GFP_KERNEL);
+ skb = skb_clone(skb, GFP_KERNEL);
if (!skb)
return;
nlh = nlmsg_hdr(skb);
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 70277b1..4c612481 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -291,7 +291,7 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
replay:
status = 0;
- skb = netlink_skb_clone(oskb, GFP_KERNEL);
+ skb = skb_clone(oskb, GFP_KERNEL);
if (!skb)
return netlink_ack(oskb, nlh, -ENOMEM);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7f86d3b..214f1a1 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -719,6 +719,7 @@ static void netlink_ring_setup_skb(struct sk_buff *skb, struct sock *sk,
skb->end = skb->tail + size;
skb->len = 0;
+ skb->clone_preserves_destructor = 1;
skb->destructor = netlink_skb_destructor;
NETLINK_CB(skb).flags |= NETLINK_SKB_MMAPED;
NETLINK_CB(skb).sk = sk;
@@ -854,6 +855,14 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
#define netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group, scm) 0
#endif /* CONFIG_NETLINK_MMAP */
+static bool skb_can_release_head(struct sk_buff *skb)
+{
+ if (!skb->cloned ||
+ !atomic_dec_return(&(skb_shinfo(skb)->dataref)))
+ return true;
+ return false;
+}
+
static void netlink_skb_destructor(struct sk_buff *skb)
{
#ifdef CONFIG_NETLINK_MMAP
@@ -866,31 +875,35 @@ static void netlink_skb_destructor(struct sk_buff *skb)
* the status. In the direction userspace to kernel, the status is
* always reset here after the packet was processed and freed.
*/
- if (netlink_skb_is_mmaped(skb)) {
- hdr = netlink_mmap_hdr(skb);
- sk = NETLINK_CB(skb).sk;
+ if (!netlink_skb_is_mmaped(skb))
+ goto not_mmaped;
- if (NETLINK_CB(skb).flags & NETLINK_SKB_TX) {
- netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
- ring = &nlk_sk(sk)->tx_ring;
- } else {
- if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
- hdr->nm_len = 0;
- netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
- }
- ring = &nlk_sk(sk)->rx_ring;
- }
+ if (!skb_can_release_head(skb))
+ goto clone_refs_remain;
- WARN_ON(atomic_read(&ring->pending) == 0);
- atomic_dec(&ring->pending);
- sock_put(sk);
+ hdr = netlink_mmap_hdr(skb);
+ sk = NETLINK_CB(skb).sk;
- skb->head = NULL;
+ if (NETLINK_CB(skb).flags & NETLINK_SKB_TX) {
+ netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
+ ring = &nlk_sk(sk)->tx_ring;
+ } else {
+ if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
+ hdr->nm_len = 0;
+ netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
+ }
+ ring = &nlk_sk(sk)->rx_ring;
}
+
+ WARN_ON(atomic_read(&ring->pending) == 0);
+ atomic_dec(&ring->pending);
+ sock_put(sk);
+clone_refs_remain:
+ skb->head = NULL;
+not_mmaped:
#endif
if (is_vmalloc_addr(skb->head)) {
- if (!skb->cloned ||
- !atomic_dec_return(&(skb_shinfo(skb)->dataref)))
+ if (skb_can_release_head(skb))
vfree(skb->head);
skb->head = NULL;
@@ -1669,6 +1682,7 @@ static struct sk_buff *netlink_alloc_large_skb(unsigned int size,
if (data == NULL)
return NULL;
+ skb->clone_preserves_destructor = 1;
skb = __build_skb(data, size);
if (skb == NULL)
vfree(data);
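[Editorial note: the skb_can_release_head() helper introduced in the patch above encodes a small refcount rule. A userspace sketch with toy types (illustrative only, not kernel code) makes the short-circuit visible: an uncloned skb never touches the counter, while clones free the head only on the last dataref.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_skb {
	bool cloned;
	atomic_int *dataref;  /* shared between clones, like skb_shinfo()->dataref */
};

static bool toy_skb_can_release_head(struct toy_skb *skb)
{
	/* note the short-circuit: an uncloned skb never reads dataref */
	if (!skb->cloned ||
	    atomic_fetch_sub(skb->dataref, 1) == 1)
		return true;
	return false;
}
```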
* Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps
2015-09-11 5:11 ` David Miller
@ 2015-09-11 10:25 ` Daniel Borkmann
2015-09-11 19:42 ` David Miller
0 siblings, 1 reply; 7+ messages in thread
From: Daniel Borkmann @ 2015-09-11 10:25 UTC (permalink / raw)
To: David Miller; +Cc: chamaken, fw, netdev
On 09/11/2015 07:11 AM, David Miller wrote:
...
> Looking more deeply into this, I think we have the exact same problem
> with netlink skbs that use vmalloc memory at skb->head.
Yes, agreed, the test in the patch covered those as well via:
if (netlink_skb_is_mmaped(skb) || is_vmalloc_addr(skb->head))
> We have a special hack to preserve the destructor, but only when
> netlink_skb_clone() is used. But these skbs can escape anywhere and
> eventually be cloned, as we have seen with the mmap stuff.
Yes, it looks like they are currently only used from user space to
kernel space. I saw that 3a36515f7294 ("netlink: fix splat in skb_clone
with large messages") fixed a couple of these in upper layers with
regard to large skbs, so there's a chance that this can be overlooked
rather easily in other places as well and then only come to light in
cases where we allocate more than NLMSG_GOODSIZE, so that we don't
actually take the alloc_skb() path. :/ So I like your idea below!
> I'm wondering if we should do something more precise to fix this,
> and in a way that handles both the mmap and vmalloc cases.
Perhaps it might also be useful should the kernel one day want to use
netlink_alloc_large_skb() for answers back to user space, or in places
where we use network skbs (I haven't looked into this with regard to
that yet).
Some more comments below:
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 2738d35..77b804c 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -584,8 +584,9 @@ struct sk_buff {
> fclone:2,
> peeked:1,
> head_frag:1,
> - xmit_more:1;
> - /* one bit hole */
> + xmit_more:1,
> + clone_preserves_destructor;
( Nit: maybe as clone_preserves_destructor:1 )
> +
> kmemcheck_bitfield_end(flags1);
>
> /* fields enclosed in headers_start/headers_end are copied
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index dad4dd3..4a7b8e3 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -825,7 +825,8 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
> n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
> n->cloned = 1;
> n->nohdr = 0;
> - n->destructor = NULL;
> + if (!skb->clone_preserves_destructor)
> + n->destructor = NULL;
I think we also need here:
else
C(destructor);
> C(tail);
> C(end);
> C(head);
[...]
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 7f86d3b..214f1a1 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -719,6 +719,7 @@ static void netlink_ring_setup_skb(struct sk_buff *skb, struct sock *sk,
> skb->end = skb->tail + size;
> skb->len = 0;
>
> + skb->clone_preserves_destructor = 1;
> skb->destructor = netlink_skb_destructor;
> NETLINK_CB(skb).flags |= NETLINK_SKB_MMAPED;
> NETLINK_CB(skb).sk = sk;
> @@ -854,6 +855,14 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
> #define netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group, scm) 0
> #endif /* CONFIG_NETLINK_MMAP */
One more thing that came to my mind: for netlink mmap skbs, the
skb->clone_preserves_destructor bit is actually not enough.
Merely calling into skb_clone() is an issue itself, as the data area is
a user space buffer, and skb_clone() as well as skb_copy() access the
skb_shinfo() area. :/ So in that regard netlink mmap skbs are even more
restrictive in what we can do than netlink large skbs.
Best,
Daniel
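[Editorial note: as a sketch of why the above is so: by kernel convention the shared info block sits at skb->end, i.e. at the tail of the data buffer itself. The hypothetical toy layout below (assumed names, not kernel code) illustrates that for a ring skb this places skb_shinfo() inside the user-mapped frame.]

```c
#include <assert.h>

/* Minimal mirror of the layout convention skb_shinfo(skb) == skb->end:
 * the shared info lives at the end of the data buffer, so whoever owns
 * the buffer (the user, for an mmap ring frame) also "owns" shinfo. */
struct toy_shinfo { int nr_frags; int dataref; };

struct toy_skb {
	unsigned char *head;  /* start of the (possibly user-mapped) buffer */
	unsigned char *end;   /* end of the data area; shinfo starts here */
};

static struct toy_shinfo *toy_shinfo(const struct toy_skb *skb)
{
	return (struct toy_shinfo *)skb->end;
}
```

This is why skb_clone() and skb_copy() are unsafe on mmap netlink skbs even with a destructor-preserving flag: they read and write a structure that lies entirely in memory the kernel does not own.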
* Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps
2015-09-11 10:25 ` Daniel Borkmann
@ 2015-09-11 19:42 ` David Miller
2015-09-11 20:35 ` Daniel Borkmann
0 siblings, 1 reply; 7+ messages in thread
From: David Miller @ 2015-09-11 19:42 UTC (permalink / raw)
To: daniel; +Cc: chamaken, fw, netdev
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri, 11 Sep 2015 12:25:45 +0200
> Merely calling into skb_clone() is an issue itself, as the data
> area is a user space buffer, and skb_clone() as well as skb_copy()
> access the skb_shinfo() area. :/ So in that regard netlink mmap skbs
> are even more restrictive in what we can do than netlink large skbs.
Indeed, this is fatal.
So we'd still need something special like your netlink_to_full_skb()
to avoid touching the skb_shinfo() area.
I thought briefly about somehow cobbling up extra space in the ring
entries so we could have a real skb_shinfo() there, but that's illegal
too, as the user could scribble all over it randomly while we interpret
the contents. We don't own that memory. So this doesn't work.
We could rename the clone_preserves_destructor bit and have it also
mean that the SKB lacks frags and that skb_shinfo() should not be
inspected.
Something like this:
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 639e9b8..47d5875 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -97,22 +97,6 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
void netlink_detachskb(struct sock *sk, struct sk_buff *skb);
int netlink_sendskb(struct sock *sk, struct sk_buff *skb);
-static inline struct sk_buff *
-netlink_skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
-{
- struct sk_buff *nskb;
-
- nskb = skb_clone(skb, gfp_mask);
- if (!nskb)
- return NULL;
-
- /* This is a large skb, set destructor callback to release head */
- if (is_vmalloc_addr(skb->head))
- nskb->destructor = skb->destructor;
-
- return nskb;
-}
-
/*
* skb should fit one page. This choice is good for headerless malloc.
* But we should limit to 8K so that userspace does not have to
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2738d35..898c53d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -584,8 +584,9 @@ struct sk_buff {
fclone:2,
peeked:1,
head_frag:1,
- xmit_more:1;
- /* one bit hole */
+ xmit_more:1,
+ private_buffers:1;
+
kmemcheck_bitfield_end(flags1);
/* fields enclosed in headers_start/headers_end are copied
@@ -2220,7 +2221,8 @@ static inline void skb_orphan(struct sk_buff *skb)
*/
static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
{
- if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+ if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) ||
+ skb->private_buffers))
return 0;
return skb_copy_ubufs(skb, gfp_mask);
}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index dad4dd3..54f9d6e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -825,7 +825,10 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
n->cloned = 1;
n->nohdr = 0;
- n->destructor = NULL;
+ if (!skb->private_buffers)
+ n->destructor = NULL;
+ else
+ C(destructor);
C(tail);
C(end);
C(head);
@@ -1675,6 +1678,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
to += copy;
}
+ if (skb->private_buffers)
+ goto out;
+
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
int end;
skb_frag_t *f = &skb_shinfo(skb)->frags[i];
@@ -1720,7 +1726,7 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
}
start = end;
}
-
+out:
if (!len)
return 0;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6fcbd21..2ec5425 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1075,7 +1075,7 @@ static void nl_fib_input(struct sk_buff *skb)
nlmsg_len(nlh) < sizeof(*frn))
return;
- skb = netlink_skb_clone(skb, GFP_KERNEL);
+ skb = skb_clone(skb, GFP_KERNEL);
if (!skb)
return;
nlh = nlmsg_hdr(skb);
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 70277b1..4c612481 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -291,7 +291,7 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
replay:
status = 0;
- skb = netlink_skb_clone(oskb, GFP_KERNEL);
+ skb = skb_clone(oskb, GFP_KERNEL);
if (!skb)
return netlink_ack(oskb, nlh, -ENOMEM);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7f86d3b..523adac 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -719,6 +719,7 @@ static void netlink_ring_setup_skb(struct sk_buff *skb, struct sock *sk,
skb->end = skb->tail + size;
skb->len = 0;
+ skb->private_buffers = 1;
skb->destructor = netlink_skb_destructor;
NETLINK_CB(skb).flags |= NETLINK_SKB_MMAPED;
NETLINK_CB(skb).sk = sk;
@@ -854,6 +855,14 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
#define netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group, scm) 0
#endif /* CONFIG_NETLINK_MMAP */
+static bool skb_can_release_head(struct sk_buff *skb)
+{
+ if (!skb->cloned ||
+ !atomic_dec_return(&(skb_shinfo(skb)->dataref)))
+ return true;
+ return false;
+}
+
static void netlink_skb_destructor(struct sk_buff *skb)
{
#ifdef CONFIG_NETLINK_MMAP
@@ -866,31 +875,35 @@ static void netlink_skb_destructor(struct sk_buff *skb)
* the status. In the direction userspace to kernel, the status is
* always reset here after the packet was processed and freed.
*/
- if (netlink_skb_is_mmaped(skb)) {
- hdr = netlink_mmap_hdr(skb);
- sk = NETLINK_CB(skb).sk;
+ if (!netlink_skb_is_mmaped(skb))
+ goto not_mmaped;
- if (NETLINK_CB(skb).flags & NETLINK_SKB_TX) {
- netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
- ring = &nlk_sk(sk)->tx_ring;
- } else {
- if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
- hdr->nm_len = 0;
- netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
- }
- ring = &nlk_sk(sk)->rx_ring;
- }
+ if (!skb_can_release_head(skb))
+ goto clone_refs_remain;
- WARN_ON(atomic_read(&ring->pending) == 0);
- atomic_dec(&ring->pending);
- sock_put(sk);
+ hdr = netlink_mmap_hdr(skb);
+ sk = NETLINK_CB(skb).sk;
- skb->head = NULL;
+ if (NETLINK_CB(skb).flags & NETLINK_SKB_TX) {
+ netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
+ ring = &nlk_sk(sk)->tx_ring;
+ } else {
+ if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
+ hdr->nm_len = 0;
+ netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
+ }
+ ring = &nlk_sk(sk)->rx_ring;
}
+
+ WARN_ON(atomic_read(&ring->pending) == 0);
+ atomic_dec(&ring->pending);
+ sock_put(sk);
+clone_refs_remain:
+ skb->head = NULL;
+not_mmaped:
#endif
if (is_vmalloc_addr(skb->head)) {
- if (!skb->cloned ||
- !atomic_dec_return(&(skb_shinfo(skb)->dataref)))
+ if (skb_can_release_head(skb))
vfree(skb->head);
skb->head = NULL;
@@ -1669,6 +1682,7 @@ static struct sk_buff *netlink_alloc_large_skb(unsigned int size,
if (data == NULL)
return NULL;
+ skb->private_buffers = 1;
skb = __build_skb(data, size);
if (skb == NULL)
vfree(data);
* Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps
2015-09-11 19:42 ` David Miller
@ 2015-09-11 20:35 ` Daniel Borkmann
2015-09-11 21:34 ` David Miller
0 siblings, 1 reply; 7+ messages in thread
From: Daniel Borkmann @ 2015-09-11 20:35 UTC (permalink / raw)
To: David Miller; +Cc: chamaken, fw, netdev
On 09/11/2015 09:42 PM, David Miller wrote:
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Fri, 11 Sep 2015 12:25:45 +0200
>
>> Merely calling into skb_clone() is an issue itself, as the data
>> area is a user space buffer, and skb_clone() as well as skb_copy()
>> access the skb_shinfo() area. :/ So in that regard netlink mmap skbs
>> are even more restrictive in what we can do than netlink large skbs.
>
> Indeed, this is fatal.
>
> So we'd still need something special like your netlink_to_full_skb()
> to avoid touching the skb_shinfo() area.
>
> I thought briefly about somehow cobbling up extra space in the ring
> entries so we could have a real skb_shinfo() there, but that's illegal
> too as the user could scribble all over it randomly while we interpret
> the contents. We don't own that memory. So this doesn't work.
Yes, agreed.
> We could rename the clone_preserves_destructor bit and have it also
> mean that the SKB lacks frags and that skb_shinfo() should not be
> inspected.
>
> Something like this:
[...]
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 2738d35..898c53d 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
[...]
> @@ -2220,7 +2221,8 @@ static inline void skb_orphan(struct sk_buff *skb)
> */
> static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
> {
> - if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
> + if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) ||
> + skb->private_buffers))
(These two would need to be swapped.)
> return 0;
> return skb_copy_ubufs(skb, gfp_mask);
> }
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index dad4dd3..54f9d6e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -825,7 +825,10 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
> n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
> n->cloned = 1;
> n->nohdr = 0;
> - n->destructor = NULL;
> + if (!skb->private_buffers)
> + n->destructor = NULL;
> + else
> + C(destructor);
> C(tail);
> C(end);
> C(head);
We would also have to conditionally skip the __skb_clone()'s ...
atomic_inc(&(skb_shinfo(skb)->dataref));
Thus, the issue here is that while netlink_alloc_large_skb() and
netlink_ring_setup_skb() would both set skb->private_buffers = 1,
the large skb case would actually need to inspect the dataref count
(which it also can legally do) to properly release the vmalloc'ed
area again, while the mmap case must not even touch it. So if I
see this correctly, it looks like it's unfortunately not possible
to combine the two cases in a single flag. :/
If there's a good case for burning this flag outside of netlink, e.g.
for vmalloc-backed memory on skbs, it could be solved like that, while
the mmap case could be declared netlink's problem. ;) I currently don't
have a better idea than to copy these guys, hmmm.
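[Editorial note: the conflict can be stated as two independent predicates. The toy sketch below (assumed names, not kernel code) shows why one private_buffers bit cannot encode both policies: vmalloc heads still require the dataref dance on release, while mmap heads forbid touching skb_shinfo() at all.]

```c
#include <assert.h>
#include <stdbool.h>

enum toy_head_kind { TOY_HEAD_SLAB, TOY_HEAD_VMALLOC, TOY_HEAD_MMAP };

/* May generic code read/write skb_shinfo() for this head? Only the mmap
 * ring forbids it, because that memory is user-owned. */
static bool toy_may_touch_shinfo(enum toy_head_kind k)
{
	return k != TOY_HEAD_MMAP;
}

/* Does releasing the head require decrementing shinfo->dataref first?
 * Only the vmalloc case, which must vfree() on the last reference. */
static bool toy_release_needs_dataref(enum toy_head_kind k)
{
	return k == TOY_HEAD_VMALLOC;
}
```

Since the vmalloc case answers yes/yes and the mmap case no/no, a single bit carrying both meanings would necessarily mishandle one of them.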
[...]
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 7f86d3b..523adac 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -854,6 +855,14 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
> #define netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group, scm) 0
> #endif /* CONFIG_NETLINK_MMAP */
>
> +static bool skb_can_release_head(struct sk_buff *skb)
> +{
> + if (!skb->cloned ||
> + !atomic_dec_return(&(skb_shinfo(skb)->dataref)))
> + return true;
> + return false;
> +}
> +
> static void netlink_skb_destructor(struct sk_buff *skb)
> {
> #ifdef CONFIG_NETLINK_MMAP
> @@ -866,31 +875,35 @@ static void netlink_skb_destructor(struct sk_buff *skb)
[...]
* Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps
2015-09-11 20:35 ` Daniel Borkmann
@ 2015-09-11 21:34 ` David Miller
2015-09-11 22:18 ` Daniel Borkmann
0 siblings, 1 reply; 7+ messages in thread
From: David Miller @ 2015-09-11 21:34 UTC (permalink / raw)
To: daniel; +Cc: chamaken, fw, netdev
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri, 11 Sep 2015 22:35:08 +0200
> On 09/11/2015 09:42 PM, David Miller wrote:
>> @@ -2220,7 +2221,8 @@ static inline void skb_orphan(struct sk_buff *skb)
>> */
>> static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
>> {
>> - if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
>> + if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) ||
>> + skb->private_buffers))
>
> (These two would need to be swapped.)
Right, good catch.
>> - n->destructor = NULL;
>> + if (!skb->private_buffers)
>> + n->destructor = NULL;
>> + else
>> + C(destructor);
>> C(tail);
>> C(end);
>> C(head);
>
> We would also have to conditionally skip the __skb_clone()'s ...
>
> atomic_inc(&(skb_shinfo(skb)->dataref));
Sigh, the solution was too good to be true, I guess. :-/ That's right,
we can't touch skb_shinfo() for mmap skbs.
Another approach would be to put the mmap user data into a page frag,
but that obviously has some costs associated with it. However,
nothing in netlink is ready for fragged skbs yet. It's the reason why
we have the large skb via vmalloc facility.
Long term fixing that is probably the way to go, but that would be an
enormous project.
So for now I'm going to apply your patch Daniel, thanks.
* Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps
2015-09-11 21:34 ` David Miller
@ 2015-09-11 22:18 ` Daniel Borkmann
0 siblings, 0 replies; 7+ messages in thread
From: Daniel Borkmann @ 2015-09-11 22:18 UTC (permalink / raw)
To: David Miller; +Cc: chamaken, fw, netdev
On 09/11/2015 11:34 PM, David Miller wrote:
...
> Another approach would be to put the mmap user data into a page frag,
> but that obviously has some costs associated with it. However,
> nothing in netlink is ready for fragged skbs yet. It's the reason why
> we have the large skb via vmalloc facility.
>
> Long term fixing that is probably the way to go, but that would be an
> enormous project.
Yeah, that could resolve it.