netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v3 0/2] net: permit skb_segment on head_frag frag_list skb
@ 2018-03-20 23:21 Yonghong Song
  2018-03-20 23:21 ` [PATCH net-next v3 1/2] " Yonghong Song
  2018-03-20 23:21 ` [PATCH net-next v3 2/2] net: bpf: add a test for skb_segment in test_bpf module Yonghong Song
  0 siblings, 2 replies; 9+ messages in thread
From: Yonghong Song @ 2018-03-20 23:21 UTC (permalink / raw)
  To: edumazet, ast, daniel, diptanu, netdev, alexander.duyck; +Cc: kernel-team

One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.
 ...
    3665                 while (pos < offset + len) {
    3666                         if (i >= nfrags) {
    3667                                 BUG_ON(skb_headlen(list_skb));
 ...

The triggering input skb has the following properties:
    list_skb = skb->frag_list;
    skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.

Patch #1 provides a simple solution to avoid BUG_ON. If
list_skb->head_frag is true, its page-backed frag will
be processed before the list_skb->frags.
Patch #2 provides a test case in test_bpf module which
constructs a skb and calls skb_segment() directly. The test
case is able to trigger the BUG_ON without Patch #1.

The patch has been tested in the following setup:
  ipv6_host <-> nat_server <-> ipv4_host
where nat_server has a bpf program doing ipv4<->ipv6
translation and forwarding through clsact hook
bpf_skb_change_proto.

Changelog:
v2 -> v3:
  . Use starting frag index -1 (instead of 0) to
    special process head_frag before other frags in the skb,
    suggested by Alexander Duyck.
v1 -> v2:
  . Removed never-hit BUG_ON, spotted by Linyu Yuan.


Yonghong Song (2):
  net: permit skb_segment on head_frag frag_list skb
  net: bpf: add a test for skb_segment in test_bpf module

 lib/test_bpf.c    | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 net/core/skbuff.c | 51 ++++++++++++++++++++++++++++-----------
 2 files changed, 107 insertions(+), 15 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net-next v3 1/2] net: permit skb_segment on head_frag frag_list skb
  2018-03-20 23:21 [PATCH net-next v3 0/2] net: permit skb_segment on head_frag frag_list skb Yonghong Song
@ 2018-03-20 23:21 ` Yonghong Song
  2018-03-20 23:50   ` Alexander Duyck
  2018-03-20 23:21 ` [PATCH net-next v3 2/2] net: bpf: add a test for skb_segment in test_bpf module Yonghong Song
  1 sibling, 1 reply; 9+ messages in thread
From: Yonghong Song @ 2018-03-20 23:21 UTC (permalink / raw)
  To: edumazet, ast, daniel, diptanu, netdev, alexander.duyck; +Cc: kernel-team

One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473                             netdev_features_t features)
3474 {
3475         struct sk_buff *segs = NULL;
3476         struct sk_buff *tail = NULL;
...
3665                 while (pos < offset + len) {
3666                         if (i >= nfrags) {
3667                                 BUG_ON(skb_headlen(list_skb));
3668
3669                                 i = 0;
3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
3671                                 frag = skb_shinfo(list_skb)->frags;
3672                                 frag_skb = list_skb;
...

call stack:
...
 #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
 #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
 #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
 #4 [ffff883ffef03668] die at ffffffff8101deb2
 #5 [ffff883ffef03698] do_trap at ffffffff8101a700
 #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
 #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
 #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
    [exception RIP: skb_segment+3044]
    RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
    RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
    RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
    RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
    R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
    R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
--- <IRQ stack> ---
...

The triggering input skb has the following properties:
    list_skb = skb->frag_list;
    skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.

This patch addressed the issue by handling skb_headlen(list_skb) != 0
case properly if list_skb->head_frag is true, which is expected in
most cases. The head frag is processed before list_skb->frags
are processed.

Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 net/core/skbuff.c | 51 +++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c134..59bbc06 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	struct sk_buff *segs = NULL;
 	struct sk_buff *tail = NULL;
 	struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
-	skb_frag_t *frag = skb_shinfo(head_skb)->frags;
+	skb_frag_t *frag = skb_shinfo(head_skb)->frags, *head_frag = NULL;
 	unsigned int mss = skb_shinfo(head_skb)->gso_size;
 	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
 	struct sk_buff *frag_skb = head_skb;
@@ -3664,19 +3664,39 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		while (pos < offset + len) {
 			if (i >= nfrags) {
-				BUG_ON(skb_headlen(list_skb));
-
 				i = 0;
+				if (skb_headlen(list_skb)) {
+					struct page *page;
+
+					BUG_ON(!list_skb->head_frag);
+
+					page = virt_to_head_page(list_skb->head);
+					if (!head_frag) {
+						head_frag = kmalloc(sizeof(skb_frag_t),
+								    GFP_KERNEL);
+						if (!head_frag)
+							goto err;
+					}
+					head_frag->page.p = page;
+					head_frag->page_offset = list_skb->data -
+						(unsigned char *)page_address(page);
+					head_frag->size = skb_headlen(list_skb);
+					/* set i = -1 so we will pick head_frag
+					 * instead of skb_shinfo(list_skb)->frags
+					 * when i == -1.
+					 */
+					i = -1;
+				}
 				nfrags = skb_shinfo(list_skb)->nr_frags;
-				frag = skb_shinfo(list_skb)->frags;
-				frag_skb = list_skb;
-
-				BUG_ON(!nfrags);
-
-				if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
-				    skb_zerocopy_clone(nskb, frag_skb,
-						       GFP_ATOMIC))
-					goto err;
+				if (nfrags) {
+					frag = skb_shinfo(list_skb)->frags;
+					frag_skb = list_skb;
+
+					if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
+					    skb_zerocopy_clone(nskb, frag_skb,
+							       GFP_ATOMIC))
+						goto err;
+				}
 
 				list_skb = list_skb->next;
 			}
@@ -3689,7 +3709,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 				goto err;
 			}
 
-			*nskb_frag = *frag;
+			*nskb_frag = (i == -1) ? *head_frag : *frag;
 			__skb_frag_ref(nskb_frag);
 			size = skb_frag_size(nskb_frag);
 
@@ -3702,7 +3722,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 			if (pos + size <= offset + len) {
 				i++;
-				frag++;
+				if (i != 0)
+					frag++;
 				pos += size;
 			} else {
 				skb_frag_size_sub(nskb_frag, pos + size - (offset + len));
@@ -3774,10 +3795,12 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 		swap(tail->destructor, head_skb->destructor);
 		swap(tail->sk, head_skb->sk);
 	}
+	kfree(head_frag);
 	return segs;
 
 err:
 	kfree_skb_list(segs);
+	kfree(head_frag);
 	return ERR_PTR(err);
 }
 EXPORT_SYMBOL_GPL(skb_segment);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net-next v3 2/2] net: bpf: add a test for skb_segment in test_bpf module
  2018-03-20 23:21 [PATCH net-next v3 0/2] net: permit skb_segment on head_frag frag_list skb Yonghong Song
  2018-03-20 23:21 ` [PATCH net-next v3 1/2] " Yonghong Song
@ 2018-03-20 23:21 ` Yonghong Song
  2018-03-21  0:44   ` Eric Dumazet
  1 sibling, 1 reply; 9+ messages in thread
From: Yonghong Song @ 2018-03-20 23:21 UTC (permalink / raw)
  To: edumazet, ast, daniel, diptanu, netdev, alexander.duyck; +Cc: kernel-team

Without the previous commit,
"modprobe test_bpf" will have the following errors:
...
[   98.149165] ------------[ cut here ]------------
[   98.159362] kernel BUG at net/core/skbuff.c:3667!
[   98.169756] invalid opcode: 0000 [#1] SMP PTI
[   98.179370] Modules linked in:
[   98.179371]  test_bpf(+)
...
which triggers the bug the previous commit intends to fix.

The skbs are constructed to mimic what mlx5 may generate.
The packet size/header may not mimic real cases in production. But
the processing flow is similar.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 lib/test_bpf.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 2efb213..045d7d3 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6574,6 +6574,72 @@ static bool exclude_test(int test_id)
 	return test_id < test_range[0] || test_id > test_range[1];
 }
 
+static struct sk_buff *build_test_skb(void *page)
+{
+	u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
+	struct sk_buff *skb[2];
+	int i, data_size = 8;
+
+	for (i = 0; i < 2; i++) {
+		/* this will set skb[i]->head_frag */
+		skb[i] = build_skb(page, headroom);
+		if (!skb[i])
+			return NULL;
+
+		skb_reserve(skb[i], headroom);
+		skb_put(skb[i], data_size);
+		skb[i]->protocol = htons(ETH_P_IP);
+		skb_reset_network_header(skb[i]);
+		skb_set_mac_header(skb[i], -ETH_HLEN);
+
+		skb_add_rx_frag(skb[i], skb_shinfo(skb[i])->nr_frags,
+				page, 0, 64, 64);
+		// skb: skb_headlen(skb[i]): 8, skb[i]->head_frag = 1
+	}
+
+	/* setup shinfo */
+	skb_shinfo(skb[0])->gso_size = 1448;
+	skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
+	skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
+	skb_shinfo(skb[0])->gso_segs = 0;
+	skb_shinfo(skb[0])->frag_list = skb[1];
+
+	/* adjust skb[0]'s len */
+	skb[0]->len += skb[1]->len;
+	skb[0]->data_len += skb[1]->data_len;
+	skb[0]->truesize += skb[1]->truesize;
+
+	return skb[0];
+}
+
+static __init int test_skb_segment(void)
+{
+	netdev_features_t features;
+	struct sk_buff *skb;
+	void *page;
+	int ret = -1;
+
+	page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page) {
+		pr_info("%s: failed to get_free_page!", __func__);
+		return ret;
+	}
+
+	features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
+	features |= NETIF_F_RXCSUM;
+	skb = build_test_skb(page);
+	if (!skb) {
+		pr_info("%s: failed to build_test_skb", __func__);
+	} else if (skb_segment(skb, features)) {
+		ret = 0;
+		pr_info("%s: success in skb_segment!", __func__);
+	} else {
+		pr_info("%s: failed in skb_segment!", __func__);
+	}
+	free_page((unsigned long)page);
+	return ret;
+}
+
 static __init int test_bpf(void)
 {
 	int i, err_cnt = 0, pass_cnt = 0;
@@ -6632,8 +6698,11 @@ static int __init test_bpf_init(void)
 		return ret;
 
 	ret = test_bpf();
-
 	destroy_bpf_tests();
+	if (ret)
+		return ret;
+
+	ret = test_skb_segment();
 	return ret;
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v3 1/2] net: permit skb_segment on head_frag frag_list skb
  2018-03-20 23:21 ` [PATCH net-next v3 1/2] " Yonghong Song
@ 2018-03-20 23:50   ` Alexander Duyck
  2018-03-21  5:02     ` Yonghong Song
  0 siblings, 1 reply; 9+ messages in thread
From: Alexander Duyck @ 2018-03-20 23:50 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Eric Dumazet, ast, Daniel Borkmann, diptanu, Netdev, Kernel Team

On Tue, Mar 20, 2018 at 4:21 PM, Yonghong Song <yhs@fb.com> wrote:
> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
> function skb_segment(), line 3667. The bpf program attaches to
> clsact ingress, calls bpf_skb_change_proto to change protocol
> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
> to send the changed packet out.
>
> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
> 3473                             netdev_features_t features)
> 3474 {
> 3475         struct sk_buff *segs = NULL;
> 3476         struct sk_buff *tail = NULL;
> ...
> 3665                 while (pos < offset + len) {
> 3666                         if (i >= nfrags) {
> 3667                                 BUG_ON(skb_headlen(list_skb));
> 3668
> 3669                                 i = 0;
> 3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
> 3671                                 frag = skb_shinfo(list_skb)->frags;
> 3672                                 frag_skb = list_skb;
> ...
>
> call stack:
> ...
>  #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>  #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>  #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>  #4 [ffff883ffef03668] die at ffffffff8101deb2
>  #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>  #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>  #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>  #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>     [exception RIP: skb_segment+3044]
>     RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>     RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>     RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>     RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>     R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>     R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
> --- <IRQ stack> ---
> ...
>
> The triggering input skb has the following properties:
>     list_skb = skb->frag_list;
>     skb->nfrags != NULL && skb_headlen(list_skb) != 0
> and skb_segment() is not able to handle a frag_list skb
> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>
> This patch addressed the issue by handling skb_headlen(list_skb) != 0
> case properly if list_skb->head_frag is true, which is expected in
> most cases. The head frag is processed before list_skb->frags
> are processed.
>
> Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  net/core/skbuff.c | 51 +++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 37 insertions(+), 14 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 715c134..59bbc06 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>         struct sk_buff *segs = NULL;
>         struct sk_buff *tail = NULL;
>         struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
> -       skb_frag_t *frag = skb_shinfo(head_skb)->frags;
> +       skb_frag_t *frag = skb_shinfo(head_skb)->frags, *head_frag = NULL;

I think you misunderstood me. I wasn't saying you allocate head_frag.
I was saying you could move the declaration down.

>         unsigned int mss = skb_shinfo(head_skb)->gso_size;
>         unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
>         struct sk_buff *frag_skb = head_skb;
> @@ -3664,19 +3664,39 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>
>                 while (pos < offset + len) {

So right here in the loop you could add a "skb_frag_t head_frag;" just
so we declare it here and save ourselves the stack space.

>                         if (i >= nfrags) {
> -                               BUG_ON(skb_headlen(list_skb));
> -
>                                 i = 0;
> +                               if (skb_headlen(list_skb)) {
> +                                       struct page *page;
> +
> +                                       BUG_ON(!list_skb->head_frag);
> +
> +                                       page = virt_to_head_page(list_skb->head);
> +                                       if (!head_frag) {
> +                                               head_frag = kmalloc(sizeof(skb_frag_t),
> +                                                                   GFP_KERNEL);
> +                                               if (!head_frag)
> +                                                       goto err;
> +                                       }

Please no memory allocation. I just meant you could allocate it on the
stack later.

> +                                       head_frag->page.p = page;
> +                                       head_frag->page_offset = list_skb->data -
> +                                               (unsigned char *)page_address(page);
> +                                       head_frag->size = skb_headlen(list_skb);
> +                                       /* set i = -1 so we will pick head_frag
> +                                        * instead of skb_shinfo(list_skb)->frags
> +                                        * when i == -1.
> +                                        */
> +                                       i = -1;
> +                               }

So it took me a bit to pick up on the fact that line below wasn't
removed. So we are basically trying to do this all in one pass now. Do
I have that right?

One thing you could look at doing to save yourself the extra "if"
later would be to pull frag pointer before you go through skb_headlen
check above. Then if you are going to use a head_frag you could just
do a "i--; frag--;" combination just to rewind and make the room for
the increment to come later. That way you don't have an invalid frag
pointer floating around. That way you only have to do this once
instead of having to do a conditional check per fragment.

>                                 nfrags = skb_shinfo(list_skb)->nr_frags;
> -                               frag = skb_shinfo(list_skb)->frags;

This patch might be more readable if you were to just insert the
skb_headlen() bits down here and left the i=0 through frag = .. in one
piece.

> -                               frag_skb = list_skb;
> -
> -                               BUG_ON(!nfrags);
> -
> -                               if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
> -                                   skb_zerocopy_clone(nskb, frag_skb,
> -                                                      GFP_ATOMIC))
> -                                       goto err;
> +                               if (nfrags) {
> +                                       frag = skb_shinfo(list_skb)->frags;
> +                                       frag_skb = list_skb;
> +
> +                                       if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
> +                                           skb_zerocopy_clone(nskb, frag_skb,
> +                                                              GFP_ATOMIC))
> +                                               goto err;
> +                               }
>
>                                 list_skb = list_skb->next;
>                         }
> @@ -3689,7 +3709,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>                                 goto err;
>                         }
>
> -                       *nskb_frag = *frag;
> +                       *nskb_frag = (i == -1) ? *head_frag : *frag;

So this would be better as "*nskb_frag = (i < 0) ? head_frag : *frag;".

>                         __skb_frag_ref(nskb_frag);
>                         size = skb_frag_size(nskb_frag);
>
> @@ -3702,7 +3722,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>
>                         if (pos + size <= offset + len) {
>                                 i++;
> -                               frag++;
> +                               if (i != 0)
> +                                       frag++;
>                                 pos += size;
>                         } else {
>                                 skb_frag_size_sub(nskb_frag, pos + size - (offset + len));
> @@ -3774,10 +3795,12 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>                 swap(tail->destructor, head_skb->destructor);
>                 swap(tail->sk, head_skb->sk);
>         }
> +       kfree(head_frag);
>         return segs;
>
>  err:
>         kfree_skb_list(segs);
> +       kfree(head_frag);
>         return ERR_PTR(err);
>  }
>  EXPORT_SYMBOL_GPL(skb_segment);
> --
> 2.9.5
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v3 2/2] net: bpf: add a test for skb_segment in test_bpf module
  2018-03-20 23:21 ` [PATCH net-next v3 2/2] net: bpf: add a test for skb_segment in test_bpf module Yonghong Song
@ 2018-03-21  0:44   ` Eric Dumazet
  2018-03-21  5:15     ` Yonghong Song
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2018-03-21  0:44 UTC (permalink / raw)
  To: Yonghong Song, edumazet, ast, daniel, diptanu, netdev,
	alexander.duyck
  Cc: kernel-team



On 03/20/2018 04:21 PM, Yonghong Song wrote:
> Without the previous commit,
> "modprobe test_bpf" will have the following errors:
> ...
> [   98.149165] ------------[ cut here ]------------
> [   98.159362] kernel BUG at net/core/skbuff.c:3667!
> [   98.169756] invalid opcode: 0000 [#1] SMP PTI
> [   98.179370] Modules linked in:
> [   98.179371]  test_bpf(+)
> ...
> which triggers the bug the previous commit intends to fix.
> 
> The skbs are constructed to mimic what mlx5 may generate.
> The packet size/header may not mimic real cases in production. But
> the processing flow is similar.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  lib/test_bpf.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 70 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/test_bpf.c b/lib/test_bpf.c
> index 2efb213..045d7d3 100644
> --- a/lib/test_bpf.c
> +++ b/lib/test_bpf.c
> @@ -6574,6 +6574,72 @@ static bool exclude_test(int test_id)
>  	return test_id < test_range[0] || test_id > test_range[1];
>  }
>  
> +static struct sk_buff *build_test_skb(void *page)
> +{
> +	u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
> +	struct sk_buff *skb[2];
> +	int i, data_size = 8;
> +
> +	for (i = 0; i < 2; i++) {
> +		/* this will set skb[i]->head_frag */
> +		skb[i] = build_skb(page, headroom);
> +		if (!skb[i])
> +			return NULL;

You are using the same virtual address (page) for both skb ?

So we have 2 skbs having skb->head pointing to the same location ?

This is illegal.

Please use instead : skb = dev_alloc_skb(headroom + data_size)

> +
> +		skb_reserve(skb[i], headroom);
> +		skb_put(skb[i], data_size);
> +		skb[i]->protocol = htons(ETH_P_IP);
> +		skb_reset_network_header(skb[i]);
> +		skb_set_mac_header(skb[i], -ETH_HLEN);
> +
> +		skb_add_rx_frag(skb[i], 

skb_shinfo(skb[i])->nr_frags,

0 ?

> +				page, 0, 64, 64);

get_page(page) ?

> +		// skb: skb_headlen(skb[i]): 8, skb[i]->head_frag = 1
> +	}
> +
> +	/* setup shinfo */
> +	skb_shinfo(skb[0])->gso_size = 1448;
> +	skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
> +	skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
> +	skb_shinfo(skb[0])->gso_segs = 0;
> +	skb_shinfo(skb[0])->frag_list = skb[1];
> +
> +	/* adjust skb[0]'s len */
> +	skb[0]->len += skb[1]->len;
> +	skb[0]->data_len += skb[1]->data_len;
> +	skb[0]->truesize += skb[1]->truesize;
> +
> +	return skb[0];
> +}
> +
> +static __init int test_skb_segment(void)
> +{
> +	netdev_features_t features;
> +	struct sk_buff *skb;
> +	void *page;
> +	int ret = -1;
> +
> +	page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page) {
> +		pr_info("%s: failed to get_free_page!", __func__);
> +		return ret;
> +	}
> +
> +	features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
> +	features |= NETIF_F_RXCSUM;
> +	skb = build_test_skb(page);
> +	if (!skb) {
> +		pr_info("%s: failed to build_test_skb", __func__);
> +	} else if (skb_segment(skb, features)) {
> +		ret = 0;
> +		pr_info("%s: success in skb_segment!", __func__);
> +	} else {
> +		pr_info("%s: failed in skb_segment!", __func__);
> +	}
> +	free_page((unsigned long)page);


Where are the skbs freed ?


> +	return ret;
> +}
> +
>  static __init int test_bpf(void)
>  {
>  	int i, err_cnt = 0, pass_cnt = 0;
> @@ -6632,8 +6698,11 @@ static int __init test_bpf_init(void)
>  		return ret;
>  
>  	ret = test_bpf();
> -
>  	destroy_bpf_tests();
> +	if (ret)
> +		return ret;
> +
> +	ret = test_skb_segment();
>  	return ret;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v3 1/2] net: permit skb_segment on head_frag frag_list skb
  2018-03-20 23:50   ` Alexander Duyck
@ 2018-03-21  5:02     ` Yonghong Song
  2018-03-21 14:59       ` Alexander Duyck
  0 siblings, 1 reply; 9+ messages in thread
From: Yonghong Song @ 2018-03-21  5:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, ast, Daniel Borkmann, diptanu, Netdev, Kernel Team



On 3/20/18 4:50 PM, Alexander Duyck wrote:
> On Tue, Mar 20, 2018 at 4:21 PM, Yonghong Song <yhs@fb.com> wrote:
>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>> function skb_segment(), line 3667. The bpf program attaches to
>> clsact ingress, calls bpf_skb_change_proto to change protocol
>> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
>> to send the changed packet out.
>>
>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> 3473                             netdev_features_t features)
>> 3474 {
>> 3475         struct sk_buff *segs = NULL;
>> 3476         struct sk_buff *tail = NULL;
>> ...
>> 3665                 while (pos < offset + len) {
>> 3666                         if (i >= nfrags) {
>> 3667                                 BUG_ON(skb_headlen(list_skb));
>> 3668
>> 3669                                 i = 0;
>> 3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>> 3672                                 frag_skb = list_skb;
>> ...
>>
>> call stack:
>> ...
>>   #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>   #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>   #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>   #4 [ffff883ffef03668] die at ffffffff8101deb2
>>   #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>   #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>   #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>   #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>      [exception RIP: skb_segment+3044]
>>      RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>      RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>      RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>      RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>      R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>      R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>   #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>> --- <IRQ stack> ---
>> ...
>>
>> The triggering input skb has the following properties:
>>      list_skb = skb->frag_list;
>>      skb->nfrags != NULL && skb_headlen(list_skb) != 0
>> and skb_segment() is not able to handle a frag_list skb
>> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>>
>> This patch addressed the issue by handling skb_headlen(list_skb) != 0
>> case properly if list_skb->head_frag is true, which is expected in
>> most cases. The head frag is processed before list_skb->frags
>> are processed.
>>
>> Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   net/core/skbuff.c | 51 +++++++++++++++++++++++++++++++++++++--------------
>>   1 file changed, 37 insertions(+), 14 deletions(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 715c134..59bbc06 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>          struct sk_buff *segs = NULL;
>>          struct sk_buff *tail = NULL;
>>          struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
>> -       skb_frag_t *frag = skb_shinfo(head_skb)->frags;
>> +       skb_frag_t *frag = skb_shinfo(head_skb)->frags, *head_frag = NULL;
> 
> I think you misunderstood me. I wasn't saying you allocate head_frag.
> I was saying you could move the declaration down.

Sorry for my misunderstanding. I did understand your intention of moving
the declaration down in order to save stack space. I thought that we 
cannot really move declaration down (although it works in C, but 
semantically it is not quite right, more later), so I moved on to
use runtime allocation. But indeed skb_frag_t is not big (16 bytes), it 
could live on the stack.

> 
>>          unsigned int mss = skb_shinfo(head_skb)->gso_size;
>>          unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
>>          struct sk_buff *frag_skb = head_skb;
>> @@ -3664,19 +3664,39 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>
>>                  while (pos < offset + len) {
> 
> So right here in the loop you could add a "skb_frag_t head_frag;" just
> so we declare it here and save ourselves the stack space.

I actually tried to move "skb_frag_t head_frag". The stack size remains 
the same, 0xc0. This is related to how C compiler allocates stack space.
The declaration place won't decide the stack size as long as the 
declaration dictates the usage. The stack size is really determined by 
liveness analysis.

Further, we have code like:
	do {
            ....
            while (pos < offset + len) {
                if (i >= nfrags) {
                    ...
                    head_frag = ...
                }
                ... = head_frag; // head_frag access guaranteed after
                                 // above definition, but it may not
                                 // be in the same outer do-while loop.
            }
            ...
          } while (((offset += len) < head_skb->len);

So the use of head_frag maybe in different outer loop iterations.
So I feel the definition of head_frag should be outside the
outer do-while loop, which is the main function scope. I will add some
comments here.


> 
>>                          if (i >= nfrags) {
>> -                               BUG_ON(skb_headlen(list_skb));
>> -
>>                                  i = 0;
>> +                               if (skb_headlen(list_skb)) {
>> +                                       struct page *page;
>> +
>> +                                       BUG_ON(!list_skb->head_frag);
>> +
>> +                                       page = virt_to_head_page(list_skb->head);
>> +                                       if (!head_frag) {
>> +                                               head_frag = kmalloc(sizeof(skb_frag_t),
>> +                                                                   GFP_KERNEL);
>> +                                               if (!head_frag)
>> +                                                       goto err;
>> +                                       }
> 
> Please no memory allocation. I just meant you could allocate it on the
> stack later.
> 
>> +                                       head_frag->page.p = page;
>> +                                       head_frag->page_offset = list_skb->data -
>> +                                               (unsigned char *)page_address(page);
>> +                                       head_frag->size = skb_headlen(list_skb);
>> +                                       /* set i = -1 so we will pick head_frag
>> +                                        * instead of skb_shinfo(list_skb)->frags
>> +                                        * when i == -1.
>> +                                        */
>> +                                       i = -1;
>> +                               }
> 
> So it took me a bit to pick up on the fact that line below wasn't
> removed. So we are basically trying to do this all in one pass now. Do
> I have that right?
> 
> One thing you could look at doing to save yourself the extra "if"
> later would be to pull frag pointer before you go through skb_headlen
> check above. Then if you are going to use a head_frag you could just
> do a "i--; frag--;" combination just to rewind and make the room for
> the increment to come later. That way you don't have an invalid frag
> pointer floating around. That way you only have to do this once
> instead of having to do a conditional check per fragment.

Right. This indeed make code more cleaner.

> 
>>                                  nfrags = skb_shinfo(list_skb)->nr_frags;
>> -                               frag = skb_shinfo(list_skb)->frags;
> 
> This patch might be more readable if you were to just insert the
> skb_headlen() bits down here and left the i=0 through frag = .. in one
> piece.

Right. Will implement as suggested.

> 
>> -                               frag_skb = list_skb;
>> -
>> -                               BUG_ON(!nfrags);
>> -
>> -                               if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
>> -                                   skb_zerocopy_clone(nskb, frag_skb,
>> -                                                      GFP_ATOMIC))
>> -                                       goto err;
>> +                               if (nfrags) {
>> +                                       frag = skb_shinfo(list_skb)->frags;
>> +                                       frag_skb = list_skb;
>> +
>> +                                       if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
>> +                                           skb_zerocopy_clone(nskb, frag_skb,
>> +                                                              GFP_ATOMIC))
>> +                                               goto err;
>> +                               }
>>
>>                                  list_skb = list_skb->next;
>>                          }
>> @@ -3689,7 +3709,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>                                  goto err;
>>                          }
>>
>> -                       *nskb_frag = *frag;
>> +                       *nskb_frag = (i == -1) ? *head_frag : *frag;
> 
> So this would be better as "*nskb_frag = (i < 0) ? head_frag : *frag;".

Good suggestion. Will implement as suggested.

> 
>>                          __skb_frag_ref(nskb_frag);
>>                          size = skb_frag_size(nskb_frag);
>>
>> @@ -3702,7 +3722,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>
>>                          if (pos + size <= offset + len) {
>>                                  i++;
>> -                               frag++;
>> +                               if (i != 0)
>> +                                       frag++;
>>                                  pos += size;
>>                          } else {
>>                                  skb_frag_size_sub(nskb_frag, pos + size - (offset + len));
>> @@ -3774,10 +3795,12 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>                  swap(tail->destructor, head_skb->destructor);
>>                  swap(tail->sk, head_skb->sk);
>>          }
>> +       kfree(head_frag);
>>          return segs;
>>
>>   err:
>>          kfree_skb_list(segs);
>> +       kfree(head_frag);
>>          return ERR_PTR(err);
>>   }
>>   EXPORT_SYMBOL_GPL(skb_segment);
>> --
>> 2.9.5
>>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v3 2/2] net: bpf: add a test for skb_segment in test_bpf module
  2018-03-21  0:44   ` Eric Dumazet
@ 2018-03-21  5:15     ` Yonghong Song
  0 siblings, 0 replies; 9+ messages in thread
From: Yonghong Song @ 2018-03-21  5:15 UTC (permalink / raw)
  To: Eric Dumazet, edumazet, ast, daniel, diptanu, netdev,
	alexander.duyck
  Cc: kernel-team



On 3/20/18 5:44 PM, Eric Dumazet wrote:
> 
> 
> On 03/20/2018 04:21 PM, Yonghong Song wrote:
>> Without the previous commit,
>> "modprobe test_bpf" will have the following errors:
>> ...
>> [   98.149165] ------------[ cut here ]------------
>> [   98.159362] kernel BUG at net/core/skbuff.c:3667!
>> [   98.169756] invalid opcode: 0000 [#1] SMP PTI
>> [   98.179370] Modules linked in:
>> [   98.179371]  test_bpf(+)
>> ...
>> which triggers the bug the previous commit intends to fix.
>>
>> The skbs are constructed to mimic what mlx5 may generate.
>> The packet size/header may not mimic real cases in production. But
>> the processing flow is similar.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   lib/test_bpf.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 70 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/test_bpf.c b/lib/test_bpf.c
>> index 2efb213..045d7d3 100644
>> --- a/lib/test_bpf.c
>> +++ b/lib/test_bpf.c
>> @@ -6574,6 +6574,72 @@ static bool exclude_test(int test_id)
>>   	return test_id < test_range[0] || test_id > test_range[1];
>>   }
>>   
>> +static struct sk_buff *build_test_skb(void *page)
>> +{
>> +	u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
>> +	struct sk_buff *skb[2];
>> +	int i, data_size = 8;
>> +
>> +	for (i = 0; i < 2; i++) {
>> +		/* this will set skb[i]->head_frag */
>> +		skb[i] = build_skb(page, headroom);
>> +		if (!skb[i])
>> +			return NULL;
> 
> You are using the same virtual address (page) for both skb ?
> 
> So we have 2 skbs having skb->head pointing to the same location ?

Thanks, Eric. This is purely due to my 'laziness' to make it work as I 
know that skb_segment does not really enforce this. I will address
all of your comments in the next revision.

> 
> This is illegal.
> 
> Please use instead : skb = dev_alloc_skb(headroom + data_size)
> 
>> +
>> +		skb_reserve(skb[i], headroom);
>> +		skb_put(skb[i], data_size);
>> +		skb[i]->protocol = htons(ETH_P_IP);
>> +		skb_reset_network_header(skb[i]);
>> +		skb_set_mac_header(skb[i], -ETH_HLEN);
>> +
>> +		skb_add_rx_frag(skb[i],
> 
> skb_shinfo(skb[i])->nr_frags,
> 
> 0 ?
> 
>> +				page, 0, 64, 64);
> 
> get_page(page) ?
> 
>> +		// skb: skb_headlen(skb[i]): 8, skb[i]->head_frag = 1
>> +	}
>> +
>> +	/* setup shinfo */
>> +	skb_shinfo(skb[0])->gso_size = 1448;
>> +	skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
>> +	skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
>> +	skb_shinfo(skb[0])->gso_segs = 0;
>> +	skb_shinfo(skb[0])->frag_list = skb[1];
>> +
>> +	/* adjust skb[0]'s len */
>> +	skb[0]->len += skb[1]->len;
>> +	skb[0]->data_len += skb[1]->data_len;
>> +	skb[0]->truesize += skb[1]->truesize;
>> +
>> +	return skb[0];
>> +}
>> +
>> +static __init int test_skb_segment(void)
>> +{
>> +	netdev_features_t features;
>> +	struct sk_buff *skb;
>> +	void *page;
>> +	int ret = -1;
>> +
>> +	page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
>> +	if (!page) {
>> +		pr_info("%s: failed to get_free_page!", __func__);
>> +		return ret;
>> +	}
>> +
>> +	features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
>> +	features |= NETIF_F_RXCSUM;
>> +	skb = build_test_skb(page);
>> +	if (!skb) {
>> +		pr_info("%s: failed to build_test_skb", __func__);
>> +	} else if (skb_segment(skb, features)) {
>> +		ret = 0;
>> +		pr_info("%s: success in skb_segment!", __func__);
>> +	} else {
>> +		pr_info("%s: failed in skb_segment!", __func__);
>> +	}
>> +	free_page((unsigned long)page);
> 
> 
> Where are the skbs freed ?
> 
> 
>> +	return ret;
>> +}
>> +
>>   static __init int test_bpf(void)
>>   {
>>   	int i, err_cnt = 0, pass_cnt = 0;
>> @@ -6632,8 +6698,11 @@ static int __init test_bpf_init(void)
>>   		return ret;
>>   
>>   	ret = test_bpf();
>> -
>>   	destroy_bpf_tests();
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = test_skb_segment();
>>   	return ret;
>>   }
>>   
>>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v3 1/2] net: permit skb_segment on head_frag frag_list skb
  2018-03-21  5:02     ` Yonghong Song
@ 2018-03-21 14:59       ` Alexander Duyck
  2018-03-21 20:10         ` Yonghong Song
  0 siblings, 1 reply; 9+ messages in thread
From: Alexander Duyck @ 2018-03-21 14:59 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Eric Dumazet, ast, Daniel Borkmann, diptanu, Netdev, Kernel Team

On Tue, Mar 20, 2018 at 10:02 PM, Yonghong Song <yhs@fb.com> wrote:
>
>
> On 3/20/18 4:50 PM, Alexander Duyck wrote:
>>
>> On Tue, Mar 20, 2018 at 4:21 PM, Yonghong Song <yhs@fb.com> wrote:
>>>
>>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>>> function skb_segment(), line 3667. The bpf program attaches to
>>> clsact ingress, calls bpf_skb_change_proto to change protocol
>>> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
>>> to send the changed packet out.
>>>
>>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>> 3473                             netdev_features_t features)
>>> 3474 {
>>> 3475         struct sk_buff *segs = NULL;
>>> 3476         struct sk_buff *tail = NULL;
>>> ...
>>> 3665                 while (pos < offset + len) {
>>> 3666                         if (i >= nfrags) {
>>> 3667                                 BUG_ON(skb_headlen(list_skb));
>>> 3668
>>> 3669                                 i = 0;
>>> 3670                                 nfrags =
>>> skb_shinfo(list_skb)->nr_frags;
>>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>>> 3672                                 frag_skb = list_skb;
>>> ...
>>>
>>> call stack:
>>> ...
>>>   #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>>   #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>>   #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>>   #4 [ffff883ffef03668] die at ffffffff8101deb2
>>>   #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>>   #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>>   #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>>   #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>>      [exception RIP: skb_segment+3044]
>>>      RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>>      RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>>      RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>>      RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>>      R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>>      R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>   #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>>> --- <IRQ stack> ---
>>> ...
>>>
>>> The triggering input skb has the following properties:
>>>      list_skb = skb->frag_list;
>>>      skb->nfrags != NULL && skb_headlen(list_skb) != 0
>>> and skb_segment() is not able to handle a frag_list skb
>>> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>>>
>>> This patch addressed the issue by handling skb_headlen(list_skb) != 0
>>> case properly if list_skb->head_frag is true, which is expected in
>>> most cases. The head frag is processed before list_skb->frags
>>> are processed.
>>>
>>> Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>> ---
>>>   net/core/skbuff.c | 51
>>> +++++++++++++++++++++++++++++++++++++--------------
>>>   1 file changed, 37 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index 715c134..59bbc06 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff
>>> *head_skb,
>>>          struct sk_buff *segs = NULL;
>>>          struct sk_buff *tail = NULL;
>>>          struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
>>> -       skb_frag_t *frag = skb_shinfo(head_skb)->frags;
>>> +       skb_frag_t *frag = skb_shinfo(head_skb)->frags, *head_frag =
>>> NULL;
>>
>>
>> I think you misunderstood me. I wasn't saying you allocate head_frag.
>> I was saying you could move the declaration down.
>
>
> Sorry for my misunderstanding. I did understand your intention of moving
> the declaration down in order to save stack space. I thought that we cannot
> really move declaration down (although it works in C, but semantically it is
> not quite right, more later), so I moved on to
> use runtime allocation. But indeed skb_frag_t is not big (16 bytes), it
> could live on the stack.
>
>>
>>>          unsigned int mss = skb_shinfo(head_skb)->gso_size;
>>>          unsigned int doffset = head_skb->data -
>>> skb_mac_header(head_skb);
>>>          struct sk_buff *frag_skb = head_skb;
>>> @@ -3664,19 +3664,39 @@ struct sk_buff *skb_segment(struct sk_buff
>>> *head_skb,
>>>
>>>                  while (pos < offset + len) {
>>
>>
>> So right here in the loop you could add a "skb_frag_t head_frag;" just
>> so we declare it here and save ourselves the stack space.
>
>
> I actually tried to move "skb_frag_t head_frag". The stack size remains the
> same, 0xc0. This is related to how C compiler allocates stack space.
> The declaration place won't decide the stack size as long as the declaration
> dictates the usage. The stack size is really determined by liveness
> analysis.
>
> Further, we have code like:
>         do {
>            ....
>            while (pos < offset + len) {
>                if (i >= nfrags) {
>                    ...
>                    head_frag = ...
>                }
>                ... = head_frag; // head_frag access guaranteed after
>                                 // above definition, but it may not
>                                 // be in the same outer do-while loop.
>            }
>            ...
>          } while (((offset += len) < head_skb->len);
>
> So the use of head_frag maybe in different outer loop iterations.
> So I feel the definition of head_frag should be outside the
> outer do-while loop, which is the main function scope. I will add some
> comments here.

So the point I had is that head_frag doesn't need to live that long.
All you are doing is arranging the data so that you can essentially
just dump it into *nskb_frag.

One alternative you could look at would be to rearrange the code so
that the MAX_SKB_FRAGS check occurs first, then perform the check for
(i >= nfrags). Doing it that way we end up bypassing the need for
head_frag entirely as you could just populate *nskb_frag directly. You
could probably just add an inline structure that would convert the
head frag into a skb_frag_t structure and return that. Something along
the lines of:
        *nskb_frag = (i < 0) ? skb_head_frag_to_page_desc(frag_skb) : *frag;

Doing it that way you could pull out the bits below that were
populating head frag and just have it all handled as an inline
function. An added advantage is that it should make the line wrapping
easier to deal with. :-)

>>
>>>                          if (i >= nfrags) {
>>> -                               BUG_ON(skb_headlen(list_skb));
>>> -
>>>                                  i = 0;
>>> +                               if (skb_headlen(list_skb)) {
>>> +                                       struct page *page;
>>> +
>>> +                                       BUG_ON(!list_skb->head_frag);
>>> +
>>> +                                       page =
>>> virt_to_head_page(list_skb->head);
>>> +                                       if (!head_frag) {
>>> +                                               head_frag =
>>> kmalloc(sizeof(skb_frag_t),
>>> +
>>> GFP_KERNEL);
>>> +                                               if (!head_frag)
>>> +                                                       goto err;
>>> +                                       }
>>
>>
>> Please no memory allocation. I just meant you could allocate it on the
>> stack later.
>>
>>> +                                       head_frag->page.p = page;
>>> +                                       head_frag->page_offset =
>>> list_skb->data -
>>> +                                               (unsigned char
>>> *)page_address(page);
>>> +                                       head_frag->size =
>>> skb_headlen(list_skb);
>>> +                                       /* set i = -1 so we will pick
>>> head_frag
>>> +                                        * instead of
>>> skb_shinfo(list_skb)->frags
>>> +                                        * when i == -1.
>>> +                                        */
>>> +                                       i = -1;
>>> +                               }
>>
>>
>> So it took me a bit to pick up on the fact that line below wasn't
>> removed. So we are basically trying to do this all in one pass now. Do
>> I have that right?
>>
>> One thing you could look at doing to save yourself the extra "if"
>> later would be to pull frag pointer before you go through skb_headlen
>> check above. Then if you are going to use a head_frag you could just
>> do a "i--; frag--;" combination just to rewind and make the room for
>> the increment to come later. That way you don't have an invalid frag
>> pointer floating around. That way you only have to do this once
>> instead of having to do a conditional check per fragment.
>
>
> Right. This indeed make code more cleaner.
>
>>
>>>                                  nfrags = skb_shinfo(list_skb)->nr_frags;
>>> -                               frag = skb_shinfo(list_skb)->frags;
>>
>>
>> This patch might be more readable if you were to just insert the
>> skb_headlen() bits down here and left the i=0 through frag = .. in one
>> piece.
>
>
> Right. Will implement as suggested.
>
>>
>>> -                               frag_skb = list_skb;
>>> -
>>> -                               BUG_ON(!nfrags);
>>> -
>>> -                               if (skb_orphan_frags(frag_skb,
>>> GFP_ATOMIC) ||
>>> -                                   skb_zerocopy_clone(nskb, frag_skb,
>>> -                                                      GFP_ATOMIC))
>>> -                                       goto err;
>>> +                               if (nfrags) {
>>> +                                       frag =
>>> skb_shinfo(list_skb)->frags;
>>> +                                       frag_skb = list_skb;
>>> +
>>> +                                       if (skb_orphan_frags(frag_skb,
>>> GFP_ATOMIC) ||
>>> +                                           skb_zerocopy_clone(nskb,
>>> frag_skb,
>>> +
>>> GFP_ATOMIC))
>>> +                                               goto err;
>>> +                               }
>>>
>>>                                  list_skb = list_skb->next;
>>>                          }
>>> @@ -3689,7 +3709,7 @@ struct sk_buff *skb_segment(struct sk_buff
>>> *head_skb,
>>>                                  goto err;
>>>                          }
>>>
>>> -                       *nskb_frag = *frag;
>>> +                       *nskb_frag = (i == -1) ? *head_frag : *frag;
>>
>>
>> So this would be better as "*nskb_frag = (i < 0) ? head_frag : *frag;".
>
>
> Good suggestion. Will implement as suggested.
>
>
>>
>>>                          __skb_frag_ref(nskb_frag);
>>>                          size = skb_frag_size(nskb_frag);
>>>
>>> @@ -3702,7 +3722,8 @@ struct sk_buff *skb_segment(struct sk_buff
>>> *head_skb,
>>>
>>>                          if (pos + size <= offset + len) {
>>>                                  i++;
>>> -                               frag++;
>>> +                               if (i != 0)
>>> +                                       frag++;
>>>                                  pos += size;
>>>                          } else {
>>>                                  skb_frag_size_sub(nskb_frag, pos + size
>>> - (offset + len));
>>> @@ -3774,10 +3795,12 @@ struct sk_buff *skb_segment(struct sk_buff
>>> *head_skb,
>>>                  swap(tail->destructor, head_skb->destructor);
>>>                  swap(tail->sk, head_skb->sk);
>>>          }
>>> +       kfree(head_frag);
>>>          return segs;
>>>
>>>   err:
>>>          kfree_skb_list(segs);
>>> +       kfree(head_frag);
>>>          return ERR_PTR(err);
>>>   }
>>>   EXPORT_SYMBOL_GPL(skb_segment);
>>> --
>>> 2.9.5
>>>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v3 1/2] net: permit skb_segment on head_frag frag_list skb
  2018-03-21 14:59       ` Alexander Duyck
@ 2018-03-21 20:10         ` Yonghong Song
  0 siblings, 0 replies; 9+ messages in thread
From: Yonghong Song @ 2018-03-21 20:10 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, ast, Daniel Borkmann, diptanu, Netdev, Kernel Team



On 3/21/18 7:59 AM, Alexander Duyck wrote:
> On Tue, Mar 20, 2018 at 10:02 PM, Yonghong Song <yhs@fb.com> wrote:
>>
>>
>> On 3/20/18 4:50 PM, Alexander Duyck wrote:
>>>
>>> On Tue, Mar 20, 2018 at 4:21 PM, Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>>>> function skb_segment(), line 3667. The bpf program attaches to
>>>> clsact ingress, calls bpf_skb_change_proto to change protocol
>>>> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
>>>> to send the changed packet out.
>>>>
>>>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>> 3473                             netdev_features_t features)
>>>> 3474 {
>>>> 3475         struct sk_buff *segs = NULL;
>>>> 3476         struct sk_buff *tail = NULL;
>>>> ...
>>>> 3665                 while (pos < offset + len) {
>>>> 3666                         if (i >= nfrags) {
>>>> 3667                                 BUG_ON(skb_headlen(list_skb));
>>>> 3668
>>>> 3669                                 i = 0;
>>>> 3670                                 nfrags =
>>>> skb_shinfo(list_skb)->nr_frags;
>>>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>>>> 3672                                 frag_skb = list_skb;
>>>> ...
>>>>
>>>> call stack:
>>>> ...
>>>>    #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>>>    #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>>>    #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>>>    #4 [ffff883ffef03668] die at ffffffff8101deb2
>>>>    #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>>>    #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>>>    #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>>>    #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>>>       [exception RIP: skb_segment+3044]
>>>>       RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>>>       RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>>>       RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>>>       RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>>>       R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>>>       R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>>>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>>    #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>>>> --- <IRQ stack> ---
>>>> ...
>>>>
>>>> The triggering input skb has the following properties:
>>>>       list_skb = skb->frag_list;
>>>>       skb->nfrags != NULL && skb_headlen(list_skb) != 0
>>>> and skb_segment() is not able to handle a frag_list skb
>>>> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>>>>
>>>> This patch addressed the issue by handling skb_headlen(list_skb) != 0
>>>> case properly if list_skb->head_frag is true, which is expected in
>>>> most cases. The head frag is processed before list_skb->frags
>>>> are processed.
>>>>
>>>> Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>> ---
>>>>    net/core/skbuff.c | 51
>>>> +++++++++++++++++++++++++++++++++++++--------------
>>>>    1 file changed, 37 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index 715c134..59bbc06 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff
>>>> *head_skb,
>>>>           struct sk_buff *segs = NULL;
>>>>           struct sk_buff *tail = NULL;
>>>>           struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
>>>> -       skb_frag_t *frag = skb_shinfo(head_skb)->frags;
>>>> +       skb_frag_t *frag = skb_shinfo(head_skb)->frags, *head_frag =
>>>> NULL;
>>>
>>>
>>> I think you misunderstood me. I wasn't saying you allocate head_frag.
>>> I was saying you could move the declaration down.
>>
>>
>> Sorry for my misunderstanding. I did understand your intention of moving
>> the declaration down in order to save stack space. I thought that we cannot
>> really move declaration down (although it works in C, but semantically it is
>> not quite right, more later), so I moved on to
>> use runtime allocation. But indeed skb_frag_t is not big (16 bytes), it
>> could live on the stack.
>>
>>>
>>>>           unsigned int mss = skb_shinfo(head_skb)->gso_size;
>>>>           unsigned int doffset = head_skb->data -
>>>> skb_mac_header(head_skb);
>>>>           struct sk_buff *frag_skb = head_skb;
>>>> @@ -3664,19 +3664,39 @@ struct sk_buff *skb_segment(struct sk_buff
>>>> *head_skb,
>>>>
>>>>                   while (pos < offset + len) {
>>>
>>>
>>> So right here in the loop you could add a "skb_frag_t head_frag;" just
>>> so we declare it here and save ourselves the stack space.
>>
>>
>> I actually tried to move "skb_frag_t head_frag". The stack size remains the
>> same, 0xc0. This is related to how C compiler allocates stack space.
>> The declaration place won't decide the stack size as long as the declaration
>> dictates the usage. The stack size is really determined by liveness
>> analysis.
>>
>> Further, we have code like:
>>          do {
>>             ....
>>             while (pos < offset + len) {
>>                 if (i >= nfrags) {
>>                     ...
>>                     head_frag = ...
>>                 }
>>                 ... = head_frag; // head_frag access guaranteed after
>>                                  // above definition, but it may not
>>                                  // be in the same outer do-while loop.
>>             }
>>             ...
>>           } while (((offset += len) < head_skb->len);
>>
>> So the use of head_frag maybe in different outer loop iterations.
>> So I feel the definition of head_frag should be outside the
>> outer do-while loop, which is the main function scope. I will add some
>> comments here.
> 
> So the point I had is that head_frag doesn't need to live that long.
> All you are doing is arranging the data so that you can essentially
> just dump it into *nskb_frag.
> 
> One alternative you could look at would be to rearrange the code so
> that the MAX_SKB_FRAGS check occurs first, then perform the check for
> (i >= nfrags). Doing it that way we end up bypassing the need for
> head_frag entirely as you could just populate *nskb_frag directly. You
> could probably just add an inline structure that would convert the
> head frag into a skb_frag_t structure and return that. Something along
> the lines of:
>          *nskb_frag = (i < 0) ? skb_head_frag_to_page_desc(frag_skb) : *frag;

This mechanism (inline function skb_head_frag_to_page_desc) works great!
I removed the local variable and the total stack size remains unchanged
with the patch.

> 
> Doing it that way you could pull out the bits below that were
> populating head frag and just have it all handled as an inline
> function. An added advantage is that it should make the line wrapping
> easier to deal with. :-)
> 
>>>
>>>>                           if (i >= nfrags) {
>>>> -                               BUG_ON(skb_headlen(list_skb));
>>>> -
>>>>                                   i = 0;
>>>> +                               if (skb_headlen(list_skb)) {
>>>> +                                       struct page *page;
>>>> +
>>>> +                                       BUG_ON(!list_skb->head_frag);
>>>> +
>>>> +                                       page =
>>>> virt_to_head_page(list_skb->head);
>>>> +                                       if (!head_frag) {
>>>> +                                               head_frag =
>>>> kmalloc(sizeof(skb_frag_t),
>>>> +
>>>> GFP_KERNEL);
>>>> +                                               if (!head_frag)
>>>> +                                                       goto err;
>>>> +                                       }
>>>
>>>
>>> Please no memory allocation. I just meant you could allocate it on the
>>> stack later.
>>>
>>>> +                                       head_frag->page.p = page;
>>>> +                                       head_frag->page_offset =
>>>> list_skb->data -
>>>> +                                               (unsigned char
>>>> *)page_address(page);
>>>> +                                       head_frag->size =
>>>> skb_headlen(list_skb);
>>>> +                                       /* set i = -1 so we will pick
>>>> head_frag
>>>> +                                        * instead of
>>>> skb_shinfo(list_skb)->frags
>>>> +                                        * when i == -1.
>>>> +                                        */
>>>> +                                       i = -1;
>>>> +                               }
>>>
>>>
>>> So it took me a bit to pick up on the fact that line below wasn't
>>> removed. So we are basically trying to do this all in one pass now. Do
>>> I have that right?
>>>
>>> One thing you could look at doing to save yourself the extra "if"
>>> later would be to pull frag pointer before you go through skb_headlen
>>> check above. Then if you are going to use a head_frag you could just
>>> do a "i--; frag--;" combination just to rewind and make the room for
>>> the increment to come later. That way you don't have an invalid frag
>>> pointer floating around. That way you only have to do this once
>>> instead of having to do a conditional check per fragment.
>>
>>
>> Right. This indeed make code more cleaner.
>>
>>>
>>>>                                   nfrags = skb_shinfo(list_skb)->nr_frags;
>>>> -                               frag = skb_shinfo(list_skb)->frags;
>>>
>>>
>>> This patch might be more readable if you were to just insert the
>>> skb_headlen() bits down here and left the i=0 through frag = .. in one
>>> piece.
>>
>>
>> Right. Will implement as suggested.
>>
>>>
>>>> -                               frag_skb = list_skb;
>>>> -
>>>> -                               BUG_ON(!nfrags);
>>>> -
>>>> -                               if (skb_orphan_frags(frag_skb,
>>>> GFP_ATOMIC) ||
>>>> -                                   skb_zerocopy_clone(nskb, frag_skb,
>>>> -                                                      GFP_ATOMIC))
>>>> -                                       goto err;
>>>> +                               if (nfrags) {
>>>> +                                       frag =
>>>> skb_shinfo(list_skb)->frags;
>>>> +                                       frag_skb = list_skb;
>>>> +
>>>> +                                       if (skb_orphan_frags(frag_skb,
>>>> GFP_ATOMIC) ||
>>>> +                                           skb_zerocopy_clone(nskb,
>>>> frag_skb,
>>>> +
>>>> GFP_ATOMIC))
>>>> +                                               goto err;
>>>> +                               }
>>>>
>>>>                                   list_skb = list_skb->next;
>>>>                           }
>>>> @@ -3689,7 +3709,7 @@ struct sk_buff *skb_segment(struct sk_buff
>>>> *head_skb,
>>>>                                   goto err;
>>>>                           }
>>>>
>>>> -                       *nskb_frag = *frag;
>>>> +                       *nskb_frag = (i == -1) ? *head_frag : *frag;
>>>
>>>
>>> So this would be better as "*nskb_frag = (i < 0) ? head_frag : *frag;".
>>
>>
>> Good suggestion. Will implement as suggested.
>>
>>
>>>
>>>>                           __skb_frag_ref(nskb_frag);
>>>>                           size = skb_frag_size(nskb_frag);
>>>>
>>>> @@ -3702,7 +3722,8 @@ struct sk_buff *skb_segment(struct sk_buff
>>>> *head_skb,
>>>>
>>>>                           if (pos + size <= offset + len) {
>>>>                                   i++;
>>>> -                               frag++;
>>>> +                               if (i != 0)
>>>> +                                       frag++;
>>>>                                   pos += size;
>>>>                           } else {
>>>>                                   skb_frag_size_sub(nskb_frag, pos + size
>>>> - (offset + len));
>>>> @@ -3774,10 +3795,12 @@ struct sk_buff *skb_segment(struct sk_buff
>>>> *head_skb,
>>>>                   swap(tail->destructor, head_skb->destructor);
>>>>                   swap(tail->sk, head_skb->sk);
>>>>           }
>>>> +       kfree(head_frag);
>>>>           return segs;
>>>>
>>>>    err:
>>>>           kfree_skb_list(segs);
>>>> +       kfree(head_frag);
>>>>           return ERR_PTR(err);
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(skb_segment);
>>>> --
>>>> 2.9.5
>>>>
>>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-03-21 20:11 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-03-20 23:21 [PATCH net-next v3 0/2] net: permit skb_segment on head_frag frag_list skb Yonghong Song
2018-03-20 23:21 ` [PATCH net-next v3 1/2] " Yonghong Song
2018-03-20 23:50   ` Alexander Duyck
2018-03-21  5:02     ` Yonghong Song
2018-03-21 14:59       ` Alexander Duyck
2018-03-21 20:10         ` Yonghong Song
2018-03-20 23:21 ` [PATCH net-next v3 2/2] net: bpf: add a test for skb_segment in test_bpf module Yonghong Song
2018-03-21  0:44   ` Eric Dumazet
2018-03-21  5:15     ` Yonghong Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).