From: Eric Dumazet <eric.dumazet@gmail.com>
To: Yonghong Song <yhs@fb.com>, Eric Dumazet <eric.dumazet@gmail.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Alexei Starovoitov <ast@fb.com>, netdev <netdev@vger.kernel.org>,
	Martin Lau <kafai@fb.com>
Subject: Re: BUG_ON triggered in skb_segment
Date: Mon, 12 Mar 2018 23:25:09 -0700
Message-ID: <c679bf38-ef11-7ba3-c24f-fa618bb0d956@gmail.com>
In-Reply-To: <d8d5a8c3-4004-47f2-9cfb-d94d0cd0d56c@fb.com>



On 03/12/2018 11:08 PM, Yonghong Song wrote:
> 
> 
> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>
>>
>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>> Hi,
>>>
>>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>>> net-next function skb_segment, line 3667.
>>>
>>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>> 3473                             netdev_features_t features)
>>> 3474 {
>>> 3475         struct sk_buff *segs = NULL;
>>> 3476         struct sk_buff *tail = NULL;
>>> ...
>>> 3665                 while (pos < offset + len) {
>>> 3666                         if (i >= nfrags) {
>>> 3667                                 BUG_ON(skb_headlen(list_skb));
>>> 3668
>>> 3669                                 i = 0;
>>> 3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
>>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>>> 3672                                 frag_skb = list_skb;
>>> ...
>>>
>>> call stack:
>>> ...
>>> #0 [ffff883ffef034f8] machine_kexec at ffffffff81044c41
>>>   #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>>   #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>>   #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>>   #4 [ffff883ffef03668] die at ffffffff8101deb2
>>>   #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>>   #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>>   #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>>   #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>>      [exception RIP: skb_segment+3044]
>>>      RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>>      RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>>      RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>>      RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>>      R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>>      R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>   #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>>> #10 [ffff883ffef03990] tcp4_gso_segment at ffffffff818717d8
>>> #11 [ffff883ffef039b0] inet_gso_segment at ffffffff81882c9b
>>> #12 [ffff883ffef03a10] skb_mac_gso_segment at ffffffff817f39b8
>>> #13 [ffff883ffef03a38] __skb_gso_segment at ffffffff817f3ac9
>>> #14 [ffff883ffef03a68] validate_xmit_skb at ffffffff817f3eed
>>> #15 [ffff883ffef03aa8] validate_xmit_skb_list at ffffffff817f40a2
>>> #16 [ffff883ffef03ad8] sch_direct_xmit at ffffffff81824efb
>>> #17 [ffff883ffef03b20] __qdisc_run at ffffffff818251aa
>>> #18 [ffff883ffef03b90] __dev_queue_xmit at ffffffff817f45ed
>>> #19 [ffff883ffef03c08] dev_queue_xmit at ffffffff817f4b90
>>> #20 [ffff883ffef03c18] __bpf_redirect at ffffffff81812b66
>>> #21 [ffff883ffef03c40] skb_do_redirect at ffffffff81813209
>>> #22 [ffff883ffef03c60] __netif_receive_skb_core at ffffffff817f310d
>>> #23 [ffff883ffef03cc8] __netif_receive_skb at ffffffff817f32e8
>>> #24 [ffff883ffef03ce8] netif_receive_skb_internal at ffffffff817f5538
>>> #25 [ffff883ffef03d10] napi_gro_complete at ffffffff817f56c0
>>> #26 [ffff883ffef03d28] dev_gro_receive at ffffffff817f5ea6
>>> #27 [ffff883ffef03d78] napi_gro_receive at ffffffff817f6168
>>> #28 [ffff883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at ffffffff817381c2
>>> #29 [ffff883ffef03e30] mlx5e_poll_rx_cq at ffffffff817386c2
>>> #30 [ffff883ffef03e80] mlx5e_napi_poll at ffffffff8173926e
>>> #31 [ffff883ffef03ed0] net_rx_action at ffffffff817f5a6e
>>> #32 [ffff883ffef03f48] __softirqentry_text_start at ffffffff81c000c3
>>> #33 [ffff883ffef03fa8] irq_exit at ffffffff8108f515
>>> #34 [ffff883ffef03fb8] do_IRQ at ffffffff81a01b11
>>> --- <IRQ stack> ---
>>> bt: cannot transition from IRQ stack to current process stack:
>>>          IRQ stack pointer: ffff883ffef034f8
>>>      process stack pointer: ffffffff81a01ae9
>>>         current stack base: ffffc9000c5c4000
>>> ...
>>> Setup:
>>> =====
>>>
>>> The test will involve three machines:
>>>    M_ipv6 <-> M_nat <-> M_ipv4
>>>
>>> The M_nat will do ipv4<->ipv6 address translation and then forward the
>>> packet to the proper destination. The control plane will configure M_nat
>>> so that it understands the virtual ipv4 address for machine M_ipv6 and
>>> the virtual ipv6 address for machine M_ipv4.
>>>
>>> M_nat runs a bpf program, which is attached to clsact (ingress) qdisc.
>>> The program uses bpf_skb_change_proto to do protocol conversion.
>>> bpf_skb_change_proto will adjust skb header_len and len properly
>>> based on protocol change.
>>> After the conversion, the program will make the proper changes to the
>>> ethhdr and ip4/6 header, recalculate the checksum, and send the packet
>>> out through bpf_redirect.
>>>
>>> Experiment:
>>> ===========
>>>
>>> MTU: 1500B for all three machines.
>>>
>>> The tso/lro/gro are enabled on the M_nat box.
>>>
>>> ping works both ways between M_ipv6 and M_ipv4.
>>> Transferring a small file (4KB) between M_ipv6 and M_ipv4 works
>>> (both ways).
>>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails
>>> with the above BUG_ON, really fast.
>>> Did not really test from M_ipv4 to M_ipv6 with a large file.
>>>
>>> The error path is likely to be (also from the above call stack):
>>>    nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>>
>>> In one of experiments, I explicitly printed the skb->len and 
>>> skb->data_len. The values are below:
>>>    skb_segment: len 2856, data_len 2686
>>> They should be equal to avoid BUG.
>>>
>>> In another experiment, I got:
>>>    skb_segment: len 1428, data_len 1258
>>>
>>> In both cases, the difference is 170 bytes. Not sure whether
>>> this is just a coincidence or not.
>>>
>>> Workaround:
>>> ===========
>>>
>>> A workaround to avoid the BUG_ON is to disable lro/gro. This way, the
>>> kernel will not receive big packets and hence gso is not really called.
>>>
>>> I am not familiar with the gso code. Has anybody hit this BUG_ON before?
>>> Any suggestions on how to debug this?
>>>
>>
>> skb_segment() works if the geometry of the incoming GRO packet is not
>> modified.
>>
>> In your case it seems you had to adjust gso_size (calling
>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
>> skb_segment() badly, because the geometry changes, unless you had specific
>> MTU/MSS restrictions.
>>
>> You will have to make skb_segment() more generic if you really want this.
> 
> In net/core/filter.c, the function bpf_skb_change_proto, which is called
> in the bpf program, does some GSO adjustment. Could you help check
> whether it satisfies my above use case or not? Thanks!

As I said, this helper ends up modifying gso_size by +/- 20
(sizeof(ipv6 header) - sizeof(ipv4 header)).

So it won't work if skb_segment() is called after this change.

Not clear why the GRO packet is not sent as is (as a TSO packet) since 
mlx4/mlx5 NICs certainly support TSO.

