From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-it0-f52.google.com ([209.85.214.52]:37783 "EHLO mail-it0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751405AbeCMGZM (ORCPT ); Tue, 13 Mar 2018 02:25:12 -0400
Received: by mail-it0-f52.google.com with SMTP id k79-v6so14760256ita.2 for ; Mon, 12 Mar 2018 23:25:12 -0700 (PDT)
Subject: Re: BUG_ON triggered in skb_segment
To: Yonghong Song, Eric Dumazet, Daniel Borkmann, Alexei Starovoitov, netdev, Martin Lau
References: <9265b93f-253d-6b8c-f2b8-4b54eff1835c@fb.com> <875f59f2-d1ec-c47c-cdd7-2ce4985c5143@gmail.com>
From: Eric Dumazet
Message-ID:
Date: Mon, 12 Mar 2018 23:25:09 -0700
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Sender: netdev-owner@vger.kernel.org
List-ID:

On 03/12/2018 11:08 PM, Yonghong Song wrote:
>
>
> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>
>>
>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>> Hi,
>>>
>>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>>> net-next function skb_segment, line 3667.
>>>
>>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>> 3473                             netdev_features_t features)
>>> 3474 {
>>> 3475         struct sk_buff *segs = NULL;
>>> 3476         struct sk_buff *tail = NULL;
>>> ...
>>> 3665                 while (pos < offset + len) {
>>> 3666                         if (i >= nfrags) {
>>> 3667                                 BUG_ON(skb_headlen(list_skb));
>>> 3668
>>> 3669                                 i = 0;
>>> 3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
>>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>>> 3672                                 frag_skb = list_skb;
>>> ...
>>>
>>> call stack:
>>> ...
>>>   #0 [ffff883ffef034f8] machine_kexec at ffffffff81044c41
>>>   #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>>   #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>>   #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>>   #4 [ffff883ffef03668] die at ffffffff8101deb2
>>>   #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>>   #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>>   #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>>   #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>>      [exception RIP: skb_segment+3044]
>>>      RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>>      RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>>      RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>>      RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>>      R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>>      R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>   #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>>>  #10 [ffff883ffef03990] tcp4_gso_segment at ffffffff818717d8
>>>  #11 [ffff883ffef039b0] inet_gso_segment at ffffffff81882c9b
>>>  #12 [ffff883ffef03a10] skb_mac_gso_segment at ffffffff817f39b8
>>>  #13 [ffff883ffef03a38] __skb_gso_segment at ffffffff817f3ac9
>>>  #14 [ffff883ffef03a68] validate_xmit_skb at ffffffff817f3eed
>>>  #15 [ffff883ffef03aa8] validate_xmit_skb_list at ffffffff817f40a2
>>>  #16 [ffff883ffef03ad8] sch_direct_xmit at ffffffff81824efb
>>>  #17 [ffff883ffef03b20] __qdisc_run at ffffffff818251aa
>>>  #18 [ffff883ffef03b90] __dev_queue_xmit at ffffffff817f45ed
>>>  #19 [ffff883ffef03c08] dev_queue_xmit at ffffffff817f4b90
>>>  #20 [ffff883ffef03c18] __bpf_redirect at ffffffff81812b66
>>>  #21 [ffff883ffef03c40] skb_do_redirect at ffffffff81813209
>>>  #22 [ffff883ffef03c60] __netif_receive_skb_core at ffffffff817f310d
>>>  #23 [ffff883ffef03cc8] __netif_receive_skb at ffffffff817f32e8
>>>  #24 [ffff883ffef03ce8] netif_receive_skb_internal at ffffffff817f5538
>>>  #25 [ffff883ffef03d10] napi_gro_complete at ffffffff817f56c0
>>>  #26 [ffff883ffef03d28] dev_gro_receive at ffffffff817f5ea6
>>>  #27 [ffff883ffef03d78] napi_gro_receive at ffffffff817f6168
>>>  #28 [ffff883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at ffffffff817381c2
>>>  #29 [ffff883ffef03e30] mlx5e_poll_rx_cq at ffffffff817386c2
>>>  #30 [ffff883ffef03e80] mlx5e_napi_poll at ffffffff8173926e
>>>  #31 [ffff883ffef03ed0] net_rx_action at ffffffff817f5a6e
>>>  #32 [ffff883ffef03f48] __softirqentry_text_start at ffffffff81c000c3
>>>  #33 [ffff883ffef03fa8] irq_exit at ffffffff8108f515
>>>  #34 [ffff883ffef03fb8] do_IRQ at ffffffff81a01b11
>>> --- ---
>>> bt: cannot transition from IRQ stack to current process stack:
>>>          IRQ stack pointer: ffff883ffef034f8
>>>      process stack pointer: ffffffff81a01ae9
>>>         current stack base: ffffc9000c5c4000
>>> ...
>>> Setup:
>>> =====
>>>
>>> The test will involve three machines:
>>>    M_ipv6 <-> M_nat <-> M_ipv4
>>>
>>> The M_nat will do ipv4<->ipv6 address translation and then forward
>>> packet to proper destination. The control plane will configure M_nat
>>> properly will understand virtual ipv4 address for machine M_ipv6, and
>>> virtual ipv6 address for machine M_ipv4.
>>>
>>> M_nat runs a bpf program, which is attached to clsact (ingress) qdisc.
>>> The program uses bpf_skb_change_proto to do protocol conversion.
>>> bpf_skb_change_proto will adjust skb header_len and len properly
>>> based on protocol change.
>>> After the conversion, the program will make proper change on
>>> ethhdr and ip4/6 header, recalculate checksum, and send the packet out
>>> through bpf_redirect.
>>>
>>> Experiment:
>>> ===========
>>>
>>> MTU: 1500B for all three machines.
>>>
>>> The tso/lro/gro are enabled on the M_nat box.
>>>
>>> ping works on both ways of M_ipv6 <-> M_ipv4.
>>> It works for transferring a small file (4KB) between M_ipv6 and M_ipv4
>>> (both ways).
>>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4, failed
>>> with the above BUG_ON, really fast.
>>> Did not really test from M_ipv4 to M_ipv6 with large file.
>>>
>>> The error path likely to be (also from the above call stack):
>>>    nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>>
>>> In one of experiments, I explicitly printed the skb->len and
>>> skb->data_len. The values are below:
>>>    skb_segment: len 2856, data_len 2686
>>> They should be equal to avoid BUG.
>>>
>>> In another experiment, I got:
>>>    skb_segment: len 1428, data_len 1258
>>>
>>> In both cases, the difference is 170 bytes. Not sure whether
>>> this is just a coincidence or not.
>>>
>>> Workaround:
>>> ===========
>>>
>>> A workaround to avoid BUG_ON is to disable lro/gro. This way,
>>> kernel will not receive big packets and hence gso is not really called.
>>>
>>> I am not familiar with gso code. Does anybody hit this BUG_ON before?
>>> Any suggestion on how to debug this?
>>>
>>
>> skb_segment() works if incoming GRO packet is not modified in its
>> geometry.
>>
>> In your case it seems you had to adjust gso_size (calling
>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
>> skb_segment() badly, because geometry changes, unless you had specific
>> MTU/MSS restrictions.
>>
>> You will have to make skb_segment() more generic if you really want this.
>
> In net/core/filter.c function bpf_skb_change_proto, which is called
> in the bpf program, does some GSO adjustment. Could you help check
> whether it satisfies my above use case or not? Thanks!

As I said, this helper ends up modifying gso_size by +/- 20
(sizeof(ipv6 header) - sizeof(ipv4 header)), so it won't work if
skb_segment() is called after this change.

It is not clear why the GRO packet is not sent as is (as a TSO packet),
since mlx4/mlx5 NICs certainly support TSO.
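
To illustrate why a +/- 20 change in gso_size matters, here is a minimal
user-space sketch (not kernel code). It assumes a GRO packet whose payload
sits in equal-sized buffers and counts how many segment cuts land inside a
buffer instead of on a buffer boundary, which is the kind of geometry
mismatch described above. The function name misaligned_cuts() is
hypothetical, the 1428-byte buffer size is borrowed from the len value in
the report purely for illustration, and the model ignores headers, page
frags and everything else the real skb_segment() deals with.

```c
#include <assert.h>

/*
 * Toy model: a packet made of `nchunks` buffers of `chunk` payload bytes
 * each, cut into segments of `gso_size` bytes. Returns the number of
 * segment cuts that fall strictly inside a buffer rather than on a
 * boundary between two buffers.
 */
static int misaligned_cuts(unsigned int chunk, unsigned int nchunks,
                           unsigned int gso_size)
{
    unsigned int total = chunk * nchunks;
    unsigned int pos;
    int bad = 0;

    /* Guard against division by zero and a non-advancing loop. */
    assert(chunk > 0 && gso_size > 0);

    for (pos = gso_size; pos < total; pos += gso_size) {
        if (pos % chunk != 0)
            bad++;  /* this cut splits a buffer in two */
    }
    return bad;
}
```

With three 1428-byte buffers, misaligned_cuts(1428, 3, 1428) returns 0,
while gso_size values of 1448 and 1408 (the original size +/- 20) give 2
and 3 misaligned cuts respectively.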