From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Dumazet
Subject: Re: Latest net-next kernel 4.19.0+
Date: Mon, 29 Oct 2018 20:52:39 -0700
Message-ID:
References: <59d5657c-ea0a-7b64-d5ff-5b55eb4fcccf@itcare.pl> <1e954663-ed05-4f33-4384-db880844f9d1@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: Linux Kernel Network Developers, Dimitris Michailidis
To: Cong Wang, Paweł Staszewski
Return-path:
Received: from mail-pf1-f176.google.com ([209.85.210.176]:39756 "EHLO mail-pf1-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725935AbeJ3Mo0 (ORCPT ); Tue, 30 Oct 2018 08:44:26 -0400
Received: by mail-pf1-f176.google.com with SMTP id c25-v6so5093507pfe.6 for ; Mon, 29 Oct 2018 20:52:42 -0700 (PDT)
In-Reply-To: <1e954663-ed05-4f33-4384-db880844f9d1@gmail.com>
Content-Language: en-US
Sender: netdev-owner@vger.kernel.org
List-ID:

On 10/29/2018 07:53 PM, Eric Dumazet wrote:
>
>
> On 10/29/2018 07:27 PM, Cong Wang wrote:
>> Hi,
>>
>> On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski wrote:
>>>
>>> Sorry not complete - followed by hw csum:
>>>
>>> [ 342.190831] vlan1490: hw csum failure
>>> [ 342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
>>> [ 342.190836] Call Trace:
>>> [ 342.190839]
>>> [ 342.190849] dump_stack+0x46/0x5b
>>> [ 342.190856] __skb_checksum_complete+0x9a/0xa0
>>> [ 342.190859] tcp_v4_rcv+0xef/0x960
>>> [ 342.190864] ip_local_deliver_finish+0x49/0xd0
>>> [ 342.190866] ip_local_deliver+0x5e/0xe0
>>> [ 342.190869] ? ip_sublist_rcv_finish+0x50/0x50
>>> [ 342.190870] ip_rcv+0x41/0xc0
>>> [ 342.190874] __netif_receive_skb_one_core+0x4b/0x70
>>> [ 342.190877] netif_receive_skb_internal+0x2f/0xd0
>>> [ 342.190879] napi_gro_receive+0xb7/0xe0
>>> [ 342.190884] mlx5e_handle_rx_cqe+0x7a/0xd0
>>> [ 342.190886] mlx5e_poll_rx_cq+0xc6/0x930
>>> [ 342.190888] mlx5e_napi_poll+0xab/0xc90
>>
>>
>> We got exactly the same backtrace in our data center. However,
>> it is not easy for us to reproduce it, do you have any clue to reproduce it?
>>
>> If you do, try to tcpdump the packets triggering this warning, it could
>> be useful for debugging.
>>
>> Also, we tried to apply commit d55bef5059dd057bd, the warning _still_
>> occurs. We tried to revert the offending commit 88078d98d1bb, it
>> disappears. So it is likely that commit 88078d98d1bb introduces
>> more troubles than the one fixed by d55bef5059dd057bd.
>>
>
> Or this could be that mlx5 driver is buggy when dealing with VLAN tags.
>
> It both uses vlan_tci (hardware vlan offload) in skb _and_ this piece of code in mlx5e_handle_csum()
>
>         if (network_depth > ETH_HLEN)
>                 /* CQE csum is calculated from the IP header and does
>                  * not cover VLAN headers (if present). This will add
>                  * the checksum manually.
>                  */
>                 skb->csum = csum_partial(skb->data + ETH_HLEN,
>                                          network_depth - ETH_HLEN,
>                                          skb->csum);
>
>
> That seems strange to me, because skb_vlan_untag() will not adjust skb->csum in this case.
>

The bug might be in the mlx5 handling of NETIF_F_RXFCS, btw...

The code does:

        if (unlikely(netdev->features & NETIF_F_RXFCS))
                skb->csum = csum_add(skb->csum,
                                     (__force __wsum)mlx5e_get_fcs(skb));

But Dimitris told us that we need to take into account whether the FCS
starts at an odd or an even offset.

->

        if (unlikely(netdev->features & NETIF_F_RXFCS))
                skb->csum = csum_block_add(skb->csum,
                                           (__force __wsum)mlx5e_get_fcs(skb),
                                           skb->len);
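
To illustrate why the parity of the offset matters, here is a minimal
userspace sketch (not kernel code, not the mlx5 fix). The helpers fold16(),
sum16(), add16(), block_add16() and odd_offset_example() are hypothetical
names made up for this mail; they work on folded 16-bit sums over
big-endian words, loosely modelled on what csum_add() and csum_block_add()
do. The sample bytes are arbitrary. The point is only that a block whose
sum was computed on its own must be byte-rotated before being added when
it starts at an odd offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* One's-complement sum of buf read as big-endian 16-bit words.
 * A trailing odd byte is treated as the high byte of a final word.
 * The accumulator is sized for packet-length buffers only. */
static uint16_t sum16(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;
    return fold16(sum);
}

/* Plain one's-complement add: only valid for blocks at even offsets. */
static uint16_t add16(uint16_t a, uint16_t b)
{
    return fold16((uint32_t)a + b);
}

/* Offset-aware add: swap the bytes of the block's sum when the block
 * starts at an odd offset, then add as usual. */
static uint16_t block_add16(uint16_t sum, uint16_t block, size_t offset)
{
    if (offset & 1)
        block = (uint16_t)((block << 8) | (block >> 8));
    return add16(sum, block);
}

/* Hypothetical 9-byte "packet": 5 bytes of payload, then a 4-byte
 * trailer standing in for the FCS, which therefore starts at offset 5. */
static void odd_offset_example(void)
{
    static const uint8_t pkt[9] = {
        0x45, 0x00, 0x00, 0x1c, 0xab,   /* payload, odd length */
        0x12, 0x34, 0x56, 0x78          /* trailer */
    };
    uint16_t whole = sum16(pkt, 9);

    /* Odd split: the plain add is wrong, the offset-aware add is right. */
    assert(add16(sum16(pkt, 5), sum16(pkt + 5, 4)) != whole);
    assert(block_add16(sum16(pkt, 5), sum16(pkt + 5, 4), 5) == whole);

    /* Even split: both give the checksum of the whole buffer. */
    assert(add16(sum16(pkt, 4), sum16(pkt + 4, 5)) == whole);
    assert(block_add16(sum16(pkt, 4), sum16(pkt + 4, 5), 4) == whole);
}
```

Calling odd_offset_example() from a test harness should pass all four
asserts: with the split at offset 5 only the offset-aware variant matches
the checksum of the whole buffer, while with the split at offset 4 the two
variants agree.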