Subject: Re: Crashes in skb clone/allocation in 4.19.18
From: Edward Cree
To: Ivan Babrou
CC: "David S. Miller", Eric Dumazet, Ignat Korchagin, Shawn Bohrer, Jakub Sitnicki
Date: Wed, 30 Jan 2019 17:37:48 +0000
In-Reply-To: <90051606-1883-7dc7-fe4f-3bb135e816ae@solarflare.com>
X-Mailing-List: netdev@vger.kernel.org

On 30/01/19 17:33, Edward Cree wrote:
> On 30/01/19 16:51, Ivan Babrou wrote:
>> Hey,
>>
>> We've upgraded some machines from 4.19.13 to 4.19.18 and some of them
>> crashed with the following:
>>
>> [ 2313.192006] general protection fault: 0000 [#1] SMP PTI
>> [ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G
>> O 4.19.18-cloudflare-2019.1.8 #2019.1.8
>> [ 2313.224973] Hardware name: Quanta Computer Inc. QuantaPlex
>> T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
>> [ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0
>> [ 2313.257768] Code: 89 fa 4c 89 f6 e8 68 40 a1 00 4c 8b 55 00 58 4d
>> 85 d2 75 d6 e9 6f ff ff ff 41 8b 59 20 48 8d 4a 01 4c 89 f8 49 8b 39
>> 4c 01 fb <48> 33 1b 49 33 99 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
>> 0f 84
>> [ 2313.295550] RSP: 0000:ffff94457f903b48 EFLAGS: 00010202
>> [ 2313.310352] RAX: 08b82daf1f57da0e RBX: 08b82daf1f57da0e RCX: 00000000005ff72d
>> [ 2313.327189] RDX: 00000000005ff72c RSI: 0000000000480220 RDI: 0000000000026e40
>> [ 2313.344029] RBP: ffff94457f04d680 R08: ffff94457f926e40 R09: ffff94457f04d680
>> [ 2313.360912] R10: 000004ce652a0026 R11: 0000000000000000 R12: 0000000000480220
>> [ 2313.377857] R13: 00000000ffffffff R14: ffffffffb1ab3ab7 R15: 08b82daf1f57da0e
>> [ 2313.394820] FS: 00007fdea755c780(0000) GS:ffff94457f900000(0000)
>> knlGS:0000000000000000
>> [ 2313.412887] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 2313.428581] CR2: 000055acc3cf517b CR3: 000000201b1ea003 CR4: 00000000003606e0
>> [ 2313.445753] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 2313.462843] perf: interrupt took too long (8028 > 7291), lowering
>> kernel.perf_event_max_sample_rate to 24000
>> [ 2313.462867] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 2313.500216] Call Trace:
>> [ 2313.512833]
>> [ 2313.524748] __alloc_skb+0x57/0x1d0
>> [ 2313.537934] __tcp_send_ack.part.48+0x2f/0x100
>> [ 2313.551845] tcp_rcv_established+0x550/0x640
>> [ 2313.565394] tcp_v4_do_rcv+0x12a/0x1e0
>> [ 2313.578322] tcp_v4_rcv+0xadc/0xbd0
>> [ 2313.590993] ip_local_deliver_finish+0x5d/0x1d0
>> [ 2313.604727] ip_local_deliver+0x6b/0xe0
>> [ 2313.617782] ? ip_sublist_rcv+0x200/0x200
>> [ 2313.630415] perf: interrupt took too long (10040 > 10035), lowering
>> kernel.perf_event_max_sample_rate to 19000
>> [ 2313.630948] ip_rcv+0x52/0xd0
>> [ 2313.662850] ? ip_rcv_core.isra.22+0x2b0/0x2b0
>> [ 2313.662857] __netif_receive_skb_one_core+0x52/0x70
>> [ 2313.690860] netif_receive_skb_internal+0x34/0xe0
>> [ 2313.690883] efx_rx_deliver+0x11a/0x180 [sfc]
>> [ 2313.717780] ? __efx_rx_packet+0x1ef/0x730 [sfc]
>> [ 2313.717786] ? __queue_work+0x103/0x3e0
>> [ 2313.743118] ? efx_poll+0x35e/0x460 [sfc]
>> [ 2313.743125] ? net_rx_action+0x138/0x360
>> [ 2313.767356] ? __do_softirq+0xd8/0x2d2
>> [ 2313.767362] ? irq_exit+0xb4/0xc0
>> [ 2313.790680] ? do_IRQ+0x85/0xd0
>> [ 2313.790688] ? common_interrupt+0xf/0xf
>> [ 2313.790694]
> Something odd is going on.  As far as I can tell from this call trace
>  (which has some weirdness in it; any chance you could reproduce with
>  frame pointers or a lower build optimisation level?) you're in the
>  normal sfc receive path (under efx_process_channel(), although that's
>  one of the functions that hasn't made it into the stack trace), which
>  means you should have a channel->rx_list, and thus efx_rx_deliver()
>  should be putting the packet on that list rather than calling
>  netif_receive_skb().
>
> I don't know how, or if, that could be related to the crash you're
>  getting, but it might be worth looking into.
> (It can't be the whole story, as your other crash is on a mlx5e and
>  AFAIK they don't use list-RX yet.  Though, confusingly, an entry for
>  ip_sublist_rcv still makes it into both stack traces.)
>
> Maybe it's secondary damage from a wild pointer or other mm problem
>  letting memory get scribbled on.
>
> -Ed

Aaaand as Lance has just pointed out, you're running the out-of-tree
 sfc driver, which doesn't have list RX yet.  Disregard the above.

-Ed