public inbox for netdev@vger.kernel.org
From: Edward Cree <ecree@solarflare.com>
To: Ivan Babrou <ivan@cloudflare.com>, <netdev@vger.kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Ignat Korchagin <ignat@cloudflare.com>,
	Shawn Bohrer <sbohrer@cloudflare.com>,
	Jakub Sitnicki <jakub@cloudflare.com>
Subject: Re: Crashes in skb clone/allocation in 4.19.18
Date: Wed, 30 Jan 2019 17:37:48 +0000
Message-ID: <a9d22829-2d33-e856-ee60-a3a0f51750db@solarflare.com>
In-Reply-To: <90051606-1883-7dc7-fe4f-3bb135e816ae@solarflare.com>

On 30/01/19 17:33, Edward Cree wrote:
> On 30/01/19 16:51, Ivan Babrou wrote:
>> Hey,
>>
>> We've upgraded some machines from 4.19.13 to 4.19.18 and some of them
>> crashed with the following:
>>
>> [ 2313.192006] general protection fault: 0000 [#1] SMP PTI
>> [ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G
>> O      4.19.18-cloudflare-2019.1.8 #2019.1.8
>> [ 2313.224973] Hardware name: Quanta Computer Inc. QuantaPlex
>> T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
>> [ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0
>> [ 2313.257768] Code: 89 fa 4c 89 f6 e8 68 40 a1 00 4c 8b 55 00 58 4d
>> 85 d2 75 d6 e9 6f ff ff ff 41 8b 59 20 48 8d 4a 01 4c 89 f8 49 8b 39
>> 4c 01 fb <48> 33 1b 49 33 99 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
>> 0f 84
>> [ 2313.295550] RSP: 0000:ffff94457f903b48 EFLAGS: 00010202
>> [ 2313.310352] RAX: 08b82daf1f57da0e RBX: 08b82daf1f57da0e RCX: 00000000005ff72d
>> [ 2313.327189] RDX: 00000000005ff72c RSI: 0000000000480220 RDI: 0000000000026e40
>> [ 2313.344029] RBP: ffff94457f04d680 R08: ffff94457f926e40 R09: ffff94457f04d680
>> [ 2313.360912] R10: 000004ce652a0026 R11: 0000000000000000 R12: 0000000000480220
>> [ 2313.377857] R13: 00000000ffffffff R14: ffffffffb1ab3ab7 R15: 08b82daf1f57da0e
>> [ 2313.394820] FS:  00007fdea755c780(0000) GS:ffff94457f900000(0000)
>> knlGS:0000000000000000
>> [ 2313.412887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 2313.428581] CR2: 000055acc3cf517b CR3: 000000201b1ea003 CR4: 00000000003606e0
>> [ 2313.445753] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 2313.462843] perf: interrupt took too long (8028 > 7291), lowering
>> kernel.perf_event_max_sample_rate to 24000
>> [ 2313.462867] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 2313.500216] Call Trace:
>> [ 2313.512833]  <IRQ>
>> [ 2313.524748]  __alloc_skb+0x57/0x1d0
>> [ 2313.537934]  __tcp_send_ack.part.48+0x2f/0x100
>> [ 2313.551845]  tcp_rcv_established+0x550/0x640
>> [ 2313.565394]  tcp_v4_do_rcv+0x12a/0x1e0
>> [ 2313.578322]  tcp_v4_rcv+0xadc/0xbd0
>> [ 2313.590993]  ip_local_deliver_finish+0x5d/0x1d0
>> [ 2313.604727]  ip_local_deliver+0x6b/0xe0
>> [ 2313.617782]  ? ip_sublist_rcv+0x200/0x200
>> [ 2313.630415] perf: interrupt took too long (10040 > 10035), lowering
>> kernel.perf_event_max_sample_rate to 19000
>> [ 2313.630948]  ip_rcv+0x52/0xd0
>> [ 2313.662850]  ? ip_rcv_core.isra.22+0x2b0/0x2b0
>> [ 2313.662857]  __netif_receive_skb_one_core+0x52/0x70
>> [ 2313.690860]  netif_receive_skb_internal+0x34/0xe0
>> [ 2313.690883]  efx_rx_deliver+0x11a/0x180 [sfc]
>> [ 2313.717780]  ? __efx_rx_packet+0x1ef/0x730 [sfc]
>> [ 2313.717786]  ? __queue_work+0x103/0x3e0
>> [ 2313.743118]  ? efx_poll+0x35e/0x460 [sfc]
>> [ 2313.743125]  ? net_rx_action+0x138/0x360
>> [ 2313.767356]  ? __do_softirq+0xd8/0x2d2
>> [ 2313.767362]  ? irq_exit+0xb4/0xc0
>> [ 2313.790680]  ? do_IRQ+0x85/0xd0
>> [ 2313.790688]  ? common_interrupt+0xf/0xf
>> [ 2313.790694]  </IRQ>
> Something odd is going on.  As far as I can tell from this call trace
>  (which has some weirdness in it; any chance you could reproduce with
>  frame pointers or a lower build optimisation level?) you're in the
>  normal sfc receive path (under efx_process_channel(), although that's
>  one of the functions that hasn't made it into the stack trace), which
>  means you should have a channel->rx_list, and thus efx_rx_deliver()
>  should be putting the packet on that list rather than calling
>  netif_receive_skb().
>
> I don't know how, or if, that could be related to the crash you're
>  getting, but it might be worth looking into.
> (It can't be the whole story, as your other crash is on a mlx5e and
>  AFAIK they don't use list-RX yet.  Though, confusingly, an entry for
>  ip_sublist_rcv still makes it into both stack traces.)
>
> Maybe it's secondary damage from a wild pointer or other mm problem
>  letting memory get scribbled on.
>
> -Ed
Aaaand as Lance has just pointed out, you're running the out-of-tree
 sfc driver, which doesn't have list RX yet.  Disregard the above.

-Ed
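
For readers following the thread, the delivery decision described in the
quoted message can be modelled in a few lines of userspace C. This is only
a sketch of that description, not the sfc driver source: struct pkt,
struct pkt_list, struct channel, rx_deliver() and the delivered_directly
counter are all hypothetical stand-ins for the kernel's skb, list head,
efx channel, efx_rx_deliver() and the netif_receive_skb() call.

```c
#include <stddef.h>

/* Hypothetical stand-in for an skb; not the kernel definition. */
struct pkt {
	struct pkt *next;
};

/* Hypothetical stand-in for the per-poll batch list. */
struct pkt_list {
	struct pkt *head;
	struct pkt *tail;
	size_t len;
};

/* Hypothetical stand-in for the channel state discussed in the thread. */
struct channel {
	struct pkt_list *rx_list;   /* non-NULL when a list-RX batch is open */
	size_t delivered_directly;  /* models calls to netif_receive_skb() */
};

/* Append a packet to the tail of the batch list. */
static void pkt_list_add_tail(struct pkt_list *l, struct pkt *p)
{
	p->next = NULL;
	if (l->tail)
		l->tail->next = p;
	else
		l->head = p;
	l->tail = p;
	l->len++;
}

/*
 * Model of the decision: with a batch list present the packet is queued
 * for later delivery; without one it is passed up the stack immediately.
 */
static void rx_deliver(struct channel *ch, struct pkt *p)
{
	if (ch->rx_list != NULL)
		pkt_list_add_tail(ch->rx_list, p);
	else
		ch->delivered_directly++;
}
```

In this model a driver without list RX simply never sets rx_list, so
every packet takes the direct branch. That matches a trace showing
netif_receive_skb_internal() directly under efx_rx_deliver().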
