public inbox for netdev@vger.kernel.org
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Yuchung Cheng <ycheng@google.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>,
	Roman Gushchin <guro@fb.com>, netdev <netdev@vger.kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Neal Cardwell <ncardwell@google.com>,
	"David S. Miller" <davem@davemloft.net>,
	brakmo@fb.com
Subject: Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Date: Wed, 25 Oct 2017 19:07:26 -0700
Message-ID: <20171026020724.bgobtktvcpkhco4h@ast-mbp>
In-Reply-To: <CAK6E8=eju-MgROynJXntRCQ4KtXZ+qrgQC02bGJ27rp1wMmUjg@mail.gmail.com>

On Thu, Sep 28, 2017 at 04:36:58PM -0700, Yuchung Cheng wrote:
> On Thu, Sep 28, 2017 at 1:14 AM, Oleksandr Natalenko
> <oleksandr@natalenko.name> wrote:
> > Hi.
> >
> > Won't tell about panic in tcp_sacktag_walk() since I cannot trigger it
> > intentionally, but setting net.ipv4.tcp_retrans_collapse to 0 *does not* fix
> > warning in tcp_fastretrans_alert() for me.
> 
> Hi Oleksandr: no retrans_collapse should not matter for that warning
> in tcp_fastretrans_alert(). the warning as I explained earlier is
> likely false. Neal and I are more concerned the panic in
> tcp_sacktag_walk. This is just a blind shot but thx for retrying.
> 
> We can submit a one-liner to remove the fast retrans warning but want
> to nail the bigger issue first.

We're still seeing the warnings followed by crashes, and it's very concerning.
We hoped that Neal's most recent patches in this area (from Sep 18) might
magically fix the issue, but no: the panics are still there.
We've confirmed that net.ipv4.tcp_retrans_collapse=0 does not help,
whereas net.ipv4.tcp_recovery=0 does, but that is obviously undesirable.
We're out of ideas on how to debug this.
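
For anyone trying to reproduce this, it may help to record the two sysctls
mentioned above alongside each trace. A minimal sketch, assuming the usual
/proc/sys layout; the helper names read_sysctl and report are made up for
illustration and are not part of any existing tool:

```python
from pathlib import Path

# The two sysctls named in this report.
SYSCTLS = ("net.ipv4.tcp_retrans_collapse", "net.ipv4.tcp_recovery")


def read_sysctl(name, root="/proc/sys"):
    """Return the sysctl value as a string, or None if it cannot be read."""
    # Dotted sysctl names map onto directories under /proc/sys.
    path = Path(root) / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report(root="/proc/sys"):
    """Collect the current value of each sysctl of interest."""
    return {name: read_sysctl(name, root) for name in SYSCTLS}


if __name__ == "__main__":
    for name, value in report().items():
        print(f"{name} = {value}")
```

The root parameter exists only so the helper can be pointed at a saved copy
of /proc/sys; on a live machine the default should be enough.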

Warning + panic:
[ 7678.309043] Call Trace:
[ 7678.313925]  <IRQ>
[ 7678.317955]  dump_stack+0x4d/0x70
[ 7678.324573]  __warn+0xd3/0xf0
[ 7678.330497]  warn_slowpath_null+0x1e/0x20
[ 7678.338510]  tcp_fastretrans_alert+0xacf/0xbd0
[ 7678.347380]  tcp_ack+0xbce/0x1390
[ 7678.354006]  tcp_rcv_established+0x1ce/0x740
[ 7678.362536]  tcp_v6_do_rcv+0x15b/0x400
[ 7678.370025]  tcp_v6_rcv+0x94c/0x9f0
[ 7678.376989]  ? tcp_rcv_established+0x42b/0x740
[ 7678.385866]  ip6_input_finish+0xea/0x430
[ 7678.393702]  ip6_input+0x32/0xa0
[ 7678.400155]  ? ip6_rcv_finish+0xa0/0xa0
[ 7678.407834]  ip6_rcv_finish+0x4b/0xa0
[ 7678.415164]  ipv6_rcv+0x2ec/0x4f0
[ 7678.421778]  ? ip6_make_skb+0x1c0/0x1c0
[ 7678.429447]  __netif_receive_skb_core+0x2d5/0x9a0
[ 7678.438845]  ? tcp6_gro_receive+0x11d/0x1c0
[ 7678.447211]  ? ipv6_gro_receive+0x1d0/0x380
[ 7678.455566]  __netif_receive_skb+0x16/0x70
[ 7678.463748]  netif_receive_skb_internal+0x32/0x3b0
[ 7678.473318]  ? dev_gro_receive+0x27c/0x470
[ 7678.481502]  napi_gro_receive+0x113/0x1d0
[ 7678.489510]  mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
[ 7678.499077]  mlx5e_poll_rx_cq+0x79/0x830
[ 7678.506911]  mlx5e_napi_poll+0x8c/0x3d0
[ 7678.514579]  ? mlx5_cq_completion+0x54/0xb0
[ 7678.522950]  net_rx_action+0x22e/0x380
[ 7678.530450]  __do_softirq+0x106/0x2e8
[ 7678.537768]  irq_exit+0xb0/0xc0
[ 7678.544042]  do_IRQ+0x4f/0xd0
[ 7678.549970]  common_interrupt+0x86/0x86
...
[ 7685.690697] BUG: unable to handle kernel 
[ 7685.698737] NULL pointer dereference
[ 7685.705889]  at 0000000000000035
[ 7685.712357] IP: tcp_sacktag_walk+0xfd/0x460
[ 7685.720726] PGD 1002d75067 
[ 7685.726320] PUD 102c506067 
[ 7685.731902] PMD 0 
...
[ 7686.070541] task: ffff880ff5959b00 task.stack: ffffc90024708000
[ 7686.082363] RIP: 0010:tcp_sacktag_walk+0xfd/0x460
[ 7686.091754] RSP: 0000:ffff88203eec38b0 EFLAGS: 00010217
[ 7686.102188] RAX: ffff881953620a00 RBX: ffff881009273200 RCX: 00000000fbfa50ec
[ 7686.116439] RDX: 00000000fbfa54a0 RSI: 0000000000000000 RDI: ffff881009273200
[ 7686.130688] RBP: ffff88203eec3908 R08: 0000000000000048 R09: 00000000fbfa55fb
[ 7686.144936] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881ec9ff98c0
[ 7686.159184] R13: ffff88203eec3a10 R14: 0000000000000000 R15: ffff881ec9ff9a18
[ 7686.173431] FS:  00007f2b235ff700(0000) GS:ffff88203eec0000(0000) knlGS:0000000000000000
[ 7686.189589] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7686.201062] CR2: 0000000000000035 CR3: 0000000fd8229000 CR4: 00000000003406e0
[ 7686.215310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7686.229560] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 7686.243807] Call Trace:
[ 7686.248696]  <IRQ>
[ 7686.252727]  tcp_sacktag_write_queue+0x5ff/0x9e0
[ 7686.261943]  tcp_ack+0x677/0x1390
[ 7686.268569]  ? select_task_rq_fair+0x9e3/0xf70
[ 7686.277437]  ? tcp_write_xmit+0x230/0x10a0
[ 7686.285619]  tcp_rcv_established+0x1ce/0x740
[ 7686.294151]  tcp_v6_do_rcv+0x15b/0x400
[ 7686.301636]  tcp_v6_rcv+0x94c/0x9f0
[ 7686.308606]  ? default_wake_function+0x12/0x20
[ 7686.317483]  ? __wake_up_common+0x53/0x90
[ 7686.325492]  ip6_input_finish+0xea/0x430
[ 7686.333329]  ip6_input+0x32/0xa0
[ 7686.339791]  ? ip6_rcv_finish+0xa0/0xa0
[ 7686.347451]  ip6_rcv_finish+0x4b/0xa0
[ 7686.354767]  ipv6_rcv+0x2ec/0x4f0
[ 7686.361389]  ? ip6_make_skb+0x1c0/0x1c0
[ 7686.369050]  __netif_receive_skb_core+0x2d5/0x9a0
[ 7686.378449]  ? tcp6_gro_receive+0x11d/0x1c0
[ 7686.386797]  ? ipv6_gro_receive+0x1d0/0x380
[ 7686.395160]  __netif_receive_skb+0x16/0x70
[ 7686.403342]  netif_receive_skb_internal+0x32/0x3b0
[ 7686.412931]  ? dev_gro_receive+0x27c/0x470
[ 7686.421110]  napi_gro_receive+0x113/0x1d0
[ 7686.429136]  mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
[ 7686.438722]  mlx5e_poll_rx_cq+0x79/0x830
[ 7686.446572]  mlx5e_napi_poll+0x8c/0x3d0
[ 7686.454233]  ? mlx5_cq_completion+0x54/0xb0
[ 7686.462591]  net_rx_action+0x22e/0x380
[ 7686.470099]  __do_softirq+0x106/0x2e8
[ 7686.477431]  irq_exit+0xb0/0xc0
[ 7686.483700]  do_IRQ+0x4f/0xd0
[ 7686.489629]  common_interrupt+0x86/0x86

Sometimes soft lockup is also seen:

[2676627.345885] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[2676627.717771] RIP: 0010:queued_spin_lock_slowpath+0x15c/0x180
[2676627.729239] RSP: 0018:ffff88103f603e78 EFLAGS: 00000202
[2676627.740019]  ORIG_RAX: ffffffffffffff10
[2676627.748027] RAX: 0000000000000101 RBX: ffff8811181d8840 RCX: 0000000000000101
[2676627.762622] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff8811181d88c8
[2676627.777217] RBP: ffff88103f603e88 R08: 0000000000000001 R09: ffff88103f95a800
[2676627.791811] R10: 0000000000037a13 R11: 000000001a37bb20 R12: 0000000000000100
[2676627.806405] R13: ffff8811181d8c18 R14: ffffffff81857270 R15: ffff8811181d8840
[2676627.821000] FS:  0000000000000000(0000) GS:ffff88103f600000(0000) knlGS:0000000000000000
[2676627.837500] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2676627.849318] CR2: 00007f6e44c1f000 CR3: 0000000001e09000 CR4: 00000000003406f0
[2676627.863916] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2676627.878508] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[2676627.893103] Call Trace:
[2676627.898338]  <IRQ>
[2676627.902711]  ? _raw_spin_lock+0x21/0x30
[2676627.910715]  tcp_write_timer+0x19/0x70
[2676627.918548]  call_timer_fn+0x35/0x150
[2676627.926214]  run_timer_softirq+0x1be/0x3e0
[2676627.934742]  ? timerqueue_add+0x59/0xb0
[2676627.942745]  ? ktime_get+0x3c/0xb0
[2676627.949889]  __do_softirq+0x106/0x2e8
[2676627.957551]  irq_exit+0xb0/0xc0
[2676627.964174]  smp_apic_timer_interrupt+0x68/0x90
[2676627.973563]  apic_timer_interrupt+0x86/0x90
[2676628.084861] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 21s! [swapper/11:0]
[2676628.084896] RIP: 0010:tcp_match_skb_to_sack+0x2d/0xb0
[2676628.084897] RSP: 0018:ffff88103f8c38a8 EFLAGS: 00000297
[2676628.084897]  ORIG_RAX: ffffffffffffff10
[2676628.084898] RAX: 0000000000000000 RBX: ffff88010f9dbb18 RCX: 000000002d5da350
[2676628.084899] RDX: 000000002d5da141 RSI: ffff88010f9dbb18 RDI: ffff8811181d8840
[2676628.084899] RBP: ffff88103f8c3908 R08: 00000000ffffffff R09: 00000000d2a25ebe
[2676628.084900] R10: 000000000000e323 R11: ffff8811181d8998 R12: ffff8811181d8840
[2676628.084900] R13: ffff88103f8c3a10 R14: 0000000000000000 R15: ffff8811181d8998
[2676628.084901] FS:  0000000000000000(0000) GS:ffff88103f8c0000(0000) knlGS:0000000000000000
[2676628.084902] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2676628.084903] CR2: 00007fbc07410000 CR3: 0000000001e09000 CR4: 00000000003406e0
[2676628.084903] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2676628.084904] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[2676628.084904] Call Trace:
[2676628.084905]  <IRQ>
[2676628.084907]  ? tcp_sacktag_walk+0x270/0x460
[2676628.084909]  tcp_sacktag_write_queue+0x5ff/0x9e0
[2676628.084911]  tcp_ack+0x677/0x1390
[2676628.084913]  ? tcp_write_xmit+0x230/0x10a0
[2676628.084915]  tcp_rcv_established+0x1ce/0x740
[2676628.084919]  tcp_v6_do_rcv+0x15b/0x400
[2676628.084921]  tcp_v6_rcv+0x94c/0x9f0
[2676628.084924]  ? __dev_queue_xmit+0x251/0x650
[2676628.084928]  ip6_input_finish+0xea/0x430
[2676628.084930]  ip6_input+0x32/0xa0
[2676628.084932]  ? ip6_rcv_finish+0xa0/0xa0
[2676628.084933]  ip6_rcv_finish+0x4b/0xa0
[2676628.084935]  ipv6_rcv+0x2ec/0x4f0
[2676628.084937]  ? ip6_make_skb+0x1c0/0x1c0
[2676628.084938]  __netif_receive_skb_core+0x2d5/0x9a0
[2676628.084941]  ? tcp6_gro_receive+0x11d/0x1c0
[2676628.084943]  ? ipv6_gro_receive+0x1d0/0x380
[2676628.084944]  __netif_receive_skb+0x16/0x70
[2676628.084945]  netif_receive_skb_internal+0x32/0x3b0
[2676628.084947]  ? dev_gro_receive+0x27c/0x470
[2676628.084948]  napi_gro_receive+0x113/0x1d0
[2676628.084952]  mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
[2676628.084954]  mlx5e_poll_rx_cq+0x79/0x830
[2676628.084956]  mlx5e_napi_poll+0x8c/0x3d0
[2676628.084959]  ? mlx5_cq_completion+0x54/0xb0
[2676628.084960]  net_rx_action+0x22e/0x380
[2676628.084962]  __do_softirq+0x106/0x2e8
[2676628.084965]  irq_exit+0xb0/0xc0
[2676628.084967]  do_IRQ+0x4f/0xd0
[2676656.083964] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [swapper/11:0]
[2676656.083992] RIP: 0010:tcp_sacktag_walk+0x25f/0x460
[2676656.083993] RSP: 0018:ffff88103f8c38b0 EFLAGS: 00000207
[2676656.083993]  ORIG_RAX: ffffffffffffff10
[2676656.083994] RAX: 0000000000021500 RBX: ffff8810d84c3600 RCX: 000000002d5d9e41
[2676656.083994] RDX: 00000000f15de229 RSI: ffff8810d84c0200 RDI: ffff8811181d8840
[2676656.083995] RBP: ffff88103f8c3908 R08: 000000000000ff21 R09: 0000000000000000
[2676656.083996] R10: 000000000000e323 R11: ffff8811181d8998 R12: ffff8811181d8840
[2676656.083996] R13: ffff88103f8c3a10 R14: 0000000000000000 R15: ffff8811181d8998
[2676656.083997] FS:  0000000000000000(0000) GS:ffff88103f8c0000(0000) knlGS:0000000000000000
[2676656.083998] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2676656.083998] CR2: 00007fbc07410000 CR3: 0000000001e09000 CR4: 00000000003406e0
[2676656.083999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2676656.083999] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[2676656.084000] Call Trace:
[2676656.084001]  <IRQ>
[2676656.084003]  tcp_sacktag_write_queue+0x5ff/0x9e0
[2676656.084005]  tcp_ack+0x677/0x1390
[2676656.084007]  ? tcp_write_xmit+0x230/0x10a0
[2676656.084009]  tcp_rcv_established+0x1ce/0x740
[2676656.084013]  tcp_v6_do_rcv+0x15b/0x400
[2676656.084015]  tcp_v6_rcv+0x94c/0x9f0
[2676656.084018]  ? __dev_queue_xmit+0x251/0x650
[2676656.084022]  ip6_input_finish+0xea/0x430
[2676656.084024]  ip6_input+0x32/0xa0
[2676656.084026]  ? ip6_rcv_finish+0xa0/0xa0
[2676656.084027]  ip6_rcv_finish+0x4b/0xa0
[2676656.084029]  ipv6_rcv+0x2ec/0x4f0
[2676656.084031]  ? ip6_make_skb+0x1c0/0x1c0
[2676656.084032]  __netif_receive_skb_core+0x2d5/0x9a0
[2676656.084035]  ? tcp6_gro_receive+0x11d/0x1c0
[2676656.084037]  ? ipv6_gro_receive+0x1d0/0x380
[2676656.084038]  __netif_receive_skb+0x16/0x70
[2676656.084039]  netif_receive_skb_internal+0x32/0x3b0
[2676656.084041]  ? dev_gro_receive+0x27c/0x470
[2676656.084042]  napi_gro_receive+0x113/0x1d0
[2676656.084046]  mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
[2676656.084048]  mlx5e_poll_rx_cq+0x79/0x830
[2676656.084050]  mlx5e_napi_poll+0x8c/0x3d0

Any help greatly appreciated.
