All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>,
	Ivan <ivan@prestigetransportation.com>,
	virtualization <virtualization@lists.linux-foundation.org>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	"David S. Miller" <davem@davemloft.net>
Subject: Re: PROBLEM: virtio_net LRO kernel panics
Date: Wed, 11 Aug 2021 04:01:32 -0400	[thread overview]
Message-ID: <20210811035908-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <847f9477-634f-bf14-967e-38852c82a841@redhat.com>

On Wed, Aug 11, 2021 at 03:45:48PM +0800, Jason Wang wrote:
> 
> 在 2021/8/11 下午3:39, Michael S. Tsirkin 写道:
> > On Wed, Aug 11, 2021 at 11:38:59AM +0800, Jason Wang wrote:
> > > On Tue, Aug 10, 2021 at 11:31 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > On Mon, Aug 02, 2021 at 04:23:12PM -0500, Ivan wrote:
> > > > > On Mon, Aug 2, 2021 at 2:52 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > On Mon, Aug 02, 2021 at 01:32:05PM -0500, Ivan wrote:
> > > > > > > On Tue, Jul 27, 2021 at 4:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > > On Mon, Jul 26, 2021 at 07:44:43PM -0500, Ivan wrote:
> > > > > > > > > On Sat, Jul 24, 2021 at 11:18 PM Ivan <ivan@prestigetransportation.com> wrote:
> > > > > > > > > > On Sat, Jul 24, 2021 at 7:17 PM Ivan <ivan@prestigetransportation.com> wrote:
> > > > > > > > > > > On Fri, Jul 23, 2021 at 7:33 AM Ivan <ivan@prestigetransportation.com> wrote:
> > > > > > > > > > > > On Fri, Jul 23, 2021 at 7:10 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > > > > > > > On Fri, Jul 23, 2021 at 03:06:04AM -0500, Ivan wrote:
> > > > > > > > > > > > > > On Fri, Jul 23, 2021 at 2:59 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > > > > > > > > > On Thu, Jul 22, 2021 at 11:50:11PM -0500, Ivan wrote:
> > > > > > > > > > > > > > > > On Thu, Jul 22, 2021 at 11:25 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > > > > > > 在 2021/7/23 上午10:54, Ivan 写道:
> > > > > > > > > > > > > > > > > > On Thu, Jul 22, 2021 at 9:37 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > > > > > > > > Does it work if you turn off lro before enabling the forwarding?
> > > > > > > > > > > > > > > > > > 0 root@NuRaid:~# ethtool -K eth0 lro off
> > > > > > > > > > > > > > > > > > Actual changes:
> > > > > > > > > > > > > > > > > > rx-lro: on [requested off]
> > > > > > > > > > > > > > > > > > Could not change any device features
> > > > > > > > > > > > > > > > > Ok, it looks like the device misses the VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> > > > > > > > > > > > > > > > > which makes it impossible to change the LRO setting.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Did you use qemu? If yes, what's the qemu version you've used?
> > > > > > > > > > > > > > > > These are VirtualBox machines, which I've been using for years with
> > > > > > > > > > > > > > > > longterm kernels 4.19, and I never had such a problem.  But now that I
> > > > > > > > > > > > > > > > tried upgrading to kernels 5.10 or 5.13 -- the panics started.  These
> > > > > > > > > > > > > > > > are just generic kernel builds, and a minimalistic userspace.
> > > > > > > > > > > > > > > I would be useful to see the features your virtualbox instance provides
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > cat /sys/class/net/eth0/device/features
> > > > > > > > > > > > > > # cat /sys/class/net/eth0/device/features
> > > > > > > > > > > > > > 1100010110111011111100000000000000000000000000000000000000000000
> > > > > > > > > > > > > I was able to reproduce the warning but not the panic.
> > > > > > > > > > > > > OTOH if LRO stays on when enabling forwarding that
> > > > > > > > > > > > > is already a problem. Any chance you can bisect to
> > > > > > > > > > > > > find out which change introduced the panic?
> > > > > > > > > > > > 
> > > > > > > > > > > > Any kernels up to 4.19.198 don't panic.
> > > > > > > > > > > > Any kernels 5.10+ panic immediately upon starting forwarding.
> > > > > > > > > > > > I have not tested any kernels between 4.19 and 5.10.
> > > > > > > > > > > > I guess I can build a few kernels inbetween, and try pinpoint where it starts.
> > > > > > > > > > > > That may take a day or so.  I'll get on with it now, and report my findings.
> > > > > > > > > > > So, I narrowed  it down: the panics start with kernel 5.0-rc.
> > > > > > > > > > More narowly, the problem seems be coming from commit
> > > > > > > > > > a02e8964eaf9271a8a5fcc0c55bd13f933bafc56.
> > > > > > > > > > Just to test my suspicion, I deleted a few lines from that code,
> > > > > > > > > > and the panic went away.  Hope that helps you guys figure out
> > > > > > > > > > what the problem might be.
> > > > > > > > Well it disables LRO but we knew this :( I'd help if we knew
> > > > > > > > where does it panic, all we see it the warning which is
> > > > > > > > related for sure but not the immediate rootcause ...
> > > > > > > > 
> > > > > > > > > > --- a/drivers/net/virtio_net.c
> > > > > > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > > > > > @@ -2978,11 +2978,6 @@
> > > > > > > > > >    }
> > > > > > > > > >    if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
> > > > > > > > > >       dev->features |= NETIF_F_RXCSUM;
> > > > > > > > > > - if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> > > > > > > > > > -    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6))
> > > > > > > > > > -    dev->features |= NETIF_F_LRO;
> > > > > > > > > > - if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS))
> > > > > > > > > > -    dev->hw_features |= NETIF_F_LRO;
> > > > > > > > > > 
> > > > > > > > > >    dev->vlan_features = dev->features;
> > > > > > > > > Just FYI, Google turned up two similar bug reposts...
> > > > > > > > > Apr 14, 2020 -- https://github.com/containers/podman/issues/5815
> > > > > > > > > Oct 09. 2020 -- https://bugzilla.kernel.org/show_bug.cgi?id=209593
> > > > > > > > > 
> > > > > > > > > Is there any sensible thing I could do, temporarily, until this
> > > > > > > > > problem is sorted out?
> > > > > > > > > Or am I simply stuck to kernels 4.19 on these machines for now?
> > > > > > > > 
> > > > > > > > Something like this I guess:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > > > index 8a58a2f013af..cc5982193a40 100644
> > > > > > > > --- a/drivers/net/virtio_net.c
> > > > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > > > @@ -3063,6 +3063,8 @@ static int virtnet_validate(struct virtio_device *vdev)
> > > > > > > >                          __virtio_clear_bit(vdev, VIRTIO_NET_F_MTU);
> > > > > > > >          }
> > > > > > > > 
> > > > > > > > +       __virtio_clear_bit(vdev, VIRTIO_NET_F_GUEST_TSO4);
> > > > > > > > +       __virtio_clear_bit(vdev, VIRTIO_NET_F_GUEST_TSO6);
> > > > > > > >          return 0;
> > > > > > > >   }
> > > > > > > When I apply your patch, then I see drastic (more than half)
> > > > > > > reductions in speed. (confirmed with iperf).
> > > > > > > 
> > > > > > > But if instead I just remove a few lines from commit
> > > > > > > a02e8964eaf9271a8a5fcc0c55bd13f933bafc56
> > > > > > > as in my earlier post, then I'm back to full speed
> > > > > > > 
> > > > > > > I understand that this is just temporary workaround, until we figure this out.
> > > > > > 
> > > > > > Oh weird. So it's not about getting some weird LRO packet. We will get it with
> > > > > > VIRTIO_NET_F_GUEST_TSO4 anyway. It's about the LRO flag being set in
> > > > > > features.
> > > > > > 
> > > > > > How about this then? Just pretend to Linux that we disabled LRO.
> > > > > > 
> > > > > > 
> > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > index 8a58a2f013af..8e7e4cea176b 100644
> > > > > > --- a/drivers/net/virtio_net.c
> > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > @@ -2651,8 +2651,9 @@ static int virtnet_set_features(struct net_device *dev,
> > > > > >                                     ~GUEST_OFFLOAD_LRO_MASK;
> > > > > > 
> > > > > >                  err = virtnet_set_guest_offloads(vi, offloads);
> > > > > > -               if (err)
> > > > > > -                       return err;
> > > > > > +               WARN_ON(err);
> > > > > > +               //if (err)
> > > > > > +               //      return err;
> > > > > >                  vi->guest_offloads = offloads;
> > > > > >          }
> > > > > No. With this applied, the problem persists:
> > > > > 
> > > > > # echo "1" > /proc/sys/net/ipv4/ip_forward
> > > > > 
> > > > > kernel: ------------[ cut here ]------------
> > > > > kernel: netdevice: eth0: failed to disable LRO!
> > > > > kernel: WARNING: CPU: 0 PID: 452 at net/core/dev.c:1768
> > > > > dev_disable_lro+0x108/0x150
> > > > > kernel: Modules linked in: sg nls_iso8859_1 nls_cp437 vfat fat
> > > > > hid_generic usbhid hid virtio_net net_failover failover aesni_intel
> > > > > libaes crypto_simd ohci_pci ahci libahci cryptd rapl ehci_pci ohci_hcd
> > > > > ehci_hcd usbcore usb_common libata evdev lpc_ich mfd_core rng_core
> > > > > i2c_piix4 i2c_core virtio_pci virtio_pci_modern_dev virtio_ring virtio
> > > > > rtc_cmos atkbd libps2 i8042 serio battery ac button loop unix
> > > > > kernel: CPU: 0 PID: 452 Comm: bash Not tainted 5.13.7-gnu.1-NuMini #1
> > > > > kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
> > > > > VirtualBox 12/01/2006
> > > > > kernel: RIP: 0010:dev_disable_lro+0x108/0x150
> > > > Again the warning isn't a big deal. I agree we should address - Jason
> > > > any update?
> > > I still think using NETIF_F_LRO might not be correct. Since we're
> > > basically receiving GSO packets.
> > > 
> > > And it might cause a lot of issues if the device doesn't have
> > > VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
> > > 
> > > I see two possible fixes:
> > > 
> > > 1) using NETIF_F_GRO_HW instead (the patch is attached)
> > It's unfortunate you didn't inline. Anyway.
> > Ivan could you test the patch and report?
> > 
> > > or
> > Hmm. I am not sure we always preserve the GRO_HW requirement that
> > packets can be re-segmented to reconstruct the original packet stream.
> > Do all backends guarantee this?
> 
> 
> I think we can't.
> 
> 
> > Could you explain why?
> 
> 
> Or we probably need another new netdev feature like rx-gso?
> 
> 
> > 
> > 
> > 
> > > 2) set NETIF_F_LRO only if the device has CTRL_GUEST_OFFLOADS
> > > 
> > > Thanks
> > 
> > This one would slow guests on old hosts down significantly.
> 
> 
> Actually, it's not this proposal but see below.
> 
> 
> > 
> > I am not sure why this didn't trigger previously
> 
> 
> It looks to me it was caused by a02e8964eaf9271a8a5fcc0c55bd13f933bafc56
> ("virtio-net: ethtool configurable LRO").
> 
> Before this commit we won't even advertise NETIF_F_LRO, so dev_disable_lro()
> won't warn.
> 
> After this commit, we advertise LRO and dev_disable_lro() will try to
> disable all guest offloads which will:
> 
> 1) slow the traffic
> 
> and
> 
> 2) warn if "lro" can't be disabled on the device without ctrl guest offloads
> (e.g the virtualbox host)
> 
> Thanks

OK. So I think I understand your comment now: GRO_HW makes sense simply
because historically before a02e8964eaf9271a8a5fcc0c55bd13f933bafc56 we
never advertised LRO.

Can you post a patch RFC properly so Ivan can test?



> 
> > btw -
> > we used not to have CTRL_GUEST_OFFLOADS after all.
> > 
> > 
> > 
> > > > But the main issue is you lose connectivity. That still
> > > > persists with this? Can't you get a serial connection
> > > > out? I know qemu Did the kernel oops afterwards?
> > > > 
> > > > --
> > > > MST
> > > > 
> > 
> > 

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

  reply	other threads:[~2021-08-11  8:01 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CACFia2dwacaVVYD+1uG=CDGaJqdCOSBvZ5FcXp04caecaWAY3w@mail.gmail.com>
2021-07-23  1:28 ` PROBLEM: virtio_net LRO kernel panics Tonghao Zhang
     [not found]   ` <CACFia2fDZHUZB5wJ7LK8M2sv_+W58rHw0NzzrwPPoX9=s7yPdQ@mail.gmail.com>
2021-07-23  2:37     ` Jason Wang
     [not found]       ` <CACFia2eLCJuy=w1r20691s_cSYkBkPaY-Dbd-9CkrcpSAe7z6g@mail.gmail.com>
2021-07-23  4:25         ` Jason Wang
     [not found]           ` <CACFia2eH3eCZxtt70LB5zoPbhLXRv=crPh5oOhR=6mY3auDdQA@mail.gmail.com>
2021-07-23  7:59             ` Michael S. Tsirkin
     [not found]               ` <CACFia2fWhWKMGF3g8SfU++2-jQ1rCKtCJo3h08KmhGfMTuZaQQ@mail.gmail.com>
2021-07-23  8:13                 ` Michael S. Tsirkin
2021-07-23 12:10                 ` Michael S. Tsirkin
     [not found]                   ` <CACFia2en0JJDFyz3Umk-JTnMT=kjvRogt4PudED4kiLeMjcHFg@mail.gmail.com>
     [not found]                     ` <CACFia2fx7Lt-4o_uqDznvk-VgdsMtD64qv6RYkrCjKLu2yt8bg@mail.gmail.com>
     [not found]                       ` <CACFia2eUi4KNRC7MYktzUS9Nq2WcBiesX04Tbn2pTuvuGkY4qA@mail.gmail.com>
     [not found]                         ` <CACFia2dns1rTe5OQj4H-kpurVm2CTtGfAXz0aOUS0_cs0QUrsA@mail.gmail.com>
2021-07-27  9:11                           ` Michael S. Tsirkin
     [not found]                             ` <CACFia2dLp19pzJsScSvVYREpQm0n6XCWLieWXzA94=OVYVHTbw@mail.gmail.com>
2021-08-02 19:51                               ` Michael S. Tsirkin
     [not found]                                 ` <CACFia2f8xmOwB69Cj+OUNobNSurVnrLrJFdrxnmurww9aSzJMw@mail.gmail.com>
2021-08-10 15:31                                   ` Michael S. Tsirkin
2021-08-11  3:38                                     ` Jason Wang
2021-08-11  7:39                                       ` Michael S. Tsirkin
2021-08-11  7:45                                         ` Jason Wang
2021-08-11  8:01                                           ` Michael S. Tsirkin [this message]
2021-08-11  8:17                                             ` Jason Wang
     [not found]               ` <CACFia2fYQG4Y3_ffym06C1HGrOiOS38YWxuoUu4HYorwS9qOjA@mail.gmail.com>
2021-07-23  8:59                 ` Michael S. Tsirkin
2021-07-30 11:42 ` Michael S. Tsirkin
2021-07-30 11:42   ` Michael S. Tsirkin
2021-07-30 17:04   ` Ivan
2021-07-31 20:53     ` Michael S. Tsirkin
2021-07-31 20:53       ` Michael S. Tsirkin
2021-07-31 23:52       ` Ivan
2021-08-02  4:35     ` Jason Wang
2021-08-02  4:35       ` Jason Wang
2021-08-02 18:16       ` Ivan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210811035908-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=ivan@prestigetransportation.com \
    --cc=jasowang@redhat.com \
    --cc=kuba@kernel.org \
    --cc=virtualization@lists.linux-foundation.org \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.