From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>,
Ivan <ivan@prestigetransportation.com>,
virtualization <virtualization@lists.linux-foundation.org>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
"David S. Miller" <davem@davemloft.net>
Subject: Re: PROBLEM: virtio_net LRO kernel panics
Date: Wed, 11 Aug 2021 04:01:32 -0400 [thread overview]
Message-ID: <20210811035908-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <847f9477-634f-bf14-967e-38852c82a841@redhat.com>
On Wed, Aug 11, 2021 at 03:45:48PM +0800, Jason Wang wrote:
>
> 在 2021/8/11 下午3:39, Michael S. Tsirkin 写道:
> > On Wed, Aug 11, 2021 at 11:38:59AM +0800, Jason Wang wrote:
> > > On Tue, Aug 10, 2021 at 11:31 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > On Mon, Aug 02, 2021 at 04:23:12PM -0500, Ivan wrote:
> > > > > On Mon, Aug 2, 2021 at 2:52 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > On Mon, Aug 02, 2021 at 01:32:05PM -0500, Ivan wrote:
> > > > > > > On Tue, Jul 27, 2021 at 4:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > > On Mon, Jul 26, 2021 at 07:44:43PM -0500, Ivan wrote:
> > > > > > > > > On Sat, Jul 24, 2021 at 11:18 PM Ivan <ivan@prestigetransportation.com> wrote:
> > > > > > > > > > On Sat, Jul 24, 2021 at 7:17 PM Ivan <ivan@prestigetransportation.com> wrote:
> > > > > > > > > > > On Fri, Jul 23, 2021 at 7:33 AM Ivan <ivan@prestigetransportation.com> wrote:
> > > > > > > > > > > > On Fri, Jul 23, 2021 at 7:10 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > > > > > > > On Fri, Jul 23, 2021 at 03:06:04AM -0500, Ivan wrote:
> > > > > > > > > > > > > > On Fri, Jul 23, 2021 at 2:59 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > > > > > > > > > On Thu, Jul 22, 2021 at 11:50:11PM -0500, Ivan wrote:
> > > > > > > > > > > > > > > > On Thu, Jul 22, 2021 at 11:25 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > > > > > > 在 2021/7/23 上午10:54, Ivan 写道:
> > > > > > > > > > > > > > > > > > On Thu, Jul 22, 2021 at 9:37 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > > > > > > > > Does it work if you turn off lro before enabling the forwarding?
> > > > > > > > > > > > > > > > > > 0 root@NuRaid:~# ethtool -K eth0 lro off
> > > > > > > > > > > > > > > > > > Actual changes:
> > > > > > > > > > > > > > > > > > rx-lro: on [requested off]
> > > > > > > > > > > > > > > > > > Could not change any device features
> > > > > > > > > > > > > > > > > Ok, it looks like the device misses the VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> > > > > > > > > > > > > > > > > which makes it impossible to change the LRO setting.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Did you use qemu? If yes, what's the qemu version you've used?
> > > > > > > > > > > > > > > > These are VirtualBox machines, which I've been using for years with
> > > > > > > > > > > > > > > > longterm kernels 4.19, and I never had such a problem. But now that I
> > > > > > > > > > > > > > > > tried upgrading to kernels 5.10 or 5.13 -- the panics started. These
> > > > > > > > > > > > > > > > are just generic kernel builds, and a minimalistic userspace.
> > > > > > > > > > > > > > > I would be useful to see the features your virtualbox instance provides
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > cat /sys/class/net/eth0/device/features
> > > > > > > > > > > > > > # cat /sys/class/net/eth0/device/features
> > > > > > > > > > > > > > 1100010110111011111100000000000000000000000000000000000000000000
> > > > > > > > > > > > > I was able to reproduce the warning but not the panic.
> > > > > > > > > > > > > OTOH if LRO stays on when enabling forwarding that
> > > > > > > > > > > > > is already a problem. Any chance you can bisect to
> > > > > > > > > > > > > find out which change introduced the panic?
> > > > > > > > > > > >
> > > > > > > > > > > > Any kernels up to 4.19.198 don't panic.
> > > > > > > > > > > > Any kernels 5.10+ panic immediately upon starting forwarding.
> > > > > > > > > > > > I have not tested any kernels between 4.19 and 5.10.
> > > > > > > > > > > > I guess I can build a few kernels inbetween, and try pinpoint where it starts.
> > > > > > > > > > > > That may take a day or so. I'll get on with it now, and report my findings.
> > > > > > > > > > > So, I narrowed it down: the panics start with kernel 5.0-rc.
> > > > > > > > > > More narowly, the problem seems be coming from commit
> > > > > > > > > > a02e8964eaf9271a8a5fcc0c55bd13f933bafc56.
> > > > > > > > > > Just to test my suspicion, I deleted a few lines from that code,
> > > > > > > > > > and the panic went away. Hope that helps you guys figure out
> > > > > > > > > > what the problem might be.
> > > > > > > > Well it disables LRO but we knew this :( I'd help if we knew
> > > > > > > > where does it panic, all we see it the warning which is
> > > > > > > > related for sure but not the immediate rootcause ...
> > > > > > > >
> > > > > > > > > > --- a/drivers/net/virtio_net.c
> > > > > > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > > > > > @@ -2978,11 +2978,6 @@
> > > > > > > > > > }
> > > > > > > > > > if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
> > > > > > > > > > dev->features |= NETIF_F_RXCSUM;
> > > > > > > > > > - if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> > > > > > > > > > - virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6))
> > > > > > > > > > - dev->features |= NETIF_F_LRO;
> > > > > > > > > > - if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS))
> > > > > > > > > > - dev->hw_features |= NETIF_F_LRO;
> > > > > > > > > >
> > > > > > > > > > dev->vlan_features = dev->features;
> > > > > > > > > Just FYI, Google turned up two similar bug reposts...
> > > > > > > > > Apr 14, 2020 -- https://github.com/containers/podman/issues/5815
> > > > > > > > > Oct 09. 2020 -- https://bugzilla.kernel.org/show_bug.cgi?id=209593
> > > > > > > > >
> > > > > > > > > Is there any sensible thing I could do, temporarily, until this
> > > > > > > > > problem is sorted out?
> > > > > > > > > Or am I simply stuck to kernels 4.19 on these machines for now?
> > > > > > > >
> > > > > > > > Something like this I guess:
> > > > > > > >
> > > > > > > >
> > > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > > > index 8a58a2f013af..cc5982193a40 100644
> > > > > > > > --- a/drivers/net/virtio_net.c
> > > > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > > > @@ -3063,6 +3063,8 @@ static int virtnet_validate(struct virtio_device *vdev)
> > > > > > > > __virtio_clear_bit(vdev, VIRTIO_NET_F_MTU);
> > > > > > > > }
> > > > > > > >
> > > > > > > > + __virtio_clear_bit(vdev, VIRTIO_NET_F_GUEST_TSO4);
> > > > > > > > + __virtio_clear_bit(vdev, VIRTIO_NET_F_GUEST_TSO6);
> > > > > > > > return 0;
> > > > > > > > }
> > > > > > > When I apply your patch, then I see drastic (more than half)
> > > > > > > reductions in speed. (confirmed with iperf).
> > > > > > >
> > > > > > > But if instead I just remove a few lines from commit
> > > > > > > a02e8964eaf9271a8a5fcc0c55bd13f933bafc56
> > > > > > > as in my earlier post, then I'm back to full speed
> > > > > > >
> > > > > > > I understand that this is just temporary workaround, until we figure this out.
> > > > > >
> > > > > > Oh weird. So it's not about getting some weird LRO packet. We will get it with
> > > > > > VIRTIO_NET_F_GUEST_TSO4 anyway. It's about the LRO flag being set in
> > > > > > features.
> > > > > >
> > > > > > How about this then? Just pretend to Linux that we disabled LRO.
> > > > > >
> > > > > >
> > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > index 8a58a2f013af..8e7e4cea176b 100644
> > > > > > --- a/drivers/net/virtio_net.c
> > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > @@ -2651,8 +2651,9 @@ static int virtnet_set_features(struct net_device *dev,
> > > > > > ~GUEST_OFFLOAD_LRO_MASK;
> > > > > >
> > > > > > err = virtnet_set_guest_offloads(vi, offloads);
> > > > > > - if (err)
> > > > > > - return err;
> > > > > > + WARN_ON(err);
> > > > > > + //if (err)
> > > > > > + // return err;
> > > > > > vi->guest_offloads = offloads;
> > > > > > }
> > > > > No. With this applied, the problem persists:
> > > > >
> > > > > # echo "1" > /proc/sys/net/ipv4/ip_forward
> > > > >
> > > > > kernel: ------------[ cut here ]------------
> > > > > kernel: netdevice: eth0: failed to disable LRO!
> > > > > kernel: WARNING: CPU: 0 PID: 452 at net/core/dev.c:1768
> > > > > dev_disable_lro+0x108/0x150
> > > > > kernel: Modules linked in: sg nls_iso8859_1 nls_cp437 vfat fat
> > > > > hid_generic usbhid hid virtio_net net_failover failover aesni_intel
> > > > > libaes crypto_simd ohci_pci ahci libahci cryptd rapl ehci_pci ohci_hcd
> > > > > ehci_hcd usbcore usb_common libata evdev lpc_ich mfd_core rng_core
> > > > > i2c_piix4 i2c_core virtio_pci virtio_pci_modern_dev virtio_ring virtio
> > > > > rtc_cmos atkbd libps2 i8042 serio battery ac button loop unix
> > > > > kernel: CPU: 0 PID: 452 Comm: bash Not tainted 5.13.7-gnu.1-NuMini #1
> > > > > kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
> > > > > VirtualBox 12/01/2006
> > > > > kernel: RIP: 0010:dev_disable_lro+0x108/0x150
> > > > Again the warning isn't a big deal. I agree we should address - Jason
> > > > any update?
> > > I still think using NETIF_F_LRO might not be correct. Since we're
> > > basically receiving GSO packets.
> > >
> > > And it might cause a lot of issues if the device doesn't have
> > > VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
> > >
> > > I see two possible fixes:
> > >
> > > 1) using NETIF_F_GRO_HW instead (the patch is attached)
> > It's unfortunate you didn't inline. Anyway.
> > Ivan could you test the patch and report?
> >
> > > or
> > Hmm. I am not sure we always preserve the GRO_HW requirement that
> > packets can be re-segmented to reconstruct the original packet stream.
> > Do all backends guarantee this?
>
>
> I think we can't.
>
>
> > Could you explain why?
>
>
> Or we probably need another new netdev feature like rx-gso?
>
>
> >
> >
> >
> > > 2) set NETIF_F_LRO only if the device has CTRL_GUEST_OFFLOADS
> > >
> > > Thanks
> >
> > This one would slow guests on old hosts down significantly.
>
>
> Actually, it's not this proposal but see below.
>
>
> >
> > I am not sure why this didn't trigger previously
>
>
> It looks to me it was caused by a02e8964eaf9271a8a5fcc0c55bd13f933bafc56
> ("virtio-net: ethtool configurable LRO").
>
> Before this commit we won't even advertise NETIF_F_LRO, so dev_disable_lro()
> won't warn.
>
> After this commit, we advertise LRO and dev_disable_lro() will try to
> disable all guest offloads which will:
>
> 1) slow the traffic
>
> and
>
> 2) warn if "lro" can't be disabled on the device without ctrl guest offloads
> (e.g the virtualbox host)
>
> Thanks
OK. So I think I understand your comment now: GRO_HW makes sense simply
because historically before a02e8964eaf9271a8a5fcc0c55bd13f933bafc56 we
never advertised LRO.
Can you post a patch RFC properly so Ivan can test?
>
> > btw -
> > we used not to have CTRL_GUEST_OFFLOADS after all.
> >
> >
> >
> > > > But the main issue is you lose connectivity. That still
> > > > persists with this? Can't you get a serial connection
> > > > out? I know qemu Did the kernel oops afterwards?
> > > >
> > > > --
> > > > MST
> > > >
> >
> >
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
next prev parent reply other threads:[~2021-08-11 8:01 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <CACFia2dwacaVVYD+1uG=CDGaJqdCOSBvZ5FcXp04caecaWAY3w@mail.gmail.com>
2021-07-23 1:28 ` PROBLEM: virtio_net LRO kernel panics Tonghao Zhang
[not found] ` <CACFia2fDZHUZB5wJ7LK8M2sv_+W58rHw0NzzrwPPoX9=s7yPdQ@mail.gmail.com>
2021-07-23 2:37 ` Jason Wang
[not found] ` <CACFia2eLCJuy=w1r20691s_cSYkBkPaY-Dbd-9CkrcpSAe7z6g@mail.gmail.com>
2021-07-23 4:25 ` Jason Wang
[not found] ` <CACFia2eH3eCZxtt70LB5zoPbhLXRv=crPh5oOhR=6mY3auDdQA@mail.gmail.com>
2021-07-23 7:59 ` Michael S. Tsirkin
[not found] ` <CACFia2fWhWKMGF3g8SfU++2-jQ1rCKtCJo3h08KmhGfMTuZaQQ@mail.gmail.com>
2021-07-23 8:13 ` Michael S. Tsirkin
2021-07-23 12:10 ` Michael S. Tsirkin
[not found] ` <CACFia2en0JJDFyz3Umk-JTnMT=kjvRogt4PudED4kiLeMjcHFg@mail.gmail.com>
[not found] ` <CACFia2fx7Lt-4o_uqDznvk-VgdsMtD64qv6RYkrCjKLu2yt8bg@mail.gmail.com>
[not found] ` <CACFia2eUi4KNRC7MYktzUS9Nq2WcBiesX04Tbn2pTuvuGkY4qA@mail.gmail.com>
[not found] ` <CACFia2dns1rTe5OQj4H-kpurVm2CTtGfAXz0aOUS0_cs0QUrsA@mail.gmail.com>
2021-07-27 9:11 ` Michael S. Tsirkin
[not found] ` <CACFia2dLp19pzJsScSvVYREpQm0n6XCWLieWXzA94=OVYVHTbw@mail.gmail.com>
2021-08-02 19:51 ` Michael S. Tsirkin
[not found] ` <CACFia2f8xmOwB69Cj+OUNobNSurVnrLrJFdrxnmurww9aSzJMw@mail.gmail.com>
2021-08-10 15:31 ` Michael S. Tsirkin
2021-08-11 3:38 ` Jason Wang
2021-08-11 7:39 ` Michael S. Tsirkin
2021-08-11 7:45 ` Jason Wang
2021-08-11 8:01 ` Michael S. Tsirkin [this message]
2021-08-11 8:17 ` Jason Wang
[not found] ` <CACFia2fYQG4Y3_ffym06C1HGrOiOS38YWxuoUu4HYorwS9qOjA@mail.gmail.com>
2021-07-23 8:59 ` Michael S. Tsirkin
2021-07-30 11:42 ` Michael S. Tsirkin
[not found] ` <CACFia2d82tqUrLwpt0fowm3DpD7+HXM9Vcfz56eQ_AkTcWmOEg@mail.gmail.com>
2021-07-31 20:53 ` Michael S. Tsirkin
2021-08-02 4:35 ` Jason Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210811035908-mutt-send-email-mst@kernel.org \
--to=mst@redhat.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=ivan@prestigetransportation.com \
--cc=jasowang@redhat.com \
--cc=kuba@kernel.org \
--cc=virtualization@lists.linux-foundation.org \
--cc=willemb@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).