netdev.vger.kernel.org archive mirror
From: Stephen Hemminger <stephen@networkplumber.org>
To: netdev@vger.kernel.org
Subject: Fw: [Bug 217678] New: Unexplainable packet drop starting at v6.4
Date: Mon, 17 Jul 2023 11:53:52 -0700
Message-ID: <20230717115352.79aecc71@hermes.local>



Begin forwarded message:

Date: Mon, 17 Jul 2023 17:44:27 +0000
From: bugzilla-daemon@kernel.org
To: stephen@networkplumber.org
Subject: [Bug 217678] New: Unexplainable packet drop starting at v6.4


https://bugzilla.kernel.org/show_bug.cgi?id=217678

            Bug ID: 217678
           Summary: Unexplainable packet drop starting at v6.4
           Product: Networking
           Version: 2.5
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: Other
          Assignee: stephen@networkplumber.org
          Reporter: hq.dev+kernel@msdfc.xyz
        Regression: No

Hi,

After I updated to 6.4 through an Arch Linux kernel update, I suddenly noticed
random packet loss on my router-like nodes. These nodes have the following
networking-relevant configuration:

1. Running Arch Linux
2. Network configuration through systemd-networkd
3. Using bird2 for BGP routing, but this is not relevant to the bug.
4. Using nftables for traffic control, but this does not seem relevant to the
bug.
5. Not using fail2ban-like dynamic filtering tools, at least at the L3/L4 level

After ruling out systemd-networkd and nftables as the cause, I tracked the
issue down to the kernel.

Here is the tcpdump output I'm seeing on one of my nodes:

```
sudo tcpdump -i fios_wan port 38851
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on fios_wan, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:33:06.073236 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:11.406607 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:16.739969 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:21.859856 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:27.193176 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
5 packets captured
5 packets received by filter
0 packets dropped by kernel
```

But on the other side, "[REDACTED_PUBLIC_IPv4_1]", tcpdump shows replies being
sent in this WireGuard stream. So the packets are lost somewhere on the link.
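One way to make the one-directional pattern explicit is to count packets per
direction in the capture text. The following is only a rough sketch: the
function name `count_directions` and the regular expression are my own, and it
assumes tcpdump's default one-line IPv4 summary format with each packet on a
single line, as in the capture above.

```python
import re
from collections import Counter

# Assumed format: "<time> IP <src>.<sport> > <dst>.<dport>: ..."
# (tcpdump's default one-line summary for IPv4 TCP/UDP packets).
LINE_RE = re.compile(
    r"^\S+ IP (?P<src>\S+)\.(?P<sport>\d+) > (?P<dst>\S+)\.(?P<dport>\d+):"
)


def count_directions(lines, local_host):
    """Count packets sent by, received by, or unrelated to local_host."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # banner, summary, or non-IPv4 line
        if match.group("src") == local_host:
            counts["outbound"] += 1
        elif match.group("dst") == local_host:
            counts["inbound"] += 1
        else:
            counts["other"] += 1
    return counts


# Two lines taken from the capture in the report.
SAMPLE = [
    "10:33:06.073236 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148",
    "10:33:11.406607 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148",
]

counts = count_directions(SAMPLE, "[BOS1_NODE]")
print(counts["outbound"], counts["inbound"])
```

A healthy WireGuard stream would normally show packets in both directions, so
an inbound count of zero over a long capture is consistent with replies never
reaching the capture point.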

From the other side, I ran "mtr" to "[BOS1_NODE]"'s public IP and found that
the path breaks right at "[BOS1_NODE]". That means "[BOS1_NODE]"'s networking
stack is dropping all inbound packets from specific IP addresses.

Some more digging:

1. The problem starts at varying times after boot. Sometimes it triggers 30
seconds after booting, and sometimes after 18 hours or more.
2. It can evolve into a worse state in which "ip neigh show" reports the IPv4
ARP table and IPv6 neighbor discovery entries as "invalid", meaning
connectivity is completely lost.
3. When this happens on the WAN-facing interface, the LAN-facing interfaces
seem OK. The WAN interface is an Intel X710-T4L using i40e, and the LAN side
uses virtio.
4. I tried to bisect between 6.3 and 6.4, and the first bad commit reported
was "a3efabee5878b8d7b1863debb78cb7129d07a346". But that commit is not related
to networking at all, so it may be the wrong commit to look at. Also, because I
haven't found a way to trigger the issue reliably, some commits marked "good"
during the bisect may actually be bad.
5. I also looked at "dmesg", but nothing interesting showed up. I can make it
available upon request.
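To catch the moment described in item 2, the output of "ip neigh show" could be
summarized periodically. The sketch below is illustrative only: the function
name `summarize_neighbors` and the sample text (which uses documentation
addresses, not data from the affected node) are made up. It also assumes that
the entries described as "invalid" correspond to the FAILED or INCOMPLETE
neighbor states, which the report does not confirm.

```python
from collections import Counter

# Neighbor (NUD) states that iproute2 prints as the last field of each entry.
NUD_STATES = {
    "PERMANENT", "NOARP", "REACHABLE", "STALE", "NONE",
    "INCOMPLETE", "DELAY", "PROBE", "FAILED",
}
# Assumption: these are the states the report calls "invalid".
UNRESOLVED = {"INCOMPLETE", "FAILED"}


def summarize_neighbors(text):
    """Return (state counts, addresses whose entry is unresolved)."""
    states = Counter()
    unresolved = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[-1] not in NUD_STATES:
            continue  # skip blank or unrecognized lines
        state = fields[-1]
        states[state] += 1
        if state in UNRESOLVED:
            unresolved.append(fields[0])
    return states, unresolved


# Hypothetical "ip neigh show" output for illustration.
SAMPLE_NEIGH = """\
192.0.2.1 dev fios_wan lladdr 00:00:5e:00:53:01 REACHABLE
192.0.2.2 dev fios_wan FAILED
2001:db8::1 dev fios_wan lladdr 00:00:5e:00:53:02 router STALE
192.0.2.3 dev fios_wan INCOMPLETE
"""

states, unresolved = summarize_neighbors(SAMPLE_NEIGH)
print(dict(states), unresolved)
```

Logging this summary with a timestamp from boot onward would give a rough
onset time for each test kernel, which might help decide how long a commit has
to run before it can be marked "good" during a bisect.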

This is my first bug report. Sorry for any confusion it may cause, and thanks
for reading.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are the assignee for the bug.

Thread overview: 2+ messages
2023-07-17 18:53 Stephen Hemminger [this message]
2023-07-18 21:03 ` [Bug 217678] New: Unexplainable packet drop starting at v6.4 Jakub Kicinski
