From: Michael Tokarev <mjt@tls.msk.ru>
To: netdev <netdev@vger.kernel.org>
Subject: Re: weird network problem - stalls, reload works
Date: Mon, 10 Jan 2011 15:36:13 +0300
Message-ID: <4D2AFD3D.3010701@msgid.tls.msk.ru>
In-Reply-To: <4CFC17B8.4050908@msgid.tls.msk.ru>

Replying to my old email, full details below.

So I replaced the motherboard on this machine,
and now everything is working fine.  It is difficult
to tell whether it was really a hardware issue or a
software problem specific to this hardware,
but the problem is weird enough.

What's more, I can't reproduce the issue on this
motherboard in a test environment.

/mjt

06.12.2010 01:52, Michael Tokarev wrote:
> Hello.
> 
> I have a weird networking problem here, which I've
> been trying to hunt down for some time.
> 
> Small LAN, just 3 machines and a server, all in
> a single small room, all connected to a 100Mbps switch.
> 
> Sometimes the network between the (Linux) server and
> workstations just stops.  It may happen after
> transferring a few megabytes of data (rare), or the
> whole thing may work for several days or even
> weeks in a row, but the end result is the same: at
> some point it stalls.
> 
> Reloading the interface in question, like this:
> 
>  ifdown eth0; sleep 2; ifup eth0
> 
> restores the network, until it breaks again.
> Note that, say, sleep 1 is not sufficient
> to restore the functionality; it has little effect.
> With no sleep at all, the reload makes almost no
> difference, i.e. it does not help.
> 
> The stalls look as if the server is suffering from
> massive packet loss in the receive path.  It does not
> lose all packets, and the amount of lost packets
> increases with time, over a timeframe of several
> minutes.
> 
> During a data transfer from a client machine to this
> Linux box, it goes at the full ~10MB/s speed; then, when
> the stall is about to happen, the speed drops to 6MB/s,
> 4, 1MB/s, 600KB/s, till eventually the connection just
> times out.
> 
> The interesting data point is that the NIC does not
> generate any interrupts during such stalls, as if
> no packets were coming from the network at
> all - even though during that time the client workstations
> are sending ARP requests (if nothing more).
> 
> Here's what ping on the server looks like (pinging one
> of the machines on the LAN):
> 
> 64 bytes from 192.168.78.20: icmp_seq=1 ttl=128 time=5008 ms
> 64 bytes from 192.168.78.20: icmp_seq=2 ttl=128 time=5000 ms
> 64 bytes from 192.168.78.20: icmp_seq=3 ttl=128 time=6000 ms
> 64 bytes from 192.168.78.20: icmp_seq=4 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=5 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=6 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=7 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=8 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=9 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=10 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=11 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=12 ttl=128 time=6320 ms
> 64 bytes from 192.168.78.20: icmp_seq=13 ttl=128 time=6000 ms
> 64 bytes from 192.168.78.20: icmp_seq=14 ttl=128 time=6000 ms
> 64 bytes from 192.168.78.20: icmp_seq=15 ttl=128 time=6000 ms
> 64 bytes from 192.168.78.20: icmp_seq=16 ttl=128 time=6000 ms
> 64 bytes from 192.168.78.20: icmp_seq=17 ttl=128 time=6000 ms
> 64 bytes from 192.168.78.20: icmp_seq=18 ttl=128 time=6000 ms
> 64 bytes from 192.168.78.20: icmp_seq=19 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=20 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=21 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=22 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=23 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=24 ttl=128 time=6007 ms
> 64 bytes from 192.168.78.20: icmp_seq=25 ttl=128 time=6001 ms
> 64 bytes from 192.168.78.20: icmp_seq=26 ttl=128 time=6010 ms
> 64 bytes from 192.168.78.20: icmp_seq=27 ttl=128 time=5014 ms
> 64 bytes from 192.168.78.20: icmp_seq=28 ttl=128 time=5011 ms
> 64 bytes from 192.168.78.20: icmp_seq=29 ttl=128 time=5020 ms
> 64 bytes from 192.168.78.20: icmp_seq=30 ttl=128 time=5020 ms
> 64 bytes from 192.168.78.20: icmp_seq=31 ttl=128 time=6018 ms
> 64 bytes from 192.168.78.20: icmp_seq=32 ttl=128 time=7010 ms
> 64 bytes from 192.168.78.20: icmp_seq=33 ttl=128 time=7008 ms
> 64 bytes from 192.168.78.20: icmp_seq=34 ttl=128 time=7000 ms
> 64 bytes from 192.168.78.20: icmp_seq=35 ttl=128 time=7000 ms
> 
> It looks like the NIC does not deliver any packets on its
> own, but notices something has arrived when you actually try
> to _send_ something - hence the delays above, almost whole
> seconds (since ping sends data at 1sec intervals).
> 
> Here's normal ping output right after "restarting" the interface:
> 
> 64 bytes from 192.168.78.20: icmp_seq=1 ttl=128 time=0.161 ms
> 64 bytes from 192.168.78.20: icmp_seq=2 ttl=128 time=0.119 ms
> 64 bytes from 192.168.78.20: icmp_seq=3 ttl=128 time=0.117 ms
> 64 bytes from 192.168.78.20: icmp_seq=4 ttl=128 time=0.381 ms
> 64 bytes from 192.168.78.20: icmp_seq=5 ttl=128 time=0.131 ms
> 64 bytes from 192.168.78.20: icmp_seq=6 ttl=128 time=0.133 ms
> 
> And at restart, the following gets printed in dmesg:
> 
> [ 3439.360831] forcedeth 0000:00:0a.0: irq 47 for MSI/MSI-X
> 
> 
> So far we have tried replacing everything in this network:
> first the NIC on the server, then all wires, the switch,
> and even the client computers (upgraded them from
> old to current hardware).  Even changing the NIC on
> the server did not help - rtl8139 behaves the same way,
> but it needs a bit more time to trigger the issue.
> 
> The problem happens with several different kernels - at
> least 2.6.27 triggers it, and 2.6.32 and 2.6.35 behave
> the same, 32- or 64-bit.
> 
> The machine is based on an Asus M2N-VM DVI motherboard, which
> is an nVidia MCP67-based system.  The NIC is the on-board
> forcedeth (and as I mentioned above, the same problem happens
> with an rtl8139 card).
> 
> This machine has 2 more NICs installed (used for the WAN link and
> for another tiny LAN segment) - these do not show the issue,
> but they both run at 10Mbps, so maybe they need 10x more time.
> When the eth0 LAN segment stops working, the rest of the system
> works just fine, including these 2 NICs and the hard drives.
> 
> I also tried to disable MSI by loading forcedeth with msi=0 -
> this results in IO-APIC-fasteoi being used for the NIC instead
> of the usual PCI-MSI-edge, but does not change the situation.
> 
> So I'm quite stuck here, and don't know what to do next.
> My next bet is to try another motherboard, in the hope that
> this is just some broken interrupt controller, but that seems
> a bit too unreal...
> 
> Any hints on what to try are greatly appreciated...
> 
> Thanks!
> 
> /mjt
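
The quoted report says the NIC generates no interrupts during a
stall.  One way to check that is to take two snapshots of
/proc/interrupts a few seconds apart and see whether the counter
for eth0 has moved.  Below is a minimal sketch in Python, not
something from the original thread.  The function names
(irq_counts, stalled_irqs, read_interrupts) are made up for
illustration, and the parser assumes the usual layout of an IRQ
label, one counter per CPU, then the controller type and device
names.

```python
def irq_counts(interrupts_text, device):
    """Sum the per-CPU counters of every IRQ line that names `device`.

    `interrupts_text` is the content of /proc/interrupts.  Assumed layout:
    "<irq>:  <count per CPU>...  <controller type>  <device names>".
    Returns a dict mapping the IRQ label (e.g. "47") to its total count.
    """
    totals = {}
    for line in interrupts_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue  # skip the "CPU0 CPU1 ..." header and blank lines
        # Device names come last, possibly comma-separated when shared.
        names = [name.strip(",") for name in fields[1:]]
        if device not in names:
            continue
        total = 0
        for field in fields[1:]:
            if not field.isdigit():
                break  # counters end where the controller type begins
            total += int(field)
        totals[fields[0].rstrip(":")] = total
    return totals


def stalled_irqs(before, after):
    """Return the IRQ labels whose counters did not advance."""
    return sorted(irq for irq in before if after.get(irq, 0) <= before[irq])


def read_interrupts(path="/proc/interrupts"):
    """Read the current interrupt table (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

To use it, call irq_counts(read_interrupts(), "eth0") twice with a
few seconds of sleep in between, while a client keeps sending
traffic, and pass both results to stalled_irqs.  A counter that
has not moved is consistent with the symptom described, but it
does not by itself say whether the NIC, the interrupt controller
or the driver is at fault.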

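
The quoted report also observes that the round-trip times during
a stall sit at almost whole seconds, which fits replies being
noticed only when the next ping is sent.  The sketch below, again
in Python with made-up helper names, parses ping output and flags
replies whose RTT is close to a whole multiple of the send
interval.  The 25 ms tolerance is an arbitrary assumption, and
the check only measures the clustering; it does not prove the
cause.

```python
import re

# Matches lines such as:
#   64 bytes from 192.168.78.20: icmp_seq=1 ttl=128 time=5008 ms
# A leading "> " quote marker is tolerated because the pattern is unanchored.
RTT_RE = re.compile(r"icmp_seq=(\d+)\b.*?\btime=([\d.]+) ms")


def parse_rtts(ping_output):
    """Return a list of (icmp_seq, rtt_ms) tuples found in ping output."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in RTT_RE.finditer(ping_output)]


def near_whole_interval(rtt_ms, interval_ms=1000.0, tolerance_ms=25.0):
    """True if rtt_ms is within tolerance_ms of a nonzero multiple of the interval.

    interval_ms defaults to ping's 1 s send interval; tolerance_ms is an
    illustrative assumption, not a value taken from the report.
    """
    multiple = round(rtt_ms / interval_ms)
    return multiple >= 1 and abs(rtt_ms - multiple * interval_ms) <= tolerance_ms


def fraction_near_interval(rtts, interval_ms=1000.0, tolerance_ms=25.0):
    """Fraction of replies whose RTT clusters on a multiple of the interval."""
    if not rtts:
        return 0.0
    hits = sum(near_whole_interval(rtt, interval_ms, tolerance_ms)
               for _, rtt in rtts)
    return hits / len(rtts)
```

Fed the stalled ping output quoted above, nearly every reply is
flagged, with time=6320 ms the one clear exception, while the
sub-millisecond replies seen after the interface reload are not
flagged at all.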


Thread overview: 3+ messages
2010-12-05 22:52 weird network problem - stalls, reload works Michael Tokarev
2010-12-07 19:20 ` Jarek Poplawski
2011-01-10 12:36 ` Michael Tokarev [this message]
