* Re: [Bugme-new] [Bug 12282] New: Network data corruption on eee 1000
       [not found] <bug-12282-10286@http.bugzilla.kernel.org/>
@ 2008-12-24  7:20 ` Andrew Morton
  2008-12-24 12:17   ` Michel Lespinasse
  2008-12-24 13:32   ` J. K. Cliburn
  0 siblings, 2 replies; 6+ messages in thread
From: Andrew Morton @ 2008-12-24  7:20 UTC
To: netdev; +Cc: bugme-daemon, walken, J. K. Cliburn, Jie Yang

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 23 Dec 2008 21:24:45 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12282
>
>            Summary: Network data corruption on eee 1000
>            Product: Drivers
>            Version: 2.5
>      KernelVersion: 2.6.28-rc8
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Network
>         AssignedTo: jgarzik@pobox.com
>         ReportedBy: walken@zoy.org
>
>
> Latest working kernel version: unknown
> Earliest failing kernel version: 2.6.28-rc8
> Distribution: debian lenny
> Hardware Environment: eee 1000, no hardware changes except for a 2GB memory
> upgrade.
> Software Environment:
> Problem Description: Intermittent data corruption over wired network
>
>
> Running debian lenny on my eee 1000, I've seen occasional scp failures where
> scp would complain about a corrupted MAC when copying files around on my local
> network. Also when compiling things over NFS I occasionally got my source files
> to appear corrupted on the client (while they were still fine on the server)
> and when I tried running things in an nfsroot environment (I know this sounds
> silly for a laptop, but I see it as a good way to try new software without
> having to install it on disk), I got occasional segfaults in various processes.
> Since I've not seen such failures when running with a disk based root, I blame
> them all on the networking subsystem.
>
>
> I've been running the following command as a way to try and reproduce the
> problem:
>
> for x in 0 1 2 3 4 5 6 7 8 9; do for y in 0 1 2 3 4 5 6 7 8 9; do for z in 0 1
> 2 3 4 5 6 7 8 9; do echo $x$y$z; scp server:shared/net_test/data1GB /tmp ||
> sleep 36000; date; done; done; done
> 000
> data1GB                                  100% 1005MB   5.2MB/s   03:15
> Tue Dec 23 20:17:36 PST 2008
> 001
> data1GB                                  100% 1005MB   5.2MB/s   03:12
> Tue Dec 23 20:20:49 PST 2008
> 002
> data1GB                                  100% 1005MB   5.2MB/s   03:13
> Tue Dec 23 20:24:03 PST 2008
> 003
> data1GB                                  100% 1005MB   6.4MB/s   02:38
> Tue Dec 23 20:26:42 PST 2008
> 004
> data1GB                                   98%  994MB   5.4MB/s   00:02
> ETADisconnecting: Corrupted MAC on input.
> lost connection
>
> The failures don't always happen at the same place, and they might be slightly
> more likely soon after boot, but I'm not sure about that.
>
> Even after scp detected some data corruption, ifconfig does not report any
> errors:
>
> eth0      Link encap:Ethernet  HWaddr 00:22:15:85:7c:94
>           inet addr:10.3.0.1  Bcast:10.255.255.255  Mask:255.0.0.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:3683950 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1432256 errors:0 dropped:0 overruns:0 carrier:2
>           collisions:0 txqueuelen:1000
>           RX bytes:1246310892 (1.1 GiB)  TX bytes:101092933 (96.4 MiB)
>           Interrupt:59
>
> (Note the RX bytes value is also wrong since I transferred almost 5GB above,
> I believe this is because the value wraps around after 4GB ? Also,
> /proc/interrupts reports >3 million interrupts (PCI-MSI-edge) on eth0)
>
> I'm tempted to blame either the hardware or the newish atl1e network driver,
> but have no hard proof either way at this point.
>
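The "RX bytes" remark in the report can be sanity-checked with a little arithmetic. The following is only a sketch under assumptions the report does not confirm: that the counter is an unsigned 32-bit value, that it wrapped exactly once, that scp's "MB" means 2**20 bytes, and that the interface carried little other traffic since boot. The helper name `unwrap_counter` is made up for illustration.

```python
# Sanity check for the "RX bytes" remark: if the counter is 32 bits wide
# (an assumption, not confirmed in the report), it wraps every 2**32 bytes.

COUNTER_MODULUS = 2 ** 32  # assumed width of the RX byte counter


def unwrap_counter(reported: int, wraps: int) -> int:
    """Return the byte count implied by `reported` after `wraps` wrap-arounds."""
    return reported + wraps * COUNTER_MODULUS


# Payload seen in the scp log: four complete 1005 MB copies plus 994 MB of
# the fifth. scp's "MB" is treated as 2**20 bytes here (an assumption).
payload_bytes = (4 * 1005 + 994) * 2 ** 20

reported_rx = 1246310892  # "RX bytes" value from the ifconfig output
actual_rx = unwrap_counter(reported_rx, wraps=1)

print(f"payload transferred  : {payload_bytes / 2 ** 30:.2f} GiB")
print(f"RX bytes, one wrap   : {actual_rx / 2 ** 30:.2f} GiB")
overhead = (actual_rx - payload_bytes) / payload_bytes
print(f"excess over payload  : {overhead:.1%}")
```

If the assumptions hold, a single wrap puts the real RX count a few percent above the scp payload, which is roughly what TCP/IP and ssh framing would add. That is consistent with the reporter's guess, but it does not prove that the counter is 32 bits wide.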
* Re: [Bugme-new] [Bug 12282] New: Network data corruption on eee 1000
  2008-12-24  7:20 ` [Bugme-new] [Bug 12282] New: Network data corruption on eee 1000 Andrew Morton
@ 2008-12-24 12:17   ` Michel Lespinasse
  2008-12-24 13:32   ` J. K. Cliburn
  1 sibling, 0 replies; 6+ messages in thread
From: Michel Lespinasse @ 2008-12-24 12:17 UTC
To: Andrew Morton; +Cc: netdev, bugme-daemon, J. K. Cliburn, Jie Yang

At this point I wonder if this could be an issue with marginal memory
timings that somehow only gets triggered when transferring data with the
network adapter, and never when memory is accessed by the CPU. But is that
even possible?

Here are a few additional data points I collected:

In order to see what the raw data looks like before scp complains about the
corrupted MAC, I decided to drop scp and use nfs + cp + md5sum:

cp /mnt/shared/net_test/data1GB /tmp; md5sum /tmp/data1GB

(/mnt/shared is an nfs3 over tcp mount, and /tmp is a tmpfs).

After a few tries I usually get the wrong md5sum in /tmp/data1GB. I then
copy the file back to the server, check that it arrived there with the same
corrupted md5sum as it had on the eee client side, and use "cmp -l" to
figure out what's different between the original and the corrupted file.

Turns out that in all cases I've observed, the corrupted file had a 128-byte
region with unexpected (garbage) contents. Not just single bits being
flipped, but the whole region being entirely different. The regions were not
necessarily aligned on a 128-byte boundary relative to the start of the
file, though.

At this point I wondered "bad memory?" and I swapped back the original 1GB
stick that came with the EEE 1000, instead of the 2GB upgrade I had
installed on the first day. Turns out that only made things worse! With that
stick, I still see some 128-byte regions getting corrupted, and I
additionally see a few bytes here and there (always at an offset that is a
multiple of 4 relative to the start of the file) having bit 0x02 set when
they should not. If I run md5sum on the /tmp file multiple times I will
always get the same hash, but it did take me 3 trials (with a 500MB file, my
/tmp is smaller now that I have only 1GB of memory) before I did end up with
a copy on the server that had the same hash as the corrupted /tmp file. The
two other copies had a few more 0x02 bits mistakenly set here and there.

Both memory sticks do check out fine with "memtester" (I have not tried
memtest86 yet), and I don't observe any trouble when not using the LAN.

Could this be a timing issue that would only show up when transferring
between memory and the network adapter? And if so, what can we even do about
it? I'm using bios version 0803, which is the most recent available for the
EEE 1000.

I won't be able to do much testing in the following week as I'll be away
from my LAN :) , I should be able to get wireless and read my email though.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
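The "cmp -l" analysis described in the message above (finding where the corrupted copy differs, how long each damaged region is, and how it is aligned) could be automated along the following lines. This is a hedged sketch rather than the tool that was actually used: the function names and the `max_gap` parameter are invented for illustration, and it works on in-memory buffers, so a 1GB file would have to be processed in chunks.

```python
from typing import Iterable, Iterator, List, Tuple


def diff_offsets(a: bytes, b: bytes) -> Iterator[int]:
    """Yield every offset at which the two buffers differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            yield i


def group_regions(offsets: Iterable[int], max_gap: int = 8) -> List[Tuple[int, int]]:
    """Merge differing offsets into (start, length) regions.

    Garbage bytes can match the original by chance, so offsets that are at
    most `max_gap` bytes apart are treated as one region (hypothetical knob).
    """
    regions: List[Tuple[int, int]] = []
    start = prev = None
    for off in offsets:
        if start is None:
            start = prev = off
        elif off - prev <= max_gap:
            prev = off
        else:
            regions.append((start, prev - start + 1))
            start = prev = off
    if start is not None:
        regions.append((start, prev - start + 1))
    return regions


# Synthetic demonstration data, not taken from the thread: one 128-byte
# garbage region plus one byte with bit 0x02 wrongly set.
original = bytes(range(256)) * 16
corrupted = bytearray(original)
corrupted[1000:1128] = bytes(b ^ 0xFF for b in original[1000:1128])
corrupted[2048] |= 0x02

regions = group_regions(diff_offsets(original, bytes(corrupted)))
for start, length in regions:
    print(f"offset {start:#x}: {length} bytes, "
          f"start % 128 = {start % 128}, start % 4 = {start % 4}")
```

Printing `start % 128` and `start % 4` makes the two observations from the message easy to check per region. Because boundary bytes of a garbage block may match the original by chance, a reported length slightly under 128 would not rule out a 128-byte event.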
* Re: [Bugme-new] [Bug 12282] New: Network data corruption on eee 1000
  2008-12-24  7:20 ` [Bugme-new] [Bug 12282] New: Network data corruption on eee 1000 Andrew Morton
  2008-12-24 12:17   ` Michel Lespinasse
@ 2008-12-24 13:32   ` J. K. Cliburn
  2008-12-25  4:19     ` Michel Lespinasse
                       ` (2 more replies)
  1 sibling, 3 replies; 6+ messages in thread
From: J. K. Cliburn @ 2008-12-24 13:32 UTC
To: walken; +Cc: netdev, bugme-daemon, Jie Yang

On Wed, Dec 24, 2008 at 1:20 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Tue, 23 Dec 2008 21:24:45 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:
>
>> http://bugzilla.kernel.org/show_bug.cgi?id=12282
>>
>>            Summary: Network data corruption on eee 1000

Do things improve if you turn off TSO in the atl1e driver?

ethtool -K eth0 tso off
* Re: [Bugme-new] [Bug 12282] New: Network data corruption on eee 1000
  2008-12-24 13:32   ` J. K. Cliburn
@ 2008-12-25  4:19     ` Michel Lespinasse
  2008-12-31  2:20     ` Michel Lespinasse
  2008-12-31  9:34     ` Michel Lespinasse
  2 siblings, 0 replies; 6+ messages in thread
From: Michel Lespinasse @ 2008-12-25  4:19 UTC
To: J. K. Cliburn; +Cc: netdev, bugme-daemon, Jie Yang

On Wed, Dec 24, 2008 at 07:32:36AM -0600, J. K. Cliburn wrote:
> Do things improve if you turn off TSO in the atl1e driver?
>
> ethtool -K eth0 tso off

I'm currently away on vacation but I should be able to test this next week.
I will even try with both memory sticks, as the 1GB one seemed to give more
issues when using the wired network (but still worked fine when just using
the CPU).

Merry Christmas everyone! :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [Bugme-new] [Bug 12282] New: Network data corruption on eee 1000
  2008-12-24 13:32   ` J. K. Cliburn
  2008-12-25  4:19     ` Michel Lespinasse
@ 2008-12-31  2:20     ` Michel Lespinasse
  2008-12-31  9:34     ` Michel Lespinasse
  2 siblings, 0 replies; 6+ messages in thread
From: Michel Lespinasse @ 2008-12-31  2:20 UTC
To: J. K. Cliburn; +Cc: netdev, bugme-daemon, Jie Yang

On Wed, Dec 24, 2008 at 07:32:36AM -0600, J. K. Cliburn wrote:
> Do things improve if you turn off TSO in the atl1e driver?
>
> ethtool -K eth0 tso off

Seems to work: I get "operation not supported" when trying this ethtool
command. However, I compiled 2.6.28 with a patch to not set NETIF_F_TSO and
NETIF_F_TSO6 into netdev->features and I've been able to transfer 180 GB
with scp overnight without running into any corrupted MACs.

One thing I do not understand: I thought the tso option was only meaningful
on the sender side? In my case, the transfers are going from the external
server to the local atl1e based interface...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [Bugme-new] [Bug 12282] New: Network data corruption on eee 1000
  2008-12-24 13:32   ` J. K. Cliburn
  2008-12-25  4:19     ` Michel Lespinasse
  2008-12-31  2:20     ` Michel Lespinasse
@ 2008-12-31  9:34     ` Michel Lespinasse
  2 siblings, 0 replies; 6+ messages in thread
From: Michel Lespinasse @ 2008-12-31  9:34 UTC
To: J. K. Cliburn; +Cc: netdev, bugme-daemon, Jie Yang

On Wed, Dec 24, 2008 at 07:32:36AM -0600, J. K. Cliburn wrote:
> Do things improve if you turn off TSO in the atl1e driver?
>
> ethtool -K eth0 tso off

I said yes in a previous message, but I think I was confused.

I'm now running 2.6.28 with a driver change that removes NETIF_F_TSO and
NETIF_F_TSO6 from netdev->features. Last night I transferred 180 GB with scp
without any corrupted MACs, however today while compiling stuff over NFS I
got a 128-byte block of source code that was corrupted (replaced with text
from a different source file).

So, my issue does not seem to disappear even after disabling TSO support in
the driver. Also the issue seems to be somewhat intermittent, since I was
not able to trigger it at will with scp last night.... :/

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
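Since the corruption is intermittent, an unattended loop modelled on the cp + md5sum test from earlier in the thread is one way to catch a bad copy and keep it for later comparison. What follows is a minimal sketch, not the procedure used by anyone in the thread: the function names are hypothetical, the reference hash is assumed to have been computed on the server, and the demonstration uses a small local temporary file in place of the NFS mount.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_until_mismatch(src: Path, dst_dir: Path, expected_md5: str,
                        max_tries: int) -> Optional[Tuple[int, Path]]:
    """Copy `src` repeatedly; return (attempt, path) of the first bad copy.

    Good copies are deleted, the first bad one is kept for inspection.
    Returns None if every copy matched the expected hash.
    """
    for attempt in range(1, max_tries + 1):
        dst = dst_dir / f"{src.name}.try{attempt}"
        shutil.copyfile(src, dst)
        if md5_of(dst) != expected_md5:
            return attempt, dst
        dst.unlink()
    return None


# Local demonstration with synthetic data; on the affected machine `src`
# would live on the NFS mount and `dst_dir` on the tmpfs.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    src = tmp_dir / "data.bin"
    src.write_bytes(bytes(range(256)) * 4096)  # 1 MiB of test data
    reference = md5_of(src)
    result = copy_until_mismatch(src, tmp_dir, reference, max_tries=3)
    print("first bad copy:", result)
```

One caveat the sketch does not address: repeated reads of the same file over NFS may be served from the client's page cache instead of the network, so later attempts might not exercise the adapter at all unless the file is larger than the available cache or the cache is dropped between attempts.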