* data corruption in skge hardware
From: Mikulas Patocka @ 2011-11-07 16:42 UTC
To: Stephen Hemminger; +Cc: netdev
Hi
I found data corruption on an skge network card.
The card is this: "03:06.0 Ethernet controller: 3Com Corporation 3c940
10/100/1000Base-T [Marvell] (rev 10)"
The machine has two quad-core Opterons with an HT2000 north bridge and an
HT1000 south bridge.
When "scatter-gather" and "generic-segmentation-offload" are enabled, the
card sends out corrupted packets.
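For anyone trying to reproduce this, the two features can be switched with "ethtool -K". Below is a minimal sketch, assuming the ethtool binary is installed and the script runs as root; the interface name "eth0" and the helper names are placeholders, not taken from the report.

```python
import subprocess


def offload_command(iface, sg, gso):
    """Build the ethtool argument list that sets scatter-gather and GSO."""
    def state(enabled):
        return "on" if enabled else "off"

    return ["ethtool", "-K", iface, "sg", state(sg), "gso", state(gso)]


def set_offloads(iface, sg, gso):
    """Apply the settings; requires root and the ethtool binary."""
    subprocess.run(offload_command(iface, sg, gso), check=True)


if __name__ == "__main__":
    # "eth0" is a placeholder, not the reporter's actual interface name.
    print(" ".join(offload_command("eth0", sg=False, gso=False)))
```

Calling set_offloads("eth0", sg=False, gso=False) before rerunning the workload would show whether the corruption depends on these features.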
It normally manifests as a dropped ssh connection once every few days, but
I found a workload that triggers this bug quickly.
I ran tcpdump on both the sending and the receiving machine and caught the
packet corruption:
correct packet (on the sending machine):
19:03:21.131836 IP hydra.ssh > phoebe.58913: Flags [P.], seq 53712:53808,
ack 1, win 193, options [nop,nop,TS val 8677173 ecr 1211608], length 96
0x0000: 4510 0094 c7bf 4000 4006 f12d c0a8 8007
0x0010: c0a8 800e 0016 e621 2d64 84e6 1fc2 3f5b
0x0020: 8018 00c1 81ed 0000 0101 080a 0084 6735
0x0030: 0012 7cd8 4301 4af9 87c9 d2b4 8ba6 aedb
0x0040: 0572 1738 93db 789c 634b 4386 d013 db27
0x0050: 258b 6fa6 743c d429 a5e1 162f 2721 19bf
0x0060: 6669 a5c3 6bea 89ec a635 b8b4 8727 38c1
0x0070: 139f 5989 781b 49dd 79f5 4dfe 78ac ecb0
0x0080: 546c 33e0 0953 04bc 0647 a9d4 2fc4 cba0
0x0090: 44b2 3b01
incorrect packet (on the receiving machine):
19:03:21.133174 IP hydra.ssh > phoebe.58913: Flags [P.], seq 53712:53808,
ack 1, win 193, options [nop,nop,TS val 8677173 ecr 1211608], length 96
0x0000: 4510 0094 c7bf 4000 4006 f12d c0a8 8007
0x0010: c0a8 800e 0016 e621 2d64 84e6 1fc2 3f5b
0x0020: 8018 00c1 6aa4 0000 0101 080a 0084 6735
0x0030: 0012 7cd8 0000 0000 0000 0000 0010 0000
0x0040: 0000 0000 0000 0000 0000 0000 0000 0000
0x0050: 0000 0000 0000 0000 0000 00c0 dc92 4702
0x0060: 88ff ff00 0000 0000 0000 0000 0000 0000
0x0070: 0000 0000 0000 0000 0000 0000 0000 0000
0x0080: 0000 0000 0000 0000 0000 0000 0000 0000
0x0090: 0000 00e0
Scatter-gather evidently doesn't work: the header is correct, but the
packet body was likely read from random memory.
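The comparison can be made mechanical. The sketch below parses the two hex dumps (bytes copied from the captures above), lists the offsets at which they differ, and sorts each offset into IP header, TCP header, or payload using the header lengths found in the packet itself. The function names are illustrative only.

```python
SENT = """
0x0000: 4510 0094 c7bf 4000 4006 f12d c0a8 8007
0x0010: c0a8 800e 0016 e621 2d64 84e6 1fc2 3f5b
0x0020: 8018 00c1 81ed 0000 0101 080a 0084 6735
0x0030: 0012 7cd8 4301 4af9 87c9 d2b4 8ba6 aedb
0x0040: 0572 1738 93db 789c 634b 4386 d013 db27
0x0050: 258b 6fa6 743c d429 a5e1 162f 2721 19bf
0x0060: 6669 a5c3 6bea 89ec a635 b8b4 8727 38c1
0x0070: 139f 5989 781b 49dd 79f5 4dfe 78ac ecb0
0x0080: 546c 33e0 0953 04bc 0647 a9d4 2fc4 cba0
0x0090: 44b2 3b01
"""

RECEIVED = """
0x0000: 4510 0094 c7bf 4000 4006 f12d c0a8 8007
0x0010: c0a8 800e 0016 e621 2d64 84e6 1fc2 3f5b
0x0020: 8018 00c1 6aa4 0000 0101 080a 0084 6735
0x0030: 0012 7cd8 0000 0000 0000 0000 0010 0000
0x0040: 0000 0000 0000 0000 0000 0000 0000 0000
0x0050: 0000 0000 0000 0000 0000 00c0 dc92 4702
0x0060: 88ff ff00 0000 0000 0000 0000 0000 0000
0x0070: 0000 0000 0000 0000 0000 0000 0000 0000
0x0080: 0000 0000 0000 0000 0000 0000 0000 0000
0x0090: 0000 00e0
"""


def parse_dump(text):
    """Turn 'offset: hex hex ...' lines into bytes, dropping the offset column."""
    out = bytearray()
    for line in text.strip().splitlines():
        _, _, hexpart = line.partition(":")
        out += bytes.fromhex(hexpart.replace(" ", ""))
    return bytes(out)


def region(pkt, offset):
    """Classify an offset using the IPv4 IHL and the TCP data-offset fields."""
    ip_len = (pkt[0] & 0x0F) * 4
    tcp_len = (pkt[ip_len + 12] >> 4) * 4
    if offset < ip_len:
        return "ip header"
    if offset < ip_len + tcp_len:
        return "tcp header"
    return "payload"


def diff_offsets(a, b):
    """Return every byte offset at which the two packets differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


sent = parse_dump(SENT)
received = parse_dump(RECEIVED)
diffs = diff_offsets(sent, received)

by_region = {}
for off in diffs:
    by_region.setdefault(region(sent, off), []).append(off)

for name, offs in by_region.items():
    print(f"{name}: {len(offs)} bytes differ, first at 0x{offs[0]:04x}")
```

The only header bytes that differ are at 0x24 and 0x25, which is the TCP checksum field; every other difference lies in the payload, which starts at 0x34.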
I tried using the "clflush" instruction on the transmit descriptor and the
packet body to test whether it is a cache-coherency issue, but the
corruption was still there.
I tried limiting memory to 2G to test whether it was a problem with high
memory, but the corruption was still there.
I tried older kernels (as far back as 2.6.34); the corruption was still
there, but it took much longer to trigger with old kernels.
Do you have other reports of data corruption with skge hardware? Shouldn't
the driver set "scatter-gather" off by default because it is unreliable?
Mikulas
* Re: data corruption in skge hardware
From: Stephen Hemminger @ 2011-11-07 17:13 UTC
To: Mikulas Patocka; +Cc: Stephen Hemminger, netdev
On Mon, 7 Nov 2011 11:42:11 -0500 (EST)
Mikulas Patocka <mpatocka@redhat.com> wrote:
> Do you have other reports of data corruption with skge hardware? Shouldn't
> the driver set "scatter-gather" off by default because it is unreliable?
No reports of problems.
Scatter-gather is used all the time by normal TCP connections.
I suspect something different because of the IOMMU and separate sockets.
* Re: data corruption in skge hardware
From: Mikulas Patocka @ 2011-11-07 17:34 UTC
To: Stephen Hemminger; +Cc: Stephen Hemminger, netdev
On Mon, 7 Nov 2011, Stephen Hemminger wrote:
> On Mon, 7 Nov 2011 11:42:11 -0500 (EST)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> > Do you have other reports of data corruption with skge hardware? Shouldn't
> > the driver set "scatter-gather" off by default because it is unreliable?
>
> No reports of problems.
> Scatter-gather is used all the time by normal TCP connections.
> I suspect something different because of the IOMMU and separate sockets.
This card has 64-bit addressing, so it doesn't use the IOMMU. Or does it?
Anyway, when I booted with 2G of RAM, the IOMMU was disabled and the
corruption was still there.
Mikulas