* Re: Silent corruption on AMD64
[not found] ` <20070401043905.GV15189@vitelus.com>
@ 2007-04-01 13:58 ` Andi Kleen
2007-04-02 17:46 ` NIC data corruption Rick Jones
2007-04-04 18:45 ` Silent corruption with r8169 Francois Romieu
0 siblings, 2 replies; 8+ messages in thread
From: Andi Kleen @ 2007-04-01 13:58 UTC
To: Aaron Lehmann; +Cc: Jim Paris, linux-kernel, linux-ide, netdev
Aaron Lehmann <aaronl@vitelus.com> writes:
[adding netdev]
[meta-comment: I wish people wouldn't use such unnecessarily broad subjects
-- how is it the x86-64 port's or AMD's fault when you have broken hardware?
Would anybody write "Silent corruption on i386" or "Silent corruption
on Intel" or "Silent corruption on Linux"?]
> On Sat, Mar 31, 2007 at 08:03:16PM -0700, Jim Paris wrote:
> > Since it shows up under heavy load that includes unrelated devices, I
> > think ruling out hardware problems is important. Some suggestions:
>
> I've been able to narrow it down to the Realtek Ethernet card. I can't
> reproduce the problem using onboard Ethernet, whereas the Realtek card
> causes trouble in any slot. However, I still don't know whether it's a
> hardware or software issue, or whether it's caused directly or
> indirectly by the Realtek card.
You could disable the hardware checksumming support in the card with
the appended patch. Then hopefully Linux will catch most corruptions
(but perhaps not all, because TCP checksums are not very strong).
You can then watch for failed checksums with netstat -s.
-Andi
Index: linux-2.6.21-rc3-net/drivers/net/r8169.c
===================================================================
--- linux-2.6.21-rc3-net.orig/drivers/net/r8169.c
+++ linux-2.6.21-rc3-net/drivers/net/r8169.c
@@ -2477,6 +2477,7 @@ static inline int rtl8169_fragmented_fra
 
 static inline void rtl8169_rx_csum(struct sk_buff *skb, struct RxDesc *desc)
 {
+#if 0
 	u32 opts1 = le32_to_cpu(desc->opts1);
 	u32 status = opts1 & RxProtoMask;
 
@@ -2485,6 +2486,7 @@ static inline void rtl8169_rx_csum(struc
 	    ((status == RxProtoIP) && !(opts1 & IPFail)))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 	else
+#endif
 		skb->ip_summed = CHECKSUM_NONE;
 }
 
* NIC data corruption
2007-04-01 13:58 ` Silent corruption on AMD64 Andi Kleen
@ 2007-04-02 17:46 ` Rick Jones
2007-04-02 18:07 ` Andi Kleen
2007-04-04 18:45 ` Silent corruption with r8169 Francois Romieu
1 sibling, 1 reply; 8+ messages in thread
From: Rick Jones @ 2007-04-02 17:46 UTC
To: Andi Kleen; +Cc: Aaron Lehmann, Jim Paris, netdev
I changed the title to be more accurate, and culled the distribution to
individuals and netdev.
The mention of turning off CKO (checksum offload) to see if the data
corruption goes away leads me to ask a possibly "delicate" question:
Should "Linux" only enable CKO on those NICs certified to have
ECC/parity throughout their _entire_ data path?
rick jones
* Re: NIC data corruption
2007-04-02 17:46 ` NIC data corruption Rick Jones
@ 2007-04-02 18:07 ` Andi Kleen
2007-04-04 5:00 ` Herbert Xu
0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2007-04-02 18:07 UTC
To: Rick Jones; +Cc: Andi Kleen, Aaron Lehmann, Jim Paris, netdev
On Mon, Apr 02, 2007 at 10:46:00AM -0700, Rick Jones wrote:
> I changed the title to be more accurate, and culled the distribution to
> individuals and netdev
>
> The mention of trying to turn-off CKO and see if the data corruption
> goes away leads me to ask a possibly "delicate" question:
>
> Should "Linux" only enable CKO on those NICs certified to have
> ECC/parity throughout their _entire_ data path?
Even with reliable software checksumming you can have quite a lot of undetected
errors (there was an interesting study about this some years ago).
If you really care about your data you should use SSL or some
other protocol with strong checksums.
That said, restricting CKO that way would probably make quite a lot of
people unhappy because it would make their NICs much slower, and the occasional
bit error that is missed by the NIC is likely not a big issue for them.
What might be a good idea would be to have some optional knob
somewhere that allows disabling hardware checksumming for NICs
that are not trusted this way. But then someone would need to
do the necessary research for the hundreds of NICs Linux supports.
Just providing a general global "disable hardware checksumming"
knob for the paranoid would be much easier. I guess that would
be a good idea.
-Andi
* Re: NIC data corruption
2007-04-02 18:07 ` Andi Kleen
@ 2007-04-04 5:00 ` Herbert Xu
0 siblings, 0 replies; 8+ messages in thread
From: Herbert Xu @ 2007-04-04 5:00 UTC
To: Andi Kleen; +Cc: rick.jones2, andi, aaronl, jim, netdev
Andi Kleen <andi@firstfloor.org> wrote:
>
> Just providing a general global "disable hardware checksumming"
> knob for the paranoid would be much easier. I guess that would
> be a good idea.
FWIW you can disable RX checksumming with
ethtool -K <ifname> rx off
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: Silent corruption with r8169
2007-04-01 13:58 ` Silent corruption on AMD64 Andi Kleen
2007-04-02 17:46 ` NIC data corruption Rick Jones
@ 2007-04-04 18:45 ` Francois Romieu
2007-04-04 20:06 ` Aaron Lehmann
1 sibling, 1 reply; 8+ messages in thread
From: Francois Romieu @ 2007-04-04 18:45 UTC
To: Andi Kleen; +Cc: Aaron Lehmann, Jim Paris, linux-kernel, linux-ide, netdev
Andi Kleen <andi@firstfloor.org> :
> Aaron Lehmann <aaronl@vitelus.com> writes:
>
> [adding netdev]
> [meta-comment: I wish people wouldn't use such unnecessarily broad subjects
> -- how is it the x86-64 port's or AMD's fault when you have broken hardware?
> Would anybody write "Silent corruption on i386" or "Silent corruption
> on Intel" or "Silent corruption on Linux"?]
I hope you feel better now that I changed the subject.
Aaron, I see no clear suspect between 2.6.20.1 and current -git
that could explain or fix a corruption in the r8169 driver.
Can you apply on top of latest 2.6.21-rc5-git the patches available at
http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402
000[12]-r8169-foo-bar.patch were committed a few minutes ago: you
should check whether they still apply.
netconsole appears to be compiled as a module. Is it used?
--
Ueimor
* Re: Silent corruption with r8169
2007-04-04 18:45 ` Silent corruption with r8169 Francois Romieu
@ 2007-04-04 20:06 ` Aaron Lehmann
2007-04-04 20:45 ` Francois Romieu
2007-04-05 11:41 ` Andi Kleen
0 siblings, 2 replies; 8+ messages in thread
From: Aaron Lehmann @ 2007-04-04 20:06 UTC
To: Francois Romieu; +Cc: Andi Kleen, Jim Paris, linux-kernel, linux-ide, netdev
On Wed, Apr 04, 2007 at 08:45:04PM +0200, Francois Romieu wrote:
> > [adding netdev]
> > [meta-comment: I wish people wouldn't use such unnecessarily broad subjects
> > -- how is it the x86-64 port's or AMD's fault when you have broken hardware?
> > Would anybody write "Silent corruption on i386" or "Silent corruption
> > on Intel" or "Silent corruption on Linux"?]
>
> I hope you feel better now that I changed the subject.
>
> Aaron, I see no clear suspect between 2.6.20.1 and current -git
> that could explain nor fix a corruption in the r8169 driver.
>
> Can you apply on top of latest 2.6.21-rc5-git the patches available at
> http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402
>
> 000[12]-r8169-foo-bar.patch have been committed a few minutes ago: you
> should check if they apply or not.
I'll try to get to testing this, but I'm wondering if people may have
misunderstood my original post. I don't get any corruption over
Ethernet; it's just corruption on the filesystem during certain load
patterns that involve the Realtek ethernet card.
Aaron
* Re: Silent corruption with r8169
2007-04-04 20:06 ` Aaron Lehmann
@ 2007-04-04 20:45 ` Francois Romieu
2007-04-05 11:41 ` Andi Kleen
1 sibling, 0 replies; 8+ messages in thread
From: Francois Romieu @ 2007-04-04 20:45 UTC
To: Aaron Lehmann; +Cc: Andi Kleen, Jim Paris, linux-kernel, linux-ide, netdev
Aaron Lehmann <aaronl@vitelus.com> :
[...]
> I'll try to get to testing this, but I'm wondering if people may have
> misunderstood my original post. I don't get any corruption over
> Ethernet; it's just corruption on the filesystem during certain load
> patterns that involve the Realtek ethernet card.
It is too soon to label it a debug feature or a genuine bug in the
r8169 driver, but at least there is an r8169 patchkit to help with
various annoyances (an obscure bug on AMD platforms, for instance).
It could make a difference.
The disk I/O + r8169 driver bugs can be very frustrating to debug :o/
--
Ueimor
Anybody got a battery for my Ultra 10 ?
* Re: Silent corruption with r8169
2007-04-04 20:06 ` Aaron Lehmann
2007-04-04 20:45 ` Francois Romieu
@ 2007-04-05 11:41 ` Andi Kleen
1 sibling, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2007-04-05 11:41 UTC
To: Aaron Lehmann
Cc: Francois Romieu, Andi Kleen, Jim Paris, linux-kernel, linux-ide,
netdev
> I'll try to get to testing this, but I'm wondering if people may have
> misunderstood my original post. I don't get any corruption over
> Ethernet; it's just corruption on the filesystem during certain load
> patterns that involve the Realtek ethernet card.
If disabling hardware checksums helps, then you know the corruption
is on the Ethernet side. Otherwise it's somewhere else.
-Andi