* Re: Silent corruption on AMD64
       [not found]     ` <20070401043905.GV15189@vitelus.com>
@ 2007-04-01 13:58       ` Andi Kleen
  2007-04-02 17:46         ` NIC data corruption Rick Jones
  2007-04-04 18:45         ` Silent corruption with r8169 Francois Romieu
  0 siblings, 2 replies; 8+ messages in thread
From: Andi Kleen @ 2007-04-01 13:58 UTC
To: Aaron Lehmann; +Cc: Jim Paris, linux-kernel, linux-ide, netdev

Aaron Lehmann <aaronl@vitelus.com> writes:

[adding netdev]

[meta-comment: I wish people wouldn't use such unnecessarily broad subjects
-- how is it the x86-64 port's or AMD's fault when you have broken hardware?
Would anybody write "Silent corruption on i386" or "Silent corruption
on Intel" or "Silent corruption on Linux"?]

> On Sat, Mar 31, 2007 at 08:03:16PM -0700, Jim Paris wrote:
> > Since it shows up under heavy load that includes unrelated devices, I
> > think ruling out hardware problems is important. Some suggestions:
>
> I've been able to narrow it down to the Realtek Ethernet card. I can't
> reproduce the problem using onboard Ethernet, whereas the Realtek card
> causes trouble in any slot. However, I still don't know whether it's a
> hardware or software issue, or whether it's caused directly or
> indirectly by the Realtek card.

You could disable the hardware checksumming support in the card with the
appended patch. Then hopefully Linux will catch most corruptions (but
perhaps not all, because TCP checksums are not very strong).

You can then watch for failed checksums with netstat -s.

-Andi

Index: linux-2.6.21-rc3-net/drivers/net/r8169.c
===================================================================
--- linux-2.6.21-rc3-net.orig/drivers/net/r8169.c
+++ linux-2.6.21-rc3-net/drivers/net/r8169.c
@@ -2477,6 +2477,7 @@ static inline int rtl8169_fragmented_fra
 
 static inline void rtl8169_rx_csum(struct sk_buff *skb, struct RxDesc *desc)
 {
+#if 0
 	u32 opts1 = le32_to_cpu(desc->opts1);
 	u32 status = opts1 & RxProtoMask;
 
@@ -2485,6 +2486,7 @@ static inline void rtl8169_rx_csum(struc
 	    ((status == RxProtoIP) && !(opts1 & IPFail)))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 	else
+#endif
 		skb->ip_summed = CHECKSUM_NONE;
 }
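
The "watch failed checksums with netstat -s" step can also be scripted. What
follows is a minimal sketch, not part of the original mail: it assumes the
counters are read from /proc/net/snmp, which stores each protocol as a line
of counter names followed by a line of values. The function names are
hypothetical, and some counters (InCsumErrors in particular) are only
present on newer kernels, so missing names are skipped rather than assumed.

```python
from pathlib import Path


def parse_proc_net_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}."""
    stats = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The file comes in pairs: a line of counter names, then a line of
    # values, both prefixed with the same "Proto:" tag.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, nums = values.split(":", 1)
        if proto != value_proto:
            continue  # malformed pair; skip rather than guess
        stats[proto] = dict(zip(names.split(), (int(n) for n in nums.split())))
    return stats


def receive_error_counters(stats):
    """Select receive-side error counters that move when checksums fail."""
    # Assumption: these are the interesting names; absent ones are ignored.
    wanted = {
        "Ip": ("InHdrErrors",),
        "Tcp": ("InErrs", "InCsumErrors"),
        "Udp": ("InErrors", "InCsumErrors"),
    }
    return {
        f"{proto}.{name}": stats[proto][name]
        for proto, names in wanted.items()
        for name in names
        if proto in stats and name in stats[proto]
    }


if __name__ == "__main__":
    snmp = Path("/proc/net/snmp")
    if snmp.exists():
        print(receive_error_counters(parse_proc_net_snmp(snmp.read_text())))
```

Sampling these counters before and after a heavy transfer, with hardware
checksumming disabled, shows whether the software checksum path is rejecting
packets that the card had been accepting.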
* NIC data corruption
  2007-04-01 13:58 ` Silent corruption on AMD64 Andi Kleen
@ 2007-04-02 17:46   ` Rick Jones
  2007-04-02 18:07     ` Andi Kleen
  2007-04-04 18:45   ` Silent corruption with r8169 Francois Romieu
  1 sibling, 1 reply; 8+ messages in thread
From: Rick Jones @ 2007-04-02 17:46 UTC
To: Andi Kleen; +Cc: Aaron Lehmann, Jim Paris, netdev

I changed the title to be more accurate, and culled the distribution to
individuals and netdev

The mention of trying to turn-off CKO and see if the data corruption
goes away leads me to ask a possibly "delicate" question:

Should "Linux" only enable CKO on those NICs certified to have
ECC/parity throughout their _entire_ data path?

rick jones
* Re: NIC data corruption
  2007-04-02 17:46 ` NIC data corruption Rick Jones
@ 2007-04-02 18:07   ` Andi Kleen
  2007-04-04  5:00     ` Herbert Xu
  0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2007-04-02 18:07 UTC
To: Rick Jones; +Cc: Andi Kleen, Aaron Lehmann, Jim Paris, netdev

On Mon, Apr 02, 2007 at 10:46:00AM -0700, Rick Jones wrote:
> I changed the title to be more accurate, and culled the distribution to
> individuals and netdev
>
> The mention of trying to turn-off CKO and see if the data corruption
> goes away leads me to ask a possibly "delicate" question:
>
> Should "Linux" only enable CKO on those NICs certified to have
> ECC/parity throughout their _entire_ data path?

Even with reliable software checksumming you can have quite a lot of
undetected errors (there was an interesting study about this some years
ago). If you really care about your data you should use SSL or some
other protocol with strong checksums.

That said, it would probably make quite a lot of people unhappy, because
it would make their NICs much slower, and the occasional bit error that
is missed by the NIC is likely not a big issue for them.

What might be a good idea would be to have some optional knob somewhere
that allows disabling hardware checksumming for NICs that are not
trusted this way. But then someone would need to do the necessary
research for the hundreds of NICs Linux supports.

Just providing a general global "disable hardware checksumming"
knob for the paranoid would be much easier. I guess that would
be a good idea.

-Andi
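
The remark that software checksumming still misses errors follows from how
the TCP/IP checksum is built: it is a 16-bit one's-complement sum, so any
corruption that leaves the sum unchanged goes unnoticed. The sketch below is
an illustration under stated assumptions, not code from the thread: it
checksums a bare byte string in the style of RFC 1071 and ignores the TCP
pseudo-header, and the sample byte values are made up.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = bytes.fromhex("dead beef 0001 0002")
# Two aligned 16-bit words swapped: addition is commutative, same sum.
reordered = bytes.fromhex("beef dead 0001 0002")
# Compensating errors: one word is one higher, another is one lower.
compensated = bytes.fromhex("deae beee 0001 0002")
# A lone single-bit flip does change the sum and would be caught.
flipped = bytes.fromhex("dead beef 0001 0003")

if __name__ == "__main__":
    for name, buf in [("original", original), ("reordered", reordered),
                      ("compensated", compensated), ("flipped", flipped)]:
        print(f"{name:12s} {internet_checksum(buf):#06x}")
```

The first three buffers differ but share one checksum, which is the kind of
damage that only a stronger integrity check (for example the MAC used by
SSL) would detect.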
* Re: NIC data corruption
  2007-04-02 18:07 ` Andi Kleen
@ 2007-04-04  5:00   ` Herbert Xu
  0 siblings, 0 replies; 8+ messages in thread
From: Herbert Xu @ 2007-04-04 5:00 UTC
To: Andi Kleen; +Cc: rick.jones2, andi, aaronl, jim, netdev

Andi Kleen <andi@firstfloor.org> wrote:
>
> Just providing a general global "disable hardware checksumming"
> knob for the paranoid would be much easier. I guess that would
> be a good idea.

FWIW you can disable RX checksumming with

	ethtool -K <ifname> rx off

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
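
For a test rig that toggles the setting repeatedly, the ethtool call can be
wrapped in a small script. This is only a sketch: the helper names and the
interface name "eth0" are hypothetical, the commands need root, and it
assumes that `ethtool -k <ifname>` (lowercase) reports the current offload
state so the change can be confirmed.

```python
import subprocess


def rx_checksum_off_command(ifname):
    """argv for the command quoted in the mail: ethtool -K <ifname> rx off."""
    return ["ethtool", "-K", ifname, "rx", "off"]


def show_offload_command(ifname):
    """argv that asks ethtool to print the current offload settings."""
    return ["ethtool", "-k", ifname]


def disable_rx_checksum(ifname, run=subprocess.run):
    """Turn RX checksum offload off, then return ethtool's offload report.

    `run` is injectable so the logic can be exercised without root or
    without a real NIC; by default it is subprocess.run.
    """
    run(rx_checksum_off_command(ifname), check=True)
    report = run(show_offload_command(ifname), check=True,
                 capture_output=True, text=True)
    return report.stdout


if __name__ == "__main__":
    # "eth0" is a placeholder; substitute the interface under test.
    print(" ".join(rx_checksum_off_command("eth0")))
```

Keeping the argv builders separate from the call that executes them makes it
easy to log exactly what was run alongside the corruption test results.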
* Re: Silent corruption with r8169
  2007-04-01 13:58 ` Silent corruption on AMD64 Andi Kleen
  2007-04-02 17:46   ` NIC data corruption Rick Jones
@ 2007-04-04 18:45   ` Francois Romieu
  2007-04-04 20:06     ` Aaron Lehmann
  1 sibling, 1 reply; 8+ messages in thread
From: Francois Romieu @ 2007-04-04 18:45 UTC
To: Andi Kleen; +Cc: Aaron Lehmann, Jim Paris, linux-kernel, linux-ide, netdev

Andi Kleen <andi@firstfloor.org> :
> Aaron Lehmann <aaronl@vitelus.com> writes:
>
> [adding netdev]
> [meta-comment: I wish people wouldn't use such unnecessarily broad subjects
> -- how is it the x86-64 port's or AMD's fault when you have broken hardware?
> Would anybody write "Silent corruption on i386" or "Silent corruption
> on Intel" or "Silent corruption on Linux"?]

I hope you feel better now that I changed the subject.

Aaron, I see no clear suspect between 2.6.20.1 and current -git
that could explain or fix a corruption in the r8169 driver.

Can you apply on top of latest 2.6.21-rc5-git the patches available at
http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402

000[12]-r8169-foo-bar.patch were committed a few minutes ago: you
should check whether they apply or not.

netconsole appears to be compiled as a module. Is it used?

-- 
Ueimor
* Re: Silent corruption with r8169
  2007-04-04 18:45 ` Silent corruption with r8169 Francois Romieu
@ 2007-04-04 20:06   ` Aaron Lehmann
  2007-04-04 20:45     ` Francois Romieu
  2007-04-05 11:41     ` Andi Kleen
  0 siblings, 2 replies; 8+ messages in thread
From: Aaron Lehmann @ 2007-04-04 20:06 UTC
To: Francois Romieu; +Cc: Andi Kleen, Jim Paris, linux-kernel, linux-ide, netdev

On Wed, Apr 04, 2007 at 08:45:04PM +0200, Francois Romieu wrote:
> > [adding netdev]
> > [meta-comment: I wish people wouldn't use such unnecessarily broad subjects
> > -- how is it the x86-64 port's or AMD's fault when you have broken hardware?
> > Would anybody write "Silent corruption on i386" or "Silent corruption
> > on Intel" or "Silent corruption on Linux"?]
>
> I hope you feel better now that I changed the subject.
>
> Aaron, I see no clear suspect between 2.6.20.1 and current -git
> that could explain nor fix a corruption in the r8169 driver.
>
> Can you apply on top of latest 2.6.21-rc5-git the patches available at
> http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402
>
> 000[12]-r8169-foo-bar.patch have been committed a few minutes ago: you
> should check if they apply or not.

I'll try to get to testing this, but I'm wondering if people may have
misunderstood my original post. I don't get any corruption over
Ethernet; it's just corruption on the filesystem during certain load
patterns that involve the Realtek ethernet card.

Aaron
* Re: Silent corruption with r8169
  2007-04-04 20:06 ` Aaron Lehmann
@ 2007-04-04 20:45   ` Francois Romieu
  2007-04-05 11:41   ` Andi Kleen
  1 sibling, 0 replies; 8+ messages in thread
From: Francois Romieu @ 2007-04-04 20:45 UTC
To: Aaron Lehmann; +Cc: Andi Kleen, Jim Paris, linux-kernel, linux-ide, netdev

Aaron Lehmann <aaronl@vitelus.com> :
[...]
> I'll try to get to testing this, but I'm wondering if people may have
> misunderstood my original post. I don't get any corruption over
> Ethernet; it's just corruption on the filesystem during certain load
> patterns that involve the Realtek ethernet card.

It is too soon to label it a debug feature or a genuine bug in the r8169
driver but at least there is a r8169 patchkit to help with various
annoyances (obscure bug on amd platform for instance). It could make a
difference.

The disk io + r8169 driver bugs can be very frustrating to debug :o/

-- 
Ueimor

Anybody got a battery for my Ultra 10 ?
* Re: Silent corruption with r8169
  2007-04-04 20:06 ` Aaron Lehmann
  2007-04-04 20:45   ` Francois Romieu
@ 2007-04-05 11:41   ` Andi Kleen
  1 sibling, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2007-04-05 11:41 UTC
To: Aaron Lehmann
Cc: Francois Romieu, Andi Kleen, Jim Paris, linux-kernel, linux-ide, netdev

> I'll try to get to testing this, but I'm wondering if people may have
> misunderstood my original post. I don't get any corruption over
> Ethernet; it's just corruption on the filesystem during certain load
> patterns that involve the Realtek ethernet card.

If disabling hardware checksums helps, then you know the corruption
is on the Ethernet side. Otherwise it's somewhere else.

-Andi
Thread overview: 8+ messages
[not found] <20070401012736.GT15189@vitelus.com>
[not found] ` <20070401030315.GA24080@jim.sh>
[not found]   ` <20070401043905.GV15189@vitelus.com>
2007-04-01 13:58     ` Silent corruption on AMD64 Andi Kleen
2007-04-02 17:46       ` NIC data corruption Rick Jones
2007-04-02 18:07         ` Andi Kleen
2007-04-04  5:00           ` Herbert Xu
2007-04-04 18:45       ` Silent corruption with r8169 Francois Romieu
2007-04-04 20:06         ` Aaron Lehmann
2007-04-04 20:45           ` Francois Romieu
2007-04-05 11:41           ` Andi Kleen