* Re: PROBLEM: Silent data corruption when using sendfile()
From: Hillf Danton @ 2012-07-14  8:04 UTC
To: Johannes Truschnigg
Cc: linux-kernel, Eric Dumazet, Willy Tarreau, Linux-Netdev

On Sat, Jul 14, 2012 at 1:18 AM, Johannes Truschnigg
<johannes@truschnigg.info> wrote:
> Hello good people of linux-kernel.
>
> I've been bothered by silent data corruption from my personal fileserver - no
> matter the Layer 7 protocol used, huge transfers sporadically ended up damaged
> in-flight. I used Samba/CIFS, NFS(v4, via TCP), Apache httpd 2.2, thttpd,
> python and netcat to verify this.
>
> I think I managed to track down the culprit: as soon as I disable sendfile()
> for all programs that support such a configuration (netcat, afaik, won't ever
> use sendfile() to transmit data over a socket, so the problem was never
> reproducible there in the first place), everything reverts to perfect and
> proper working condition.
>
> I've been experiencing this problem with vanilla kernel releases from the 3.3
> up until 3.4.0 series. I do not know if it also occurs with earlier releases,
> but I can verify if that is useful. I set up the environment for a minimal
> kind of testcase (a large ISO image file available from the server's local
> filesystem, as well as from a mounted NFS export - once via lo, and once via
> br0/eth0), and proceeded to do the following:
>
> i=0; for i in {1..100}
> do
>   echo "pass $i:"; sync; echo 3 > /proc/sys/vm/drop_caches
>   cmp -b /mnt/nfs-test/lo/tmp/X15-65741.iso /srv/files/pub/tmp/X15-65741.iso
> done
>
> I then rotated the source of the data, and tested the network-mount against
> the loopback-mount, as well as the network-mount against the local filesystem.
>
> Computing the file's md5sum in a loop whilst dropping caches after each
> iteration by reading it directly from its location in the filesystem produces
> the very same hash every time - I therefore think it's safe to assume the
> corruption is introduced when traversing the networking stack. The hash also
> does not change if I repeatedly compute the md5sum of the file as transferred
> by, e. g., Apache httpd or smbd with sendfile explicitly disabled.
>
> Please take a look at the attachment to see the actual output of the above
> script. It does not matter if I do an actual transfer over the network from my
> server to one of its clients (I verified the problem with two different client
> machines, one even running Windows), or if the server is both source and
> destination of the transfer - as long as sendfile is involved, some of the data
> will always become garbled sooner or later. That also leads me to believe that
> my internetworking devices (my switch in particular) are working just fine;
> testing bulky transfers from one host to another confirms this insofar as
> all data makes it through unscathed.
>
> As soon as I switch off sendfile-support (in, e. g. Samba or Apache httpd), I
> can run a series of thousands and more transfers, and not experience any
> corruption at all. Whenever the data gets fubared, there is no hint at
> anything fishy going on in the debug ringbuffer - corruption takes place in
> total silence.
>
> The system in question has an Intel Pro/1000 PCI-e NIC for doing the networked
> file transfers, and is backed by a md RAID5-Array with LVM2 on top. The 4GB of
> system memory (ECC-enabled UDIMM) are operating in S4ECD4ED mode as reported
> by EDAC, and there are no reported errors. The CPU I have installed is an AMD
> Athlon II X2 245e on an ASUS M4A88TD-M/USB3 Motherboard. It's running Gentoo
> for amd64. The box can run prime96 in torture mode and linpack just fine for
> days - I'm therefore assuming the hardware to be working correctly.
>
> I have attached my kernel's config (from 3.4.0, as that's the image that I
> have running right now) for sake of completeness, as well as some
> information for you to see how I tested, and what these tests actually
> produced. If you need any other information to help track this down, please
> let me know.
>
> If you decide to answer please keep me CC'd, as I'm not subscribed to this
> list.
>
> Just in case the numerous attachments get scrubbed/removed, I've also uploaded
> them to http://johannes.truschnigg.info/tmp/sendfile_data_corruption/
>
> Thanks for reading, and have a nice weekend everyone :)
>

Is the above corruption related to the one below?

On Tue, Jul 3, 2012 at 8:02 AM, Willy Tarreau <w@1wt.eu> wrote:
>
> In fact it has been true zero copy in 2.6.25 until we faced a large
> amount of data corruption and the zero copy was disabled in 2.6.25.X.
> Since then it remained that way until you brought your patches to
> re-instantiate it.
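The compare loop quoted above can be wrapped so that it reports a mismatch count instead of relying on someone reading the `cmp` output. What follows is only a minimal sketch under stated assumptions: the function name `compare_passes` and the `DROP_CACHES` switch are illustrative and do not come from the thread, and dropping the page cache is opt-in because it needs root.

```shell
# Sketch only: compare_passes and DROP_CACHES are hypothetical names.
# Compares a reference file against a copy read through another path
# (for example an NFS mount of the same file) for a number of passes and
# prints how many passes saw a mismatch.
compare_passes() (
    ref=$1
    copy=$2
    passes=${3:-100}
    bad=0
    i=1
    while [ "$i" -le "$passes" ]; do
        # Optional: flush dirty data and drop the page cache so that every
        # pass rereads the data (requires root).
        if [ "${DROP_CACHES:-0}" = 1 ]; then
            sync
            echo 3 > /proc/sys/vm/drop_caches
        fi
        # cmp -s prints nothing and reports only through its exit status.
        if ! cmp -s "$ref" "$copy"; then
            bad=$((bad + 1))
            echo "pass $i: MISMATCH" >&2
        fi
        i=$((i + 1))
    done
    echo "$bad"
)
```

With the paths from the report, a run might look like `DROP_CACHES=1 compare_passes /srv/files/pub/tmp/X15-65741.iso /mnt/nfs-test/lo/tmp/X15-65741.iso 100`. Any non-zero count means at least one pass read different bytes.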
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Eric Dumazet @ 2012-07-14  8:20 UTC
To: Hillf Danton
Cc: Johannes Truschnigg, linux-kernel, Willy Tarreau, Linux-Netdev

On Sat, 2012-07-14 at 16:04 +0800, Hillf Danton wrote:
> On Sat, Jul 14, 2012 at 1:18 AM, Johannes Truschnigg
> <johannes@truschnigg.info> wrote:
> > [...]
>
> Is the above corruption related to the one below?
>
> On Tue, Jul 3, 2012 at 8:02 AM, Willy Tarreau <w@1wt.eu> wrote:
> >
> > In fact it has been true zero copy in 2.6.25 until we faced a large
> > amount of data corruption and the zero copy was disabled in 2.6.25.X.
> > Since then it remained that way until you brought your patches to
> > re-instantiate it.

Might be, or not (could be a NIC bug)

Please Johannes could you try latest kernel tree ?
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Willy Tarreau @ 2012-07-14  8:31 UTC
To: Eric Dumazet
Cc: Hillf Danton, Johannes Truschnigg, linux-kernel, Linux-Netdev

On Sat, Jul 14, 2012 at 10:20:41AM +0200, Eric Dumazet wrote:
> > On Tue, Jul 3, 2012 at 8:02 AM, Willy Tarreau <w@1wt.eu> wrote:
> > >
> > > In fact it has been true zero copy in 2.6.25 until we faced a large
> > > amount of data corruption and the zero copy was disabled in 2.6.25.X.
> > > Since then it remained that way until you brought your patches to
> > > re-instantiate it.
>
> Might be, or not (could be a NIC bug)

I may be wrong but what I recall from this bug was an issue when forwarding
TCP between two NICs, related to linear vs non-linear data (I have memories
of something around data not yet ACKed being replaced before being
retransmitted but I may be wrong). Anyway, the way it was fixed consisted
in simply disabling the zero-copy code path. So this should be something
different from what Johannes reports. Maybe a regression since then though.

> Please Johannes could you try latest kernel tree ?

It would be useful, especially given the amount of changes you performed
in this area in latest version, it could be very possible that this new
bug got fixed as a side effect !

Regards,
Willy
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Johannes Truschnigg @ 2012-07-14 10:13 UTC
To: Willy Tarreau
Cc: Eric Dumazet, Hillf Danton, linux-kernel, Linux-Netdev

On Sat, Jul 14, 2012 at 10:31:36AM +0200, Willy Tarreau wrote:
> > Please Johannes could you try latest kernel tree ?
>
> It would be useful, especially given the amount of changes you performed
> in this area in latest version, it could be very possible that this new
> bug got fixed as a side effect !

I upgraded to 3.4.4 (identical config as the 3.4.0 build I've been running)
and what can I say - the problem really seems to have disappeared. I performed
about 3700 iterations of my previous tests over the night, and the data always
turned out to be OK, not a single byte turned out kaput!

I wish I would have tested that earlier, and spared you the noise... well,
maybe someone who runs into a similar problem in the future will have this
discovery save her/him some time and headaches and make her/him just upgrade
kernels :)

Thanks a lot for your polite and quick responses!

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www: http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp: johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Eric Dumazet @ 2012-07-14 10:33 UTC
To: Johannes Truschnigg
Cc: Willy Tarreau, Hillf Danton, linux-kernel, Linux-Netdev

On Sat, 2012-07-14 at 12:13 +0200, Johannes Truschnigg wrote:
> On Sat, Jul 14, 2012 at 10:31:36AM +0200, Willy Tarreau wrote:
> > [...]
>
> I upgraded to 3.4.4 (identical config as the 3.4.0 build I've been running)
> and what can I say - the problem really seems to have disappeared. I performed
> about 3700 iterations of my previous tests over the night, and the data always
> turned out to be OK, not a single byte turned out kaput!
>
> I wish I would have tested that earlier, and spared you the noise... well,
> maybe someone who runs into a similar problem in the future will have this
> discovery save her/him some time and headaches and make her/him just upgrade
> kernels :)
>
> Thanks a lot for your polite and quick responses!
>

Nice to hear. Now we should make sure we have all needed fixes for prior
stable kernels as well !

Still trying to understand the issue, since I thought I only did
optimizations, not bug fixes. So maybe real bug is still there but its
probability of occurrence lowered enough to not hit your workload.

Hmmm...
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Willy Tarreau @ 2012-07-14 10:44 UTC
To: Eric Dumazet
Cc: Johannes Truschnigg, Hillf Danton, linux-kernel, Linux-Netdev

On Sat, Jul 14, 2012 at 12:33:24PM +0200, Eric Dumazet wrote:
> On Sat, 2012-07-14 at 12:13 +0200, Johannes Truschnigg wrote:
> > [...]
>
> Nice to hear. Now we should make sure we have all needed fixes for prior
> stable kernels as well !
>
> Still trying to understand the issue, since I thought I only did
> optimizations, not bug fixes. So maybe real bug is still there but its
> probability of occurrence lowered enough to not hit your workload.

Please note that Johannes tested 3.4.4 while your changes are in 3.5-rc.

I'm wondering whether this patch merged into 3.4.2 one has an impact on
sendfile :

  commit b642cb6a143da812f188307c2661c0357776a9d0
  Author: Konstantin Khlebnikov <khlebnikov@openvz.org>
  Date:   Tue Jun 5 21:36:33 2012 +0400

    radix-tree: fix contiguous iterator

    commit fffaee365fded09f9ebf2db19066065fa54323c3 upstream.

    This patch fixes bug in macro radix_tree_for_each_contig().

    If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following
    radix_tree_next_chunk() switches iterating into next chunk. As result iterating
    becomes non-contiguous and breaks vfs "splice" and all its users.

Willy
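Whether a given stable release already carries the backport named above can be checked from a local clone of the stable tree. The helper below is a rough sketch, not something used in the thread: the name `tags_containing` and the repository path in the usage comment are placeholders, and it assumes the clone has the release tags fetched.

```shell
# Sketch only: tags_containing is a hypothetical helper name.
# Prints every tag whose history contains the given commit.
tags_containing() (
    repo=$1
    commit=$2
    git -C "$repo" tag --contains "$commit"
)

# Example (the repository path is an assumption):
#   tags_containing ~/src/linux-stable b642cb6a143da812f188307c2661c0357776a9d0
```

A string of the form `v3.4.1-66-gb642cb6` is what `git describe` prints for a commit: the nearest preceding tag, the number of commits since that tag, and the abbreviated hash.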
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Eric Dumazet @ 2012-07-14 11:06 UTC
To: Willy Tarreau
Cc: Johannes Truschnigg, Hillf Danton, linux-kernel, Linux-Netdev

On Sat, 2012-07-14 at 12:44 +0200, Willy Tarreau wrote:
> On Sat, Jul 14, 2012 at 12:33:24PM +0200, Eric Dumazet wrote:
> > [...]
>
> Please note that Johannes tested 3.4.4 while your changes are in 3.5-rc.
>
> I'm wondering whether this patch merged into 3.4.2 one has an impact on
> sendfile :
>
>   commit b642cb6a143da812f188307c2661c0357776a9d0
>   Author: Konstantin Khlebnikov <khlebnikov@openvz.org>
>   Date:   Tue Jun 5 21:36:33 2012 +0400
>
>     radix-tree: fix contiguous iterator
>
>     commit fffaee365fded09f9ebf2db19066065fa54323c3 upstream.
>
>     This patch fixes bug in macro radix_tree_for_each_contig().
>
>     If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following
>     radix_tree_next_chunk() switches iterating into next chunk. As result iterating
>     becomes non-contiguous and breaks vfs "splice" and all its users.
>
> Willy
>

Hmmm, this is supposed to fix a bug introduced in 3.4, no ?

So 3.3 kernel should work well ?
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Willy Tarreau @ 2012-07-14 13:15 UTC
To: Eric Dumazet
Cc: Johannes Truschnigg, Hillf Danton, linux-kernel, Linux-Netdev

On Sat, Jul 14, 2012 at 01:06:07PM +0200, Eric Dumazet wrote:
> On Sat, 2012-07-14 at 12:44 +0200, Willy Tarreau wrote:
> > [...]
>
> Hmmm, this is supposed to fix a bug introduced in 3.4, no ?
>
> So 3.3 kernel should work well ?

You're right indeed. So maybe it's not the same bug. Or maybe Johannes
was affected by two different bugs in both versions, since Thorsten's
report seems to point the finger at the same bug.

Johannes, are you certain that you were having the exact same issue
with 3.3 ?

Willy
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Johannes Truschnigg @ 2012-07-14 17:09 UTC
To: Willy Tarreau
Cc: Eric Dumazet, Hillf Danton, linux-kernel, Linux-Netdev

On Sat, Jul 14, 2012 at 03:15:40PM +0200, Willy Tarreau wrote:
> On Sat, Jul 14, 2012 at 01:06:07PM +0200, Eric Dumazet wrote:
> [...]
> > Hmmm, this is supposed to fix a bug introduced in 3.4, no ?
> >
> > So 3.3 kernel should work well ?
>
> You're right indeed. So maybe it's not the same bug. Or maybe Johannes
> was affected by two different bugs in both versions, since Thorsten's
> report seems to point the finger at the same bug.
>
> Johannes, are you certain that you were having the exact same issue
> with 3.3 ?

I still have the Linux 3.3-series kernel image around that I _think_ I first
saw the problem occur with. I'll try to reproduce the problem with that
kernel, but I cannot promise that results will be ready before Tuesday.

I'll keep you posted!

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www: http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp: johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Thorsten Kranzkowski @ 2012-07-14 11:44 UTC
To: Eric Dumazet
Cc: Johannes Truschnigg, Willy Tarreau, Hillf Danton, linux-kernel, Linux-Netdev

On Sat, Jul 14, 2012 at 12:33:24PM +0200, Eric Dumazet wrote:
> On Sat, 2012-07-14 at 12:13 +0200, Johannes Truschnigg wrote:
> > [...]
>
> Nice to hear. Now we should make sure we have all needed fixes for prior
> stable kernels as well !
>
> Still trying to understand the issue, since I thought I only did
> optimizations, not bug fixes. So maybe real bug is still there but its
> probability of occurrence lowered enough to not hit your workload.
>
> Hmmm...
>

Not sure if this is related, but I had a similar data corruption problem:

Reading data from filesystem 'normally' (including through nfs) showed
corruption at random places, mostly 0xff turning into 0xfe.
Reading with O_DIRECT (I used 'dd iflag=direct') was OK.

I found my problem to be fixed by
fffaee365fded09f9ebf2db19066065fa54323c3 (upstream)
which was backported as
b642cb6a143da812f188307c2661c0357776a9d0 (stable, v3.4.1-66-gb642cb6)

Bye,
Thorsten

--
| Thorsten Kranzkowski        Internet: dl8bcu@dl8bcu.de                      |
| Mobile: ++49 170 1876134       Snail: Kiebitzstr. 14, 49324 Melle, Germany  |
| Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8bcu@marvin.dl8bcu.ampr.org [44.130.8.19] |
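The buffered versus `dd iflag=direct` comparison described above can be scripted. This is a hedged sketch rather than the procedure Thorsten actually ran: the function name `check_direct_vs_buffered` and its exit-status convention are assumptions, and it relies on GNU `dd` and `md5sum`.

```shell
# Sketch only: check_direct_vs_buffered is a hypothetical helper name.
# Exit status: 0 = checksums equal, 1 = checksums differ,
#              2 = an O_DIRECT read is not possible on this file.
check_direct_vs_buffered() (
    file=$1
    # Probe first: some filesystems (or dd builds) reject O_DIRECT reads.
    dd if="$file" iflag=direct bs=1M count=1 of=/dev/null 2>/dev/null || exit 2
    # Buffered read: goes through the page cache.
    buffered=$(md5sum < "$file" | cut -d' ' -f1)
    # Direct read: bypasses the page cache.
    direct=$(dd if="$file" iflag=direct bs=1M 2>/dev/null | md5sum | cut -d' ' -f1)
    [ "$buffered" = "$direct" ]
)
```

A status of 1 on a file that is not being written to would suggest, as in the report above, that the two read paths disagree.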
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Hillf Danton @ 2012-07-14 14:08 UTC
To: Eric Dumazet
Cc: Johannes Truschnigg, linux-kernel, Willy Tarreau, Linux-Netdev

On Sat, Jul 14, 2012 at 4:20 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Might be, or not (could be a NIC bug)
>
Dunno why sendfile sits in the layer of NIC and
how they interact.
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Eric Dumazet @ 2012-07-14 14:19 UTC
To: Hillf Danton
Cc: Johannes Truschnigg, linux-kernel, Willy Tarreau, Linux-Netdev

On Sat, 2012-07-14 at 22:08 +0800, Hillf Danton wrote:
> On Sat, Jul 14, 2012 at 4:20 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Might be, or not (could be a NIC bug)
> >
> Dunno why sendfile sits in the layer of NIC and
> how they interact.

sendfile() relies heavily on TSO capabilities, a buggy NIC could
corrupt frame content on some obscure occasions.

We had some known cases on IPv6 for example.
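To test the NIC hypothesis raised here, one could look at whether TSO is enabled on the interface and temporarily switch it off before rerunning the transfers. The snippet is a small sketch under assumptions: `tso_state` is an illustrative helper name, `eth0` is taken from the original report, and the helper only parses the text that `ethtool -k` prints.

```shell
# Sketch only: tso_state is a hypothetical helper name.
# Reads `ethtool -k <iface>` output on stdin and prints the state
# ("on" or "off") of the tcp-segmentation-offload feature.
tso_state() {
    awk '$1 == "tcp-segmentation-offload:" { print $2; exit }'
}

# Usage (interface name is an assumption; changing offloads needs root):
#   ethtool -k eth0 | tso_state
#   ethtool -K eth0 tso off
```

If corruption persists with TSO off, segmentation offload in the NIC is a less likely cause, although that alone does not rule out other hardware faults.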
* Re: PROBLEM: Silent data corruption when using sendfile()
From: Willy Tarreau @ 2012-07-14 14:56 UTC
To: Eric Dumazet
Cc: Hillf Danton, Johannes Truschnigg, linux-kernel, Linux-Netdev

On Sat, Jul 14, 2012 at 04:19:00PM +0200, Eric Dumazet wrote:
> On Sat, 2012-07-14 at 22:08 +0800, Hillf Danton wrote:
> > On Sat, Jul 14, 2012 at 4:20 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >
> > > Might be, or not (could be a NIC bug)
> > >
> > Dunno why sendfile sits in the layer of NIC and
> > how they interact.
>
> sendfile() relies heavily on TSO capabilities, a buggy NIC could
> corrupt frame content on some obscure occasions.
>
> We had some known cases on IPv6 for example.

Similarly I recall having experienced bugs on early Yukon chips years
ago that would regularly emit total crap on the wire.

Willy
Thread overview: 13+ messages
[not found] <20120713171835.GA26052@vault.local>
2012-07-14 8:04 ` PROBLEM: Silent data corruption when using sendfile() Hillf Danton
2012-07-14 8:20 ` Eric Dumazet
2012-07-14 8:31 ` Willy Tarreau
2012-07-14 10:13 ` Johannes Truschnigg
2012-07-14 10:33 ` Eric Dumazet
2012-07-14 10:44 ` Willy Tarreau
2012-07-14 11:06 ` Eric Dumazet
2012-07-14 13:15 ` Willy Tarreau
2012-07-14 17:09 ` Johannes Truschnigg
2012-07-14 11:44 ` Thorsten Kranzkowski
2012-07-14 14:08 ` Hillf Danton
2012-07-14 14:19 ` Eric Dumazet
2012-07-14 14:56 ` Willy Tarreau