* Re: TG3 data corruption (TSO ?) [not found] <1157704257.31071.68.camel@localhost.localdomain> @ 2006-09-08 15:49 ` Michael Chan 2006-09-08 19:29 ` Segher Boessenkool 2006-09-08 21:41 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 22+ messages in thread From: Michael Chan @ 2006-09-08 15:49 UTC (permalink / raw) To: Benjamin Herrenschmidt, netdev Cc: David S. Miller, Segher Boessenkool, Linux Kernel list Copying netdev. Benjamin Herrenschmidt wrote: > Hi ! > > I've been chasing with Segher a data corruption problem lately. > Basically transferring huge amount of data (several Gb) and I get > corrupted data at the rx side. I cannot tell for sure wether what I've > been observing here is the same problem that segher's been seing on is > blades, he will confirm or not. He also seemed to imply that reverting > to an older kernel on the -receiver- side fixed it, which makes me > wonder, since it's looks really like a sending side problem (see > explanation below), if some change in, for exmaple, window scaling, > might hide or trigger it. Please send me lspci and tg3 probing output so that I know what tg3 hardware you're using. I also want to look at the tcpdump or ethereal on the mirrored port that shows the packet being corrupted. > > Now, first, I've been playing with ssh from /dev/zero on one machine > to /dev/zero on the other. That allowed me to run enough > tests all over > the place to have some idea of where the problem comes from since ssh > will shoke at decryption when hitting the corruption. > > The base setup where it happens often is 2 Quad G5's connected to a > gigabit switch. Both were running some versions of 2.6.18-rc4 and -rc5 > (some random git actually, but see below as I've reproduced > the problem > with today's git snapshot which includes the TG3 tx race fix among > others). > > I have reproduced with various machines as the receiver. A sungem in a > Dual G5 and a virtual ethernet in a Power5 partition (so the > packets go > to an e1000 then routed through an AIX IO server to a virtual > ethernet :) are good examples of "variety" :) I haven't tested with > non-PowerPC machines so far. I've also never been able to > reproduce with > TSO disabled on the emitting TG3's > > Then, I've hacked tridge "socklib" test program (a simple TCP server > that pushes a known buffer and a simple TCP receiver that > connects to it > and reads the data). I've added comparison of the data with what they > are supposed to be on the receiving end. The interesting thing is that > is much faster than ssh or whatever else I tried. ssh or rsync between > those 2 Quad G5s give me about 35Mb/sec while I get to > 107Mb/sec average > with the small test program. > > The fun thing is, I've not been able to reproduce at all that > way. When > the link is pretty much saturated, the problem doesn't occur ! > > As soon as I introduce a small delay (some crap waiting loop) in the > sender to slow down the throughput to about 80Mb/sec, then the problem > starts occuring every now and then (I don't have precise frquency data > but I get a corruption every couple of gigabytes I'd say). > > As for my previous tests, disabling TSO on the sending side > "fixes" it. > > Below is a dump of what the corruption look like. I've trimmed the > beginning and end of the dumped packet (the receiver does 8k > reads). The > 0x5a are the expected data, the rest is corruption. They look like > kernel pointers, but that isn't always the case (often though but that > might not be relevant). The size and position within the buffer of the > corrupted data is variable (doesn't seem to be specifically a page or > anything nice and round like that). > > I've configured the switch to send all the traffic between the two > machines to a 3rd box and then recorded it with tcpdump (the > "spy" uses > an e1000) and I can see the corrupted data in the recorded > traces (the exact same pattern as detected by the receiver). > So it seems > very likely at this point that the corruption happens on the sending > side. The TCP checksums are correct I assume. I don't see any error > count on the receiving tg3 nor suspicous message in dmesg indicating > they aren't. > > That's all the data I have at this point. I can't guarantee 100% that > it's a TSO bug (it might be a bug that TSO renders visible > due to timing > effects) but it looks like it since I've not reproduced yet with TSO > disabled. I'll do an overnight test to confirm that though... > sometimes > the bug can take it's time to show up ... I've seen it wait > 20Gb before > it kicked in. Also the fact that fully loading the machine never > produced it is strange.... smells like a race. > > Cheers, > Ben. > > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 00 00 00 00 00 00 00 00 > 2f 63 70 75 73 00 7f 7e c0 00 00 00 01 cb 70 82 > 00 00 00 04 bf 1d db 4d c0 00 00 00 01 cb 92 00 > c0 00 00 01 7b fe 6d 98 c0 00 00 00 01 cb 70 91 > 00 00 00 04 df 5d fe fd c0 00 00 00 01 cb 92 10 > c0 00 00 01 7b fe 6d b8 c0 00 00 00 01 cb 71 0e > 00 00 00 04 fe e2 fb cf c0 00 00 00 01 cb 92 20 > c0 00 00 01 7b fe 6d d8 c0 00 00 00 01 cb 71 1f > 00 00 00 04 73 69 ed ff c0 00 00 00 01 cb 92 30 > c0 00 00 01 7b fe 6d f8 c0 00 00 00 01 cb 70 04 > 00 00 00 04 b9 fe cf ff c0 00 00 00 01 cb 92 40 > c0 00 00 01 7b fe 6e 18 c0 00 00 00 00 3f 7b c8 > 00 00 00 05 ff df b9 bc c0 00 00 01 7b fe 6e 38 > 00 00 00 00 00 00 00 00 63 70 75 73 00 8d f1 ce > c0 00 00 01 7b fe 73 60 c0 00 00 00 01 cb 92 64 > ff 89 d6 80 ff 89 d6 80 00 00 00 01 00 00 00 00 > c0 00 00 01 7b fe 68 00 c0 00 00 01 7b fe 6e c8 > c0 00 00 01 7b fe 6e e0 00 00 00 00 00 00 00 00 > c0 00 00 01 7b fe 6c e8 c0 00 00 01 7b fe 73 70 > c0 00 00 01 7b fe 75 48 c0 00 00 01 7b fe 73 70 > c0 00 00 01 7b fe 73 70 c0 00 00 01 7b fa 9e 80 > 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 00 00 00 00 00 2f 63 70 75 73 2f 50 6f > 77 65 72 50 43 2c 47 35 40 30 00 7e 5e 6f 4d ef > c0 00 00 00 01 cb 70 6c 00 00 00 04 7f ee b7 fe > c0 00 00 00 01 cb 92 64 c0 00 00 01 7b fe 6f 00 > c0 00 00 00 01 cb 71 35 00 00 00 04 bb cc e9 67 > c0 00 00 00 01 cb 92 74 c0 00 00 01 7b fe 6f 20 > c0 00 00 00 01 cb 71 39 00 00 00 04 2f fc eb b9 > c0 00 00 00 01 cb 92 84 c0 00 00 01 7b fe 6f 40 > c0 00 00 00 01 cb 71 3e 00 00 00 04 e7 5f be de > c0 00 00 00 01 cb 92 94 c0 00 00 01 7b fe 6f 60 > c0 00 00 00 01 cb 71 49 00 00 00 04 e6 73 e7 a7 > c0 00 00 00 01 cb 92 a4 c0 00 00 01 7b fe 6f 80 > c0 00 00 00 01 cb 71 55 00 00 00 08 1b fb 77 f9 > c0 00 00 00 01 cb 92 b4 c0 00 00 01 7b fe 6f a0 > c0 00 00 00 01 cb 70 9d 00 00 00 04 b6 db 59 ef > c0 00 00 00 01 cb 92 c8 c0 00 00 01 7b fe 6f c0 > c0 00 00 00 01 cb 71 5b 00 00 00 04 69 6f fc da > c0 00 00 00 01 cb 92 d8 c0 00 00 01 7b fe 6f e0 > c0 00 00 00 01 cb 71 69 00 00 00 04 d6 7b 66 de > c0 00 00 00 01 cb 92 e8 c0 00 00 01 7b fe 70 00 > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 15:49 ` TG3 data corruption (TSO ?) Michael Chan @ 2006-09-08 19:29 ` Segher Boessenkool 2006-09-08 19:54 ` Michael Chan 2006-09-08 21:41 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 22+ messages in thread From: Segher Boessenkool @ 2006-09-08 19:29 UTC (permalink / raw) To: Michael Chan Cc: Benjamin Herrenschmidt, netdev, David S. Miller, Linux Kernel list >> I've been chasing with Segher a data corruption problem lately. >> Basically transferring huge amount of data (several Gb) and I get >> corrupted data at the rx side. I cannot tell for sure wether what >> I've >> been observing here is the same problem that segher's been seing >> on is >> blades, he will confirm or not. He also seemed to imply that >> reverting >> to an older kernel on the -receiver- side fixed it, which makes me >> wonder, since it's looks really like a sending side problem (see >> explanation below), if some change in, for exmaple, window scaling, >> might hide or trigger it. > > Please send me lspci and tg3 probing output so that I know what > tg3 hardware you're using. I use a 5780 rev. A3, but the problem is not limited to this chip. > I also want to look at the tcpdump or > ethereal on the mirrored port that shows the packet being corrupted. I don't have such, sorry. >> That's all the data I have at this point. I can't guarantee 100% that >> it's a TSO bug (it might be a bug that TSO renders visible >> due to timing >> effects) but it looks like it since I've not reproduced yet with TSO >> disabled. It seems to indeed to only be exposed by TSO, not actually a bug of it /an sich/. I've got a patch that seems so solve the problem, it needs more testing though (maybe Ben can do this :-) ). The problem is that there should be quite a few wmb()'s in the code that are just not there; adding some to tg3_set_txd() seems to fix the immediate problem but more is needed (and I don't see why those should be needed, unless tg3_set_txd() is updating a life ring entry in place or something like that). More testing is needed, but the problem is definitely the lack of memory ordering. Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 19:29 ` Segher Boessenkool @ 2006-09-08 19:54 ` Michael Chan 2006-09-08 21:46 ` Benjamin Herrenschmidt 2006-09-11 4:53 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 22+ messages in thread From: Michael Chan @ 2006-09-08 19:54 UTC (permalink / raw) To: Segher Boessenkool Cc: Benjamin Herrenschmidt, netdev, David S. Miller, Linux Kernel list On Fri, 2006-09-08 at 21:29 +0200, Segher Boessenkool wrote: > I've got a patch that seems so solve the problem, it needs more testing > though (maybe Ben can do this :-) ). The problem is that there should > be quite a few wmb()'s in the code that are just not there; adding some > to tg3_set_txd() seems to fix the immediate problem but more is needed > (and I don't see why those should be needed, unless tg3_set_txd() is > updating a life ring entry in place or something like that). > > More testing is needed, but the problem is definitely the lack of memory > ordering. > Oh, we know about this. The powerpc writel() used to have memory barriers in 2.4 kernels but not any more in 2.6 kernels. Red Hat's version of tg3 has extra wmb()'s to fix this problem. David doesn't think that the upstream version of tg3 should have these wmb()'s, and the problem should instead be fixed in powerpc's writel(). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 19:54 ` Michael Chan @ 2006-09-08 21:46 ` Benjamin Herrenschmidt 2006-09-08 22:22 ` Michael Chan 2006-09-09 9:22 ` David Miller 2006-09-11 4:53 ` Benjamin Herrenschmidt 1 sibling, 2 replies; 22+ messages in thread From: Benjamin Herrenschmidt @ 2006-09-08 21:46 UTC (permalink / raw) To: Michael Chan Cc: Segher Boessenkool, netdev, David S. Miller, Linux Kernel list, Paul Mackerras On Fri, 2006-09-08 at 12:54 -0700, Michael Chan wrote: > On Fri, 2006-09-08 at 21:29 +0200, Segher Boessenkool wrote: > > > I've got a patch that seems so solve the problem, it needs more testing > > though (maybe Ben can do this :-) ). The problem is that there should > > be quite a few wmb()'s in the code that are just not there; adding some > > to tg3_set_txd() seems to fix the immediate problem but more is needed > > (and I don't see why those should be needed, unless tg3_set_txd() is > > updating a life ring entry in place or something like that). > > > > More testing is needed, but the problem is definitely the lack of memory > > ordering. > > > Oh, we know about this. The powerpc writel() used to have memory > barriers in 2.4 kernels but not any more in 2.6 kernels. Red Hat's > version of tg3 has extra wmb()'s to fix this problem. David doesn't > think that the upstream version of tg3 should have these wmb()'s, and > the problem should instead be fixed in powerpc's writel(). The PowerPC writel has a full sync _after_ the write, mostly to prevent it from leaking out of a spinlock, and for ordering it vs. other writel's or readl's. It doesn't provide any ordering guarantee vs cacheable storage (and was never intended to do so afaik). Such ordering shall be provided explicitely. It's possible that 2.4 used a big hammer approach but we've since been actively fixing drivers for that. It's to be noted that PowerPC might not be the only architecture affected as I don't think that in general, you have ordering guarantees between cacheable and non-cacheable stores unless you use explicit barriers. Thus I disagree with "fixing" the powerpc writel(). The barries shall definitely go into tg3. Cheers, Ben. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 21:46 ` Benjamin Herrenschmidt @ 2006-09-08 22:22 ` Michael Chan 2006-09-09 9:22 ` David Miller 1 sibling, 0 replies; 22+ messages in thread From: Michael Chan @ 2006-09-08 22:22 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Segher Boessenkool, netdev, David S. Miller, Linux Kernel list, Paul Mackerras On Sat, 2006-09-09 at 07:46 +1000, Benjamin Herrenschmidt wrote: > The PowerPC writel has a full sync _after_ the write, mostly to prevent > it from leaking out of a spinlock, and for ordering it vs. other > writel's or readl's. It doesn't provide any ordering guarantee vs > cacheable storage (and was never intended to do so afaik). Such ordering > shall > be provided explicitely. It's possible that 2.4 used a big hammer > approach but we've since been actively fixing drivers for that. It's to > be noted that PowerPC might not be the only architecture affected as I > don't think that in general, you have ordering guarantees between > cacheable and non-cacheable stores unless you use explicit barriers. I think 2.4 might have an additional sync before the write which will guarantee that the buffer descriptor is written before telling the chip to DMA it. > > Thus I disagree with "fixing" the powerpc writel(). The barries shall > definitely go into tg3. > You'll have to take this up with David. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 21:46 ` Benjamin Herrenschmidt 2006-09-08 22:22 ` Michael Chan @ 2006-09-09 9:22 ` David Miller 2006-09-09 22:36 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 22+ messages in thread From: David Miller @ 2006-09-09 9:22 UTC (permalink / raw) To: benh; +Cc: mchan, segher, netdev, linux-kernel, paulus From: Benjamin Herrenschmidt <benh@kernel.crashing.org> Date: Sat, 09 Sep 2006 07:46:02 +1000 > I don't think that in general, you have ordering guarantees between > cacheable and non-cacheable stores unless you use explicit barriers. In fact, on most systems you absolutely do have ordering between MMIO and memory accesses. So you are making an extremely poor engineering decision by trying to fixup all the drivers to match PowerPC's semantics. I think a smart engineer would decrease his debugging burdon, by matching his platform's MMIO accessors such that it matches what other platforms do and therefore inheriting the testing coverage provided by all platforms. Otherwise you will be hunting down these kinds of memory barrier issues forever. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-09 9:22 ` David Miller @ 2006-09-09 22:36 ` Benjamin Herrenschmidt 2006-09-10 0:38 ` Alan Cox 0 siblings, 1 reply; 22+ messages in thread From: Benjamin Herrenschmidt @ 2006-09-09 22:36 UTC (permalink / raw) To: David Miller; +Cc: mchan, segher, netdev, linux-kernel, paulus On Sat, 2006-09-09 at 02:22 -0700, David Miller wrote: > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Date: Sat, 09 Sep 2006 07:46:02 +1000 > > > I don't think that in general, you have ordering guarantees between > > cacheable and non-cacheable stores unless you use explicit barriers. > > In fact, on most systems you absolutely do have ordering between > MMIO and memory accesses. Well, at least powerpc and ia64 don't in hw, I don't know about others... out of order in general is getting fascionable in processor design ... > So you are making an extremely poor engineering decision > by trying to fixup all the drivers to match PowerPC's > semantics. I think a smart engineer would decrease his > debugging burdon, by matching his platform's MMIO accessors > such that it matches what other platforms do and therefore > inheriting the testing coverage provided by all platforms. > > Otherwise you will be hunting down these kinds of memory > barrier issues forever. Well, some of you (Alan, you, etc...) seem to imply that it's always been the rule to have a memory store followed by an MMIO write be strongly ordered. However, if you look at drivers like e1000, USB OHCI, or even sungem (:-) they, all have at least wmb()'s between updating descriptor in memory and the MMIO that triggers reading those by the chip. So it seems that I'm not the only one to have thought otherwise ;-) More specificaly, at least ia64 I think, like PowerPC, assumes no ordering requirement here. So they would need fixing too. My main problem is the cost... it's actually very expensive to do that sort of synchronisation. I don't know for ia64 or other potentially out of order architectures, but we do introduced a measureable performance hit by adding the one we already have to guard against spin_unlock. So if we decide to go the way of making writel synchronous vs. previous MMIOs, I'd really like to have a clearly defined "relaxed" version as well. However, I'm not sure any of our current "relaxed" accessor have clear semantics. At least what is implemented currently on PowerPC is the __raw_* versions which not only have no barriers at all (they don't even order between MMIOs, for example, readl might cross writel), and do no endian swap. Quite a mess of semantics if you ask me... Then there has been talks about those *_relaxed operations but those are more a match to the relaxed PCI-X and PCI-E cycels, that is they relax ordering vs. requests in a different direction on the bus, they have nothing to do with storage domains on the CPU. Ben. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-09 22:36 ` Benjamin Herrenschmidt @ 2006-09-10 0:38 ` Alan Cox 2006-09-10 1:17 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 22+ messages in thread From: Alan Cox @ 2006-09-10 0:38 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David Miller, mchan, segher, netdev, linux-kernel, paulus Ar Sul, 2006-09-10 am 08:36 +1000, ysgrifennodd Benjamin Herrenschmidt: > Well, some of you (Alan, you, etc...) seem to imply that it's always > been the rule to have a memory store followed by an MMIO write be > strongly ordered. It has always been the rule > However, if you look at drivers like e1000, USB OHCI, or even sungem > (:-) they, all have at least wmb()'s between updating descriptor in Driver hacks to cope with platform authors who got read/writel wrong. > semantics. At least what is implemented currently on PowerPC is the > __raw_* versions which not only have no barriers at all (they don't even > order between MMIOs, for example, readl might cross writel), and do no > endian swap. Quite a mess of semantics if you ask me... Then there has __writel/__readl seems more in keeping for just not locking. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-10 0:38 ` Alan Cox @ 2006-09-10 1:17 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 22+ messages in thread From: Benjamin Herrenschmidt @ 2006-09-10 1:17 UTC (permalink / raw) To: Alan Cox; +Cc: David Miller, mchan, segher, netdev, linux-kernel, paulus > > semantics. At least what is implemented currently on PowerPC is the > > __raw_* versions which not only have no barriers at all (they don't even > > order between MMIOs, for example, readl might cross writel), and do no > > endian swap. Quite a mess of semantics if you ask me... Then there has > > __writel/__readl seems more in keeping for just not locking. Not locking... you mean not ordering I suppose. Ok, so the question is no ordering at all (that is even between MMIO read/writes, thus a __readl can cross a __writel), or just no ordering between MMIO and cacheable storage ? It's an important difference and both have their use. For example, on PowerPC, if I completely remove barriers, I get the first semantic and I get the ability to write combine on non-cacheable storage as a benefit (provided we add an ioremap_wc or such, as the Guarded bit we set on normal non-cacheable space does also prevent write combining on most implementations). However, if I keep at least ordering between MMIOs, then I leave an eieio in there, which is not nearly as expensive than a full sync but will not order cacheable cs. non-cacheable. However, it will also prevent write combine as far as I remember. Ben. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 19:54 ` Michael Chan 2006-09-08 21:46 ` Benjamin Herrenschmidt @ 2006-09-11 4:53 ` Benjamin Herrenschmidt 2006-09-11 5:18 ` Michael Chan 1 sibling, 1 reply; 22+ messages in thread From: Benjamin Herrenschmidt @ 2006-09-11 4:53 UTC (permalink / raw) To: Michael Chan Cc: Segher Boessenkool, netdev, David S. Miller, Linux Kernel list > Oh, we know about this. The powerpc writel() used to have memory > barriers in 2.4 kernels but not any more in 2.6 kernels. Red Hat's > version of tg3 has extra wmb()'s to fix this problem. David doesn't > think that the upstream version of tg3 should have these wmb()'s, and > the problem should instead be fixed in powerpc's writel(). I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still reproduce the problem. I've also done a 2 days run without TSO enabled without a failure (my test program normally fails after a couple of minutes). Thus, do you see any other code path in the driver where a synchronisation might be missing ? Is there any case where the chip might use data in memory before it has been told to do so with a mailbox write ? (There are no "OWN" bits that I can see in the descriptors, thus I doubt it will use a transmit descriptor that is half-built before the store to the mailbox allows using it) but who knows.... That leads to the question that there might be an unrelated bug in the driver. Segher thinks we might be overriding "live" descriptors, though I haven't seen how yet. It seems to be TSO specific tho... maybe some missing smp synchronisation in the driver itself or a problem when the TX ring is full ? I don't have the chip docs and I'm not familiar with the driver, so I'll keep looking, but advice is welcome. I'll also see if I can reproduce with some other TSO capable card, in case the problem is in the kernel TSO code and not in the driver. Cheers, Ben. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-11 4:53 ` Benjamin Herrenschmidt @ 2006-09-11 5:18 ` Michael Chan 2006-09-11 5:25 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 22+ messages in thread From: Michael Chan @ 2006-09-11 5:18 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Segher Boessenkool, netdev, David S. Miller, Linux Kernel list Benjamin Herrenschmidt wrote: > I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still > reproduce the problem. I've also done a 2 days run without TSO enabled > without a failure (my test program normally fails after a couple of > minutes). > Hi Ben, The code is a bit tricky. It uses function pointers for the various register read/write methods. For the 5780, I believe it will be assigned a simple writel() and not tg3_write32_tx_mbox(). Can you double check to make sure you have actually added the wmb()? It's probably easiest to just add the wmb() in tg3_xmit_dma_bug() before the tw32_tx_mbox(). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-11 5:18 ` Michael Chan @ 2006-09-11 5:25 ` Benjamin Herrenschmidt 2006-09-11 5:33 ` Michael Chan 0 siblings, 1 reply; 22+ messages in thread From: Benjamin Herrenschmidt @ 2006-09-11 5:25 UTC (permalink / raw) To: Michael Chan Cc: Segher Boessenkool, netdev, David S. Miller, Linux Kernel list On Sun, 2006-09-10 at 22:18 -0700, Michael Chan wrote: > Benjamin Herrenschmidt wrote: > > > I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still > > reproduce the problem. I've also done a 2 days run without TSO enabled > > without a failure (my test program normally fails after a couple of > > minutes). > > > > Hi Ben, > > The code is a bit tricky. It uses function pointers for the various > register read/write methods. For the 5780, I believe it will be > assigned a simple writel() and not tg3_write32_tx_mbox(). Can you > double check to make sure you have actually added the wmb()? > > It's probably easiest to just add the wmb() in tg3_xmit_dma_bug() > before the tw32_tx_mbox(). I've done: #define tw32_rx_mbox(reg, val) do { wmb(); tp->write32_rx_mbox(tp, reg, val); } while(0) #define tw32_tx_mbox(reg, val) do { wmb(); tp->write32_tx_mbox(tp, reg, val); } while(0) Cheers, Ben. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-11 5:25 ` Benjamin Herrenschmidt @ 2006-09-11 5:33 ` Michael Chan 2006-09-11 5:52 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 22+ messages in thread From: Michael Chan @ 2006-09-11 5:33 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Segher Boessenkool, netdev, David S. Miller, Linux Kernel list Benjamin Herrenschmidt wrote: > I've done: > > #define tw32_rx_mbox(reg, val) do { wmb(); tp->write32_rx_mbox(tp, reg, val); } while(0) > #define tw32_tx_mbox(reg, val) do { wmb(); tp->write32_tx_mbox(tp, reg, val); } while(0) > That should do it. I think we need those tcpdump after all. Can you send it to me? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-11 5:33 ` Michael Chan @ 2006-09-11 5:52 ` Benjamin Herrenschmidt 2006-09-11 8:20 ` Benjamin Herrenschmidt ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Benjamin Herrenschmidt @ 2006-09-11 5:52 UTC (permalink / raw) To: Michael Chan Cc: Segher Boessenkool, netdev, David S. Miller, Linux Kernel list On Sun, 2006-09-10 at 22:33 -0700, Michael Chan wrote: > Benjamin Herrenschmidt wrote: > > > I've done: > > > > #define tw32_rx_mbox(reg, val) do { wmb(); > tp->write32_rx_mbox(tp, reg, val); } while(0) > > #define tw32_tx_mbox(reg, val) do { wmb(); > tp->write32_tx_mbox(tp, reg, val); } while(0) > > > > That should do it. > > I think we need those tcpdump after all. Can you send it to me? Looks like adding a sync to writel does fix it though... I'm trying to figure out which specific writel in the driver makes a difference. I'll then look into slicing those tcpdumps. Ben. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-11 5:52 ` Benjamin Herrenschmidt @ 2006-09-11 8:20 ` Benjamin Herrenschmidt 2006-09-11 13:54 ` Segher Boessenkool 2006-09-11 16:08 ` Michael Chan 2 siblings, 0 replies; 22+ messages in thread From: Benjamin Herrenschmidt @ 2006-09-11 8:20 UTC (permalink / raw) To: Michael Chan Cc: Segher Boessenkool, netdev, David S. Miller, Linux Kernel list Ok, it seems like we might have more than just the missing barrier in TG3. Possibly some IOMMU problems on some machines as well. Unfortunately, I don't have a tg3 on a PCI-X or PCI-E card to test on a pSeries or some other machine. [Olof: I've disabled the new U4 DART invalidate code (reverted to the old one) and added an unconditional barrier to dart_flush and I yet have to reproduce the problem. I suspect a problem with the DART invalidate one thingy, maybe a HW problem with the U4 chip. Now regarding the barrier in flush, we'll talk about it later, I think we might have a problem with the way we do the DART accesses (they might leak out of the lock) though I yet have to see that cause a problem in practice due to the round-robin nature of our allocation algorithm.] Ben. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-11 5:52 ` Benjamin Herrenschmidt 2006-09-11 8:20 ` Benjamin Herrenschmidt @ 2006-09-11 13:54 ` Segher Boessenkool 2006-09-11 16:08 ` Michael Chan 2 siblings, 0 replies; 22+ messages in thread From: Segher Boessenkool @ 2006-09-11 13:54 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Michael Chan, netdev, David S. Miller, Linux Kernel list >>> #define tw32_rx_mbox(reg, val) do { wmb(); >> tp->write32_rx_mbox(tp, reg, val); } while(0) >>> #define tw32_tx_mbox(reg, val) do { wmb(); >> tp->write32_tx_mbox(tp, reg, val); } while(0) >>> >> >> That should do it. >> >> I think we need those tcpdump after all. Can you send it to me? > > Looks like adding a sync to writel does fix it though... I'm trying to > figure out which specific writel in the driver makes a difference. > I'll > then look into slicing those tcpdumps. Adding a PowerPC "sync" to every writel() will slow down things by so much that it could well just hide the problem, not solve it... Michael, we see this problem only with TSO on, and then the failure we see is bad data being sent out, with the correct header (but the header is constructed by the tg3 in this case, so no surprise). I'm theorizing that this same failure can happen with TSO off as well, but the header sent on the wire will be bogus as well as the data, so the receiving NIC won't pick it up. tcpdump probably won't see it either... will need a low-level ethernet analyser. Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-11 5:52 ` Benjamin Herrenschmidt 2006-09-11 8:20 ` Benjamin Herrenschmidt 2006-09-11 13:54 ` Segher Boessenkool @ 2006-09-11 16:08 ` Michael Chan 2 siblings, 0 replies; 22+ messages in thread From: Michael Chan @ 2006-09-11 16:08 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Segher Boessenkool, netdev, David S. Miller, Linux Kernel list On Mon, 2006-09-11 at 15:52 +1000, Benjamin Herrenschmidt wrote: > Looks like adding a sync to writel does fix it though... I'm trying to > figure out which specific writel in the driver makes a difference. I'll > then look into slicing those tcpdumps. During runtime in the fast path, the only writel()'s we do in tg3 are to the tx mbox, rx_mbox, and the interrupt mbox. The interrupt mbox shouldn't matter that much since it has no dependencies on other memory writes before it. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 15:49 ` TG3 data corruption (TSO ?) Michael Chan 2006-09-08 19:29 ` Segher Boessenkool @ 2006-09-08 21:41 ` Benjamin Herrenschmidt 2006-09-08 22:07 ` Michael Chan 1 sibling, 1 reply; 22+ messages in thread From: Benjamin Herrenschmidt @ 2006-09-08 21:41 UTC (permalink / raw) To: Michael Chan Cc: netdev, David S. Miller, Segher Boessenkool, Linux Kernel list > Please send me lspci and tg3 probing output so that I know what > tg3 hardware you're using. I also want to look at the tcpdump or > ethereal on the mirrored port that shows the packet being corrupted. Hi Michael ! It's the dual controller of an Apple Quad G5, thus afaik in an HT2000 chip: 0001:05:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet (rev 03) Subsystem: Apple Computer Inc. Unknown device 0085 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 64 (16000ns min) Interrupt: pin A routed to IRQ 66 Region 0: Memory at fa530000 (64-bit, non-prefetchable) [size=64K] Region 2: Memory at fa520000 (64-bit, non-prefetchable) [size=64K] Capabilities: [40] PCI-X non-bridge device Command: DPERE- ERO- RBC=512 OST=1 Status: Dev=05:04.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz- Capabilities: [48] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable+ DSel=0 DScale=1 PME- Capabilities: [50] Vital Product Data Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable- Address: 00011aa5c8ce4904 Data: 18d8 0001:05:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet (rev 03) Subsystem: Apple Computer Inc. Unknown device 0085 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 16 (16000ns min), Cache Line Size: 64 bytes Interrupt: pin B routed to IRQ 67 Region 0: Memory at fa510000 (64-bit, non-prefetchable) [size=64K] Region 2: Memory at fa500000 (64-bit, non-prefetchable) [size=64K] Capabilities: [40] PCI-X non-bridge device Command: DPERE- ERO+ RBC=512 OST=1 Status: Dev=05:04.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz- Capabilities: [48] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=1 PME- Capabilities: [50] Vital Product Data Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable- Address: 4e001a0002804460 Data: 00a2 And the dmesg bits: tg3.c:v3.65 (August 07, 2006) eth0: Tigon3 [partno(BCM95780) rev 8003 PHY(5780)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:51:65:e6:90 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] eth0: dma_rwctrl[76144000] dma_mask[40-bit] eth1: Tigon3 [partno(BCM95780) rev 8003 PHY(5780)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:51:65:e6:91 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] eth1: dma_rwctrl[76144000] dma_mask[40-bit] As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a bit of surgery on it to get only the interesting part. I'll try to do that later today (but it may have to wait for monday). Cheers, Ben. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 21:41 ` Benjamin Herrenschmidt @ 2006-09-08 22:07 ` Michael Chan 2006-09-08 22:25 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 22+ messages in thread From: Michael Chan @ 2006-09-08 22:07 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: netdev, David S. Miller, Segher Boessenkool, Linux Kernel list On Sat, 2006-09-09 at 07:41 +1000, Benjamin Herrenschmidt wrote: > As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a bit of surgery on it to > get only the interesting part. I'll try to do that later today (but it may have to wait for monday). > Ben, We probably don't need the tcpdump anymore now that we know it's a memory ordering issue. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 22:07 ` Michael Chan @ 2006-09-08 22:25 ` Benjamin Herrenschmidt 2006-09-08 22:40 ` Michael Chan 0 siblings, 1 reply; 22+ messages in thread From: Benjamin Herrenschmidt @ 2006-09-08 22:25 UTC (permalink / raw) To: Michael Chan Cc: netdev, David S. Miller, Segher Boessenkool, Linux Kernel list, Paul Mackerras, Anton Blanchard On Fri, 2006-09-08 at 15:07 -0700, Michael Chan wrote: > On Sat, 2006-09-09 at 07:41 +1000, Benjamin Herrenschmidt wrote: > > > As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a bit of surgery on it to > > get only the interesting part. I'll try to do that later today (but it may have to wait for monday). > > > Ben, We probably don't need the tcpdump anymore now that we know it's a > memory ordering issue. Ok. I'm trying to figure out what's the best way with fixing that. I can see the flamewar coming on wether stores to memory vs. writel shall be ordered or not :) I'm very reluctant to add another sync instruction to our writel though. It needs one already after the stores to prevent leaking out of spinlocks (and thus possible mmio vs. mmio order issues on SMP with stores from different CPUs being re-ordered). Fixing the above would require one before the store as well. We already pay a pretty high price for that sync, having 2 would be a real shame. (Unfortunately, there is no cheap barrier available for ordering cacheable vs. non cacheable storage on PowerPC, they are completely separate domains). One option I was discussing with others would be to drop that sync after the store, and instead start requiring drivers to use mmiowb() (as defined by the ia64 folks) to provide ordering of writel's vs. locks. But that probably means breaking and then having to fix a while bunch of drivers in the tree who haven't been updated to use it... I'd rather not have to do that, or even if I go that way, not have to add that sync at all before the store and thus get back the few percent of perfs lost due to those sync's on some heavy IO benchmarks. Ben. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 22:25 ` Benjamin Herrenschmidt @ 2006-09-08 22:40 ` Michael Chan 2006-09-08 22:49 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 22+ messages in thread From: Michael Chan @ 2006-09-08 22:40 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: netdev, David S. Miller, Segher Boessenkool, Linux Kernel list, Paul Mackerras, Anton Blanchard On Sat, 2006-09-09 at 08:25 +1000, Benjamin Herrenschmidt wrote: > Ok. I'm trying to figure out what's the best way with fixing that. I can > see the flamewar coming on wether stores to memory vs. writel shall be > ordered or not :) > > I'm very reluctant to add another sync instruction to our writel though. > It needs one already after the stores to prevent leaking out of > spinlocks (and thus possible mmio vs. mmio order issues on SMP with > stores from different CPUs being re-ordered). Fixing the above would > require one before the store as well. We already pay a pretty high price > for that sync, having 2 would be a real shame. > > (Unfortunately, there is no cheap barrier available for ordering > cacheable vs. non cacheable storage on PowerPC, they are completely > separate domains). > > One option I was discussing with others would be to drop that sync after > the store, and instead start requiring drivers to use mmiowb() (as > defined by the ia64 folks) to provide ordering of writel's vs. locks. > But that probably means breaking and then having to fix a while bunch of > drivers in the tree who haven't been updated to use it... > > I'd rather not have to do that, or even if I go that way, not have to > add that sync at all before the store and thus get back the few percent > of perfs lost due to those sync's on some heavy IO benchmarks. > Another way to fix this without requiring drivers to add all kinds of barriers in the driver code is to add a writel_sync() variant. So on powerpc, writel_sync() will have a sync before and after the write. On most other architectures, writel_sync() is the same as writel() if the ordering is guaranteed. We'll then convert tg3 and other drivers to use writel_sync() in places where they're needed. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: TG3 data corruption (TSO ?) 2006-09-08 22:40 ` Michael Chan @ 2006-09-08 22:49 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 22+ messages in thread From: Benjamin Herrenschmidt @ 2006-09-08 22:49 UTC (permalink / raw) To: Michael Chan Cc: netdev, David S. Miller, Segher Boessenkool, Linux Kernel list, Paul Mackerras, Anton Blanchard > > I'd rather not have to do that, or even if I go that way, not have to > > add that sync at all before the store and thus get back the few percent > > of perfs lost due to those sync's on some heavy IO benchmarks. > > > Another way to fix this without requiring drivers to add all kinds of > barriers in the driver code is to add a writel_sync() variant. So on > powerpc, writel_sync() will have a sync before and after the write. On > most other architectures, writel_sync() is the same as writel() if the > ordering is guaranteed. We'll then convert tg3 and other drivers to use > writel_sync() in places where they're needed. I think the preferred approach for that sort of thing is to have writel be the "sync" version and add special "relaxed" version. Now there have been talks and debates about relaxed IOs but they generally map to something different, typically IOs that are relaxed vs. DMA (PCI-X/PCIe relaxed ordering options for example). Adding yet another round of IO accessors sounds like a bit nasty to me, driver writers will potentially not understand which ones to use etc... Anyway, I think I'll let Anton and Paulus argue that one for now. Cheers, Ben. ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2006-09-11 16:10 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1157704257.31071.68.camel@localhost.localdomain>
2006-09-08 15:49 ` TG3 data corruption (TSO ?) Michael Chan
2006-09-08 19:29 ` Segher Boessenkool
2006-09-08 19:54 ` Michael Chan
2006-09-08 21:46 ` Benjamin Herrenschmidt
2006-09-08 22:22 ` Michael Chan
2006-09-09 9:22 ` David Miller
2006-09-09 22:36 ` Benjamin Herrenschmidt
2006-09-10 0:38 ` Alan Cox
2006-09-10 1:17 ` Benjamin Herrenschmidt
2006-09-11 4:53 ` Benjamin Herrenschmidt
2006-09-11 5:18 ` Michael Chan
2006-09-11 5:25 ` Benjamin Herrenschmidt
2006-09-11 5:33 ` Michael Chan
2006-09-11 5:52 ` Benjamin Herrenschmidt
2006-09-11 8:20 ` Benjamin Herrenschmidt
2006-09-11 13:54 ` Segher Boessenkool
2006-09-11 16:08 ` Michael Chan
2006-09-08 21:41 ` Benjamin Herrenschmidt
2006-09-08 22:07 ` Michael Chan
2006-09-08 22:25 ` Benjamin Herrenschmidt
2006-09-08 22:40 ` Michael Chan
2006-09-08 22:49 ` Benjamin Herrenschmidt
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).