From mboxrd@z Thu Jan 1 00:00:00 1970
From: linux@arm.linux.org.uk (Russell King - ARM Linux)
Date: Thu, 3 Apr 2014 11:32:06 +0100
Subject: FEC ethernet issues [Was: PL310 errata workarounds]
In-Reply-To:
References: <20140401225149.GC7528@n2100.arm.linux.org.uk>
 <20140402085914.GG7528@n2100.arm.linux.org.uk>
 <20140402104644.GI7528@n2100.arm.linux.org.uk>
 <20140402165113.GJ7528@n2100.arm.linux.org.uk>
 <20140403085636.GL7528@n2100.arm.linux.org.uk>
Message-ID: <20140403103206.GN7528@n2100.arm.linux.org.uk>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Thu, Apr 03, 2014 at 09:55:06AM +0000, fugang.duan at freescale.com wrote:
> From: Russell King - ARM Linux
> Date: Thursday, April 03, 2014 4:57 PM
>
> >To: Duan Fugang-B38611
> >Cc: robert.daniels at vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
> >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel at lists.infradead.org
> >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
> >
> >On Thu, Apr 03, 2014 at 02:41:46AM +0000, fugang.duan at freescale.com wrote:
> >> From: Russell King - ARM Linux
> >> Date: Thursday, April 03, 2014 12:51 AM
> >>
> >> >Checksum and... presumably you're referring to NAPI don't get you to
> >> >that kind of speed. Even on x86, you can't get close to wire speed
> >> >without GSO, which you need scatter-gather for, and you don't support
> >> >that. So I don't believe your 900Mbps figure.
> >> >
> >> >Plus, as you're memcpy'ing every packet received, I don't believe you
> >> >can reach 940Mbps receive either.
> >> >
> >> Since the i.MX6SX ENET still doesn't support TSO or jumbo packets,
> >> scatter-gather cannot improve ethernet performance in most cases,
> >> especially for iperf tests.
> >
> >Again, you are losing credibility every time you deny stuff like this.
> >I'm now at the point of just not listening to you anymore because you're
> >contradicting what I know to be solid fact through my own measurements.
> >
> >This seems to be Freescale's overall attitude - as I've read on Freescale's
> >forums. Your customers/users are always wrong, you're always right. E.g., any
> >performance issues are not the fault of Freescale stuff, it's tarnished
> >connectors or similar.
> >
> Hi, Russell,
>
> I don't contradict your thinking/solution or your measurements. You are
> an expert on ARM and these modules, and we discuss this with you in a
> spirit of learning. For i.MX6SX, we did indeed get that result. For the
> i.MX6Q/DL Linux upstream, you did a great job on performance tuning, and
> the test result is similar to our internal test result. Your suggestions
> for the optimization are meaningful. Please understand my thinking.

The reason I said what I said is because I'm not talking about TSO, I'm
talking about GSO. They're similar features, but they are done in totally
different ways: TSO requires either hardware support or driver cooperation
to segment the data; GSO does not.

There are several points here which make me discount your figures:

1. The overhead of routing packets is not insignificant.

GSO is a feature where the higher levels (such as TCP) can submit large
socket buffers to the lower levels, all the way through the packet queues
to just before the device. At the very last moment, just before the buffer
is handed over to the driver's start_xmit function, the buffer is carved
into appropriately sized chunks and new skbuffs are allocated. Each skbuff
"head" contains the protocol headers, and the associated fragments contain
pointers/lengths into the large buffer. (This is why SG is required for
GSO.)
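To make that concrete, here is a rough sketch (not the FEC driver's actual
code) of what an SG-capable start_xmit method ends up doing with each skb
that GSO produces. fill_desc() is a made-up stand-in for whatever writes
one hardware TX descriptor, and unwinding of earlier DMA mappings on error
is omitted for brevity:

	/*
	 * Rough sketch only - not the FEC driver's code.  fill_desc() is a
	 * hypothetical helper standing in for whatever writes one hardware
	 * TX descriptor (DMA address, length, "last buffer of frame" flag).
	 */
	#include <linux/netdevice.h>
	#include <linux/skbuff.h>
	#include <linux/dma-mapping.h>

	static void fill_desc(struct net_device *ndev, dma_addr_t addr,
			      unsigned int len, bool last);	/* hypothetical */

	static netdev_tx_t sg_xmit_sketch(struct sk_buff *skb,
					  struct net_device *ndev)
	{
		struct device *dev = ndev->dev.parent;
		int nr_frags = skb_shinfo(skb)->nr_frags;
		dma_addr_t addr;
		int i;

		/* The skb head holds the protocol headers GSO built for this chunk. */
		addr = dma_map_single(dev, skb->data, skb_headlen(skb),
				      DMA_TO_DEVICE);
		if (dma_mapping_error(dev, addr))
			goto drop;
		fill_desc(ndev, addr, skb_headlen(skb), nr_frags == 0);

		/* The frags are pointers/lengths into the original large buffer. */
		for (i = 0; i < nr_frags; i++) {
			const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

			addr = skb_frag_dma_map(dev, frag, 0, skb_frag_size(frag),
						DMA_TO_DEVICE);
			if (dma_mapping_error(dev, addr))
				goto drop;
			fill_desc(ndev, addr, skb_frag_size(frag),
				  i == nr_frags - 1);
		}

		return NETDEV_TX_OK;

	drop:
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}

Broadly speaking, the stack will only turn GSO on for a device that
advertises NETIF_F_SG (together with checksum offload) in its feature
flags, which is why SG support is the prerequisite described above.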
What GSO means in practice is that the overhead from the layers such as
TCP / IP / routing / packet queueing is greatly reduced: rather than those
code paths having to be run for every single 1500-byte packet, they're run
once for maybe 16 to 32K of data to be sent. This avoids the overhead of a
lot of code.

I have monitored the RGMII TX_CTL signal while performing iperf tests
without GSO, and I can see from the gap between packets that it is very
rare for it to get to the point of transmitting two packets back to back.

2. At 500Mbps transmit on the iMX6S, performance testing shows that the
majority of CPU time is spent in the cache cleaning/flushing code. This
overhead is necessary and can't be removed without risking data corruption.

3. I discount your receive figure because, even with all my work, if I set
my x86 box to perform a UDP iperf test with the x86 sending to the iMX6S,
the result remains extremely poor, with 95% (modified) to 99% (unmodified)
of UDP packets lost.

This is not entirely the fault of the FEC. Around 70% of the packet loss
is in the UDP receive path - iperf seems to be unable to read the packets
from UDP fast enough. (This is confirmed by checking the statistics in
/proc/net/snmp; see the sketch at the end of this mail.) This means 30% of
the packet loss is due to the FEC not keeping up with the on-wire packet
rate (which is around 810Mbps).

Therefore, I place the maximum receive rate of the FEC, with the current
packet-copy strategy in the receive path and a 128-entry receive ring, at
around 550Mbps - and again, much of the overhead comes from the cache
handling code according to perf.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up...
slowly improving, and getting towards what was expected from it.
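The UDP statistics referred to under point 3 live on the "Udp:" lines of
/proc/net/snmp: one line of field names (InDatagrams, NoPorts, InErrors,
OutDatagrams, RcvbufErrors, ...) followed by one line of values. A minimal
dumper for those counters, purely as a sketch, looks like this:

	/*
	 * Sketch only: dump the UDP counters from /proc/net/snmp.  Each
	 * protocol appears as a line of field names followed by a line of
	 * values; the first "Udp:" line names the fields, the second
	 * carries the counters.
	 */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char names[512] = "", values[512] = "", line[512];
		char *n, *v, *np, *vp;
		FILE *f = fopen("/proc/net/snmp", "r");

		if (!f) {
			perror("/proc/net/snmp");
			return 1;
		}

		while (fgets(line, sizeof(line), f)) {
			if (strncmp(line, "Udp:", 4))
				continue;
			if (!names[0])
				strcpy(names, line);	/* field names */
			else
				strcpy(values, line);	/* counters */
		}
		fclose(f);

		/* Print "name value" pairs by walking both lines in step. */
		n = strtok_r(names, " \n", &np);
		v = strtok_r(values, " \n", &vp);
		while (n && v) {
			printf("%-16s %s\n", n, v);
			n = strtok_r(NULL, " \n", &np);
			v = strtok_r(NULL, " \n", &vp);
		}
		return 0;
	}

Watching InErrors and RcvbufErrors climb during an iperf run is what
distinguishes loss in the UDP/socket receive path from loss in the driver
or on the wire.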