From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Gallatin
Subject: Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
Date: Tue, 28 Apr 2009 11:00:16 -0400
Message-ID: <49F71A00.5090701@myri.com>
References: <49E5DABB.9070806@myri.com> <49E64BE4.1050908@myri.com>
 <20090415.164248.188350673.davem@davemloft.net>
 <20090416085022.GA19731@gondor.apana.org.au> <49EE1C32.1060202@myri.com>
 <20090422104811.GA30981@gondor.apana.org.au> <49EF39B4.1040607@myri.com>
 <20090424054557.GA24575@gondor.apana.org.au> <49F1E5C8.7010303@myri.com>
 <20090427080501.GA21433@gondor.apana.org.au>
 <20090428061225.GA1591@gondor.apana.org.au>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: David Miller , brice@myri.com, sgruszka@redhat.com, netdev@vger.kernel.org
To: Herbert Xu
Return-path:
Received: from mailbox2.myri.com ([64.172.73.26]:1951 "EHLO myri.com"
 rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
 id S1761518AbZD1PBS (ORCPT ); Tue, 28 Apr 2009 11:01:18 -0400
In-Reply-To: <20090428061225.GA1591@gondor.apana.org.au>
Sender: netdev-owner@vger.kernel.org
List-ID:

Herbert Xu wrote:
> On Mon, Apr 27, 2009 at 04:05:01PM +0800, Herbert Xu wrote:
>> On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
>>> These results are indeed quite close, so the performance problem seems
>>> isolated to AMD CPUS, and perhaps due to the smaller caches.
>>> Do you have any AMD you can use as a receiver?
>> I now have an AMD with 512K cache to test this. Unfortunately
>> I'd just locked it up before I got a chance to do any serious
>> testing. So it might take a while.
>
> OK that's been fixed up. Indeed the AMD can't do wire speed.
> But still the performance seems comparable. Both of them sit
> between 6600Mb/s and 7100Mb/s. The sender is running at about
> 66% idle in either case.

It's strange, I still consistently see about 1Gb/s better performance
from LRO than GRO on this weak machine (6.5Gb/s LRO, 5.5Gb/s GRO)
when binding everything to the same CPU. mpstat -P 0 shows roughly
10% more time spent in "soft" when using GRO vs LRO:

GRO:
10:17:45  CPU  %user  %nice  %system  %iowait  %irq  %soft  %idle    intr/s
10:17:46    0   0.00   0.00    54.00     0.00  0.00  46.00   0.00  11754.00
10:17:47    0   0.00   0.00    54.00     0.00  1.00  45.00   0.00  11718.00
10:17:48    0   0.00   0.00    47.00     0.00  2.00  51.00   0.00  11639.00

LRO:
10:21:55  CPU  %user  %nice  %system  %iowait  %irq  %soft  %idle    intr/s
10:21:56    0   0.00   0.00    66.00     0.00  1.00  33.00   0.00  13228.00
10:21:57    0   0.00   0.00    65.35     0.00  1.98  32.67   0.00  13118.81
10:21:58    0   0.00   0.00    63.00     0.00  1.00  36.00   0.00  13238.00

According to oprofile, the top 20 samples running GRO are:

CPU: AMD64 processors, speed 2050.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name   app name   symbol name
4382     30.5408  vmlinux      vmlinux    copy_user_generic_string
534       3.7218  myri10ge.ko  myri10ge   myri10ge_poll
463       3.2269  vmlinux      vmlinux    _raw_spin_lock
394       2.7460  vmlinux      vmlinux    rb_get_reader_page
382       2.6624  vmlinux      vmlinux    acpi_pm_read
356       2.4812  vmlinux      vmlinux    inet_gro_receive
293       2.0421  oprofiled    oprofiled  (no symbols)
268       1.8679  vmlinux      vmlinux    find_next_bit
268       1.8679  vmlinux      vmlinux    tg_shares_up
257       1.7912  vmlinux      vmlinux    ring_buffer_consume
247       1.7215  myri10ge.ko  myri10ge   myri10ge_alloc_rx_pages
247       1.7215  vmlinux      vmlinux    tcp_gro_receive
228       1.5891  vmlinux      vmlinux    __free_pages_ok
219       1.5263  vmlinux      vmlinux    skb_gro_receive
167       1.1639  vmlinux      vmlinux    skb_gro_header
149       1.0385  bash         bash       (no symbols)
141       0.9827  vmlinux      vmlinux    skb_copy_datagram_iovec
132       0.9200  vmlinux      vmlinux    rb_buffer_peek
129       0.8991  vmlinux      vmlinux    _raw_spin_unlock
123       0.8573  vmlinux      vmlinux    delay_tsc

Nothing really stands out for me.

Here is LRO:

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name   app name   symbol name
4884     33.1164  vmlinux      vmlinux    copy_user_generic_string
721       4.8888  myri10ge.ko  myri10ge   myri10ge_poll
580       3.9327  vmlinux      vmlinux    _raw_spin_lock
409       2.7733  vmlinux      vmlinux    acpi_pm_read
306       2.0749  vmlinux      vmlinux    rb_get_reader_page
293       1.9867  oprofiled    oprofiled  (no symbols)
286       1.9392  myri10ge.ko  myri10ge   myri10ge_get_frag_header
253       1.7155  vmlinux      vmlinux    __lro_proc_segment
250       1.6951  vmlinux      vmlinux    rb_buffer_peek
247       1.6748  vmlinux      vmlinux    ring_buffer_consume
232       1.5731  vmlinux      vmlinux    __free_pages_ok
211       1.4307  myri10ge.ko  myri10ge   myri10ge_alloc_rx_pages
206       1.3968  vmlinux      vmlinux    tg_shares_up
175       1.1866  vmlinux      vmlinux    skb_copy_datagram_iovec
158       1.0713  vmlinux      vmlinux    find_next_bit
146       0.9900  vmlinux      vmlinux    lro_tcp_ip_check
131       0.8883  oprofile.ko  oprofile   op_cpu_buffer_read_entry
127       0.8611  vmlinux      vmlinux    delay_tsc
125       0.8476  bash         bash       (no symbols)
125       0.8476  vmlinux      vmlinux    _raw_spin_unlock

If I can't figure out why LRO is so much faster in some cases, then
I think maybe I'll just put together a patch which keeps LRO, and
does GRO only if LRO is disabled. Kind of ugly, but better than
losing 15% performance on some machines.

Drew
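
As a sanity check on the figures quoted in this message, here is a minimal sketch that averages the %soft column from the three mpstat samples per mode and works out the relative throughput gap. It is not part of the original measurements; the variable names are illustrative, and the only inputs are the numbers that already appear above.

```python
# %soft samples copied from the mpstat output in this message.
gro_soft = [46.00, 45.00, 51.00]
lro_soft = [33.00, 32.67, 36.00]


def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


gro_mean = mean(gro_soft)
lro_mean = mean(lro_soft)
soft_gap = gro_mean - lro_mean  # percentage points of CPU time

# Throughput figures quoted in the message (Gb/s).
lro_gbps = 6.5
gro_gbps = 5.5
relative_loss = (lro_gbps - gro_gbps) / lro_gbps  # fraction lost by using GRO

print(f"mean %soft  GRO: {gro_mean:.2f}  LRO: {lro_mean:.2f}  gap: {soft_gap:.2f}")
print(f"throughput loss with GRO relative to LRO: {relative_loss:.1%}")
```

With only three one-second samples per mode this is a rough comparison, but it lands on a softirq gap of about 13 percentage points and a throughput loss of about 15%, in line with the "roughly 10%" and "15%" quoted in the text.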