From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rick Jones
Subject: Re: UDP splice
Date: Mon, 24 Jun 2013 14:33:50 -0700
Message-ID: <51C8BB3E.8090701@hp.com>
References: <1372088554.1896.3.camel@bwh-desktop.uk.level5networks.com>
 <20130624155154.GD10413@order.stressinduktion.org>
 <1372089776.1896.9.camel@bwh-desktop.uk.level5networks.com>
 <20130624170119.GE10413@order.stressinduktion.org>
 <1372096418.3301.75.camel@edumazet-glaptop>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Eric Dumazet, Ben Hutchings, netdev@vger.kernel.org
To: Ricardo Landim
Return-path:
Received: from g1t0027.austin.hp.com ([15.216.28.34]:39681 "EHLO
 g1t0027.austin.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1751013Ab3FXVdv (ORCPT );
 Mon, 24 Jun 2013 17:33:51 -0400
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On 06/24/2013 11:08 AM, Ricardo Landim wrote:
> Help in zero copy and improve in cost of syscalls.
>
> In my intel xeon(3.3ghz), read udp socket and write udp socket (proxy)
> spends ~40000 cycles (~12 us).

Are you quite certain your Xeon was actually running at 3.3GHz at the
time? I just did a quick netperf UDP_RR test between an old
Centrino-based laptop (HP 8510w) pegged at 1.6 GHz (cpufreq-set) and it
was reporting a service demand of 12.2 microseconds per transaction,
which is, basically, a send and recv pair plus stack:

root@raj-8510w:~# netperf -t UDP_RR -c -i 30,3 -H tardy.usa.hp.com -- -r 140,1
MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 () port 0 AF_INET to tardy.usa.hp.com () port 0 AF_INET : +/-2.500% @ 99% conf. : demo : first burst 0
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      : 1.120%
!!!                       Local CPU util  : 6.527%
!!!                       Remote CPU util : 0.000%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate      local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec   % S    % U    us/Tr   us/Tr

180224 180224 140     1      10.00   12985.58  7.93   -1.00  12.221  -1.000
212992 212992

(Don't fret too much about the confidence intervals bit, it almost made
it.)

Also, my 1400 byte test didn't have all that different a service demand:

root@raj-8510w:~# netperf -t UDP_RR -c -i 30,3 -H tardy.usa.hp.com -- -r 1400,1
MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 () port 0 AF_INET to tardy.usa.hp.com () port 0 AF_INET : +/-2.500% @ 99% conf. : demo : first burst 0
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      : 1.123%
!!!                       Local CPU util  : 6.991%
!!!                       Remote CPU util : 0.000%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate      local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec   % S    % U    us/Tr   us/Tr

180224 180224 1400    1      10.00   10055.33  6.27   -1.00  12.469  -1.000
212992 212992

Of course I didn't try very hard to force cache misses (eg using a big
send/recv ring) and there may have been other things happening on the
system causing a change between the two tests (separated by an hour or
so). I didn't make sure that interrupts stayed assigned to a specific
CPU, nor that netperf did. The kernel:

root@raj-8510w:~# uname -a
Linux raj-8510w 3.8.0-25-generic #37-Ubuntu SMP Thu Jun 6 20:47:30 UTC 2013 i686 i686 i686 GNU/Linux

In general, I suppose if you want to quantify the overhead of copies,
you can try something like the two tests above, but for longer run times
and with more intermediate data points, as you walk the request or
response size up. Watch the change in service demand as you go. So long
as you stay below 1472 bytes (assuming IPv4 over a "standard" 1500 byte
MTU Ethernet) you won't generate fragments, and so will still have the
same number of packets per transaction.

Or you could "perf" profile and look for copy routines.

happy benchmarking,

rick jones
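
The arithmetic behind the figures in this message can be cross-checked with a short sketch. It is only illustrative: the function names are hypothetical, and the header sizes assume IPv4 without options (20 bytes) plus the 8-byte UDP header.

```python
# Rough cross-check of the numbers quoted in the message.
# Assumptions (not from the original post): IPv4 header carries no
# options, and the helper names below are made up for illustration.

IPV4_HEADER_BYTES = 20  # minimum IPv4 header, no options
UDP_HEADER_BYTES = 8    # fixed UDP header size


def cycles_to_us(cycles, clock_hz):
    """Convert a CPU cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1e6


def us_to_cycles(microseconds, clock_hz):
    """Convert microseconds to CPU cycles at the given clock rate."""
    return microseconds * 1e-6 * clock_hz


def max_unfragmented_udp_payload(mtu=1500):
    """Largest UDP payload that fits in one IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES


if __name__ == "__main__":
    # ~40000 cycles at a nominal 3.3 GHz, as quoted from the Xeon report
    print(f"{cycles_to_us(40000, 3.3e9):.2f} us")
    # 12.221 us/Tr measured on the laptop pegged at 1.6 GHz
    print(f"{us_to_cycles(12.221, 1.6e9):.0f} cycles")
    # Payload limit for a 1500 byte MTU
    print(max_unfragmented_udp_payload(), "bytes")
```

Comparing the two conversions shows why the clock-speed question matters: a similar microsecond figure corresponds to roughly half as many cycles on the 1.6 GHz laptop.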