From mboxrd@z Thu Jan  1 00:00:00 1970
From: Rick Jones
Subject: Re: [RFC] TCP congestion schedulers
Date: Mon, 21 Mar 2005 16:10:36 -0800
Message-ID: <423F627C.2060100@hp.com>
References: <421CF5E5.1060606@ev-en.org>
	<20050223135732.39e62c6c.davem@davemloft.net>
	<421D1E66.5090301@osdl.org>
	<421D30FA.1060900@ev-en.org>
	<20050225120814.5fa77b13@dxpl.pdx.osdl.net>
	<20050309210442.3e9786a6.davem@davemloft.net>
	<4230288F.1030202@ev-en.org>
	<20050310182629.1eab09ec.davem@davemloft.net>
	<20050311120054.4bbf675a@dxpl.pdx.osdl.net>
	<20050311201011.360c00da.davem@davemloft.net>
	<20050314151726.532af90d@dxpl.pdx.osdl.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
To: netdev@oss.sgi.com
In-Reply-To:
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
List-Id: netdev.vger.kernel.org

John Heffner wrote:
> On Sat, 19 Mar 2005, Andi Kleen wrote:
>
>>Stephen Hemminger writes:
>>
>>>Since developers want to experiment with different congestion
>>>control mechanisms, and the kernel is getting bloated with overlapping
>>>data structure and code for multiple algorithms; here is a patch to
>>>split out the Reno, Vegas, Westwood, BIC congestion control stuff
>>>into an infrastructure similar to the I/O schedulers.
>>
>>[...]
>>
>>Did you do any benchmarks to check that wont slow it down?
>>
>>I would recommend to try it on a IA64 machine if possible. In the
>>past we found that adding indirect function calls on IA64 to networking
>>caused measurable slowdowns in macrobenchmarks.
>>In that case it was LSM callbacks, but your code looks like it will
>>add even more.
>
> Is there a canonical benchmark?

I would put forth netperf - but then I'm of course biased. It is
reasonably straightforward to run, is sophisticated enough to look for
interesting things, and not so big as some benchmarketing benchmarks
that require other software besides the stack (e.g. web servers and
whatnot).

If using netperf versions < 2.4.0 (netperf versions, not to be confused
with Linux kernel versions), make sure it is compiled with the makefile
edited to have -DUSE_PROC_STAT and _NOT_ have -DHISTOGRAM or
-DINTERVALS. If using the rc1 of 2.4.0, just typing "configure" after
unpacking the tar file should suffice under Linux, but before compiling
make sure config.h has a "USE_PROC_STAT" in it. If it is missing
USE_PROC_STAT, then add --enable-cpuutil=procstat to the configure
step.

Be certain to request CPU utilization numbers with the -c/-C options.
Probably best to request confidence intervals as well. I'd suggest a
"128x32" TCP_STREAM test and a "1x1" TCP_RR test. So, something along
the lines of:

netperf -H <remote> -i 10,3 -I 99,5 -l 60 -t TCP_STREAM -- -s 128K -S 128K -m 32K

to have netperf request 128KB socket buffers and pass 32KB in each call
to send. Each iteration lasts 60 seconds, with at least three and no
more than 10 iterations run to get to the point where it is 99% certain
("confident") that the reported mean for throughput and CPU util is
within +/- 2.5% of the actual mean. You can make that -I 99,2 to be
+/- 1%, at the risk of having a harder time hitting the confidence
intervals. If at first you do not hit the confidence intervals, you can
increase the values in -i up to 30 and/or increase the iteration run
time with -l.

For the TCP_RR test:

netperf -H <remote> -i 10,3 -I 99,5 -l 60 -t TCP_RR

which will be as above except running a TCP_RR test. The default in a
TCP_RR test is a single-byte request and a single-byte response.
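Just to pull that together in one place - using a made-up hostname of
"sutbox" as a stand-in for the real remote system, and tacking on the
-c/-C CPU utilization options mentioned above - the pair of runs might
look like:

netperf -H sutbox -c -C -i 10,3 -I 99,5 -l 60 -t TCP_STREAM -- -s 128K -S 128K -m 32K
netperf -H sutbox -c -C -i 10,3 -I 99,5 -l 60 -t TCP_RR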
If you grab 2.4.0rc1 and run on an MP system, it may be good for
reproducibility to use the -T option to pin netperf and/or netserver to
specific CPUs.

-T 0   will attempt to bind both netperf and netserver to CPU 0
-T 1,0 will attempt to bind netperf to CPU 1 and netserver to CPU 0
-T 0,  will bind netperf to CPU 0 and leave netserver floating
-T ,1  will bind netserver to CPU 1 and leave netperf floating

I would suggest two situations - one where netperf/netserver is bound
to the same CPU as the one taking interrupts from the NIC, and one
where it is not. How broad the "where it is not" case needs/wants to be
depends on just how many degrees of "not the same CPU" one has on the
system (thinking NUMA).

netperf bits can be found at:

ftp://ftp.cup.hp.com/dist/networking/benchmarks/netperf/

with the 2.4.0rc1 bits in the experimental/ subdirectory. There is a
Debian package floating around somewhere, but I cannot recall the
revision of netperf on which it is based, so it is probably best to
grab source bits and compile them.

Interrupt avoidance/coalescing may have a noticeable effect on
single-stream netperf TCP_RR performance, capping it at a lower
transaction-per-second rating no matter the increase in CPU util. So,
it is very important to include the CPU util measurements. Similarly,
if a system can already max out a GbE link, just looking at bits per
second does not suffice.

For situations where the CPU utilization measurement mechanism is
questionable (I'm still not sure about the -DUSE_PROC_STAT stuff and
interrupt time...any comments there most welcome) it may be preferable
to run aggregate tests. Netperf2 has no explicit synchronization, but
if one is content with "stopwatch" accuracy, measuring aggregate
performance along the lines of:

for i in 1 2 ... N
do
   netperf -t TCP_RR -H <remote> -i 30 -l 60 -P 0 -v 0 &
done

may suffice. The -P 0 disables output of the test headers. The -v 0
will cause just the Single Figure of Merit (SFM) to be displayed - in
this case the transaction-per-second rate. Here the -i 30 is to make
each instance of netperf run 30 iterations, the idea being that at
least 28 of them will be while the other N-1 netperfs are running. And,
hitting the (default -I 99,5) confidence interval gives us some
confidence that any skew is reasonably close to epsilon.

The idea is to take N high enough to saturate the CPU(s) in the system
and peak the aggregate transaction rate. Single-byte transactions are
used to avoid pegging the link on bits per second. Since this is
"stopwatch" I tend to watch to make sure that they all start and end
"close" to one another. (NB the combination of -i 30 and -l 60 means
the test will run for an hour...alter at your discretion.)

For aggregate tests it is generally best to have three systems - the
System Under Test (SUT) and two or more load generators (LGs). With
just a pair of systems, the single load generator can sometimes
saturate before the SUT would when driven by two or more LGs.

> Would you really expect a single extra indirect call per ack to have a
> significant performance impact?  This is surprising to me.  Where does the
> cost come from?  Replacing instruction cache lines?

I don't have specific data on hand, but the way the selinux stuff
used(?) to be implemented did indeed not run very well at all, even
when selinux was disabled (enabled was another story entirely...). Even
if a single extra indirect call is nearly epsilon, the "thousand cuts"
principle would apply. Enough of them and the claims about other OSes
having faster networking may actually become true - if they aren't true
already. But I may be drifting...

rick jones
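P.S. In case it helps, here is a slightly fleshed-out, runnable version
of that aggregate loop. The hostname "sutbox" and the count of 4
concurrent netperfs are just placeholders for illustration - substitute
the real SUT name and raise the count until the SUT's CPU(s) saturate:

#!/bin/sh
# aggregate single-byte TCP_RR sketch - sutbox and N=4 are placeholders
REMOTE=sutbox
N=4
for i in $(seq 1 $N)
do
    # -P 0 -v 0 so each instance prints only its transactions/s figure
    netperf -t TCP_RR -H $REMOTE -i 30 -l 60 -P 0 -v 0 &
done
wait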