* performance regression with skb_add_data_nocache
@ 2012-10-12 14:29 Animesh K Trivedi1
  2012-10-12 17:38 ` Rick Jones
From: Animesh K Trivedi1 @ 2012-10-12 14:29 UTC (permalink / raw)
  To: netdev; +Cc: Bernard Metzler, Animesh K Trivedi1


Hi all,

I recently upgraded from 2.6.36 to 3.2.28 and saw a regression in TCP
performance. Upon further investigation, skb_add_data_nocache() looked
like the culprit.

I am getting the following numbers on my Nehalem (Xeon E7520) box,
connected with 10GbE cards (transmit side, netperf client). The server
is another box with an E5540 CPU (receiver of the requests). For my
netperf TCP_RR tests:

-  1,400 bytes request, 1 byte response:
No cache copy (enabled)  : 26,623 tps, 22.72% utilization
No cache copy (disabled) : 26,710 tps, 21.76% utilization

- 14,000 bytes request, 1 byte response:
No cache copy (enabled)  : 14,245 tps, 23.04% utilization
No cache copy (disabled) : 14,850 tps, 21.6% utilization

and for even larger buffers the performance lag increases, along with
significant CPU load:

- 1 MBytes request, 1 byte response:
No cache copy (enabled)  : 1,032 tps, 98.96% utilization
No cache copy (disabled) : 1,081 tps, 74.86% utilization

Though there isn't a lot of performance difference, notice the
significantly higher CPU utilization with the nocache copy at the
1 MB buffer size. Thoughts?
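
(For reference, the TCP_RR runs above were invoked roughly as follows;
the server hostname is illustrative, and -c/-C ask netperf to report
local and remote CPU utilization:

    netperf -H server -t TCP_RR -c -C -- -r 1400,1
    netperf -H server -t TCP_RR -c -C -- -r 14000,1
    netperf -H server -t TCP_RR -c -C -- -r 1048576,1
)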

Thanks,
--
Animesh


* Re: performance regression with skb_add_data_nocache
  2012-10-12 14:29 performance regression with skb_add_data_nocache Animesh K Trivedi1
@ 2012-10-12 17:38 ` Rick Jones
From: Rick Jones @ 2012-10-12 17:38 UTC (permalink / raw)
  To: Animesh K Trivedi1; +Cc: netdev, Bernard Metzler

On 10/12/2012 07:29 AM, Animesh K Trivedi1 wrote:
>
> Hi all,
>
> I recently upgraded from 2.6.36 to 3.2.28 and saw a regression in
> TCP performance. Upon further investigation, skb_add_data_nocache()
> looked like the culprit.
>
> I am getting the following numbers on my Nehalem (Xeon E7520) box,
> connected with 10GbE cards (transmit side, netperf client). The
> server is another box with an E5540 CPU (receiver of the requests).
> For my netperf TCP_RR tests:
>
> -  1,400 bytes request, 1 byte response:
> No cache copy (enabled)  : 26,623 tps, 22.72% utilization
> No cache copy (disabled) : 26,710 tps, 21.76% utilization
>
> - 14,000 bytes request, 1 byte response:
> No cache copy (enabled)  : 14,245 tps, 23.04% utilization
> No cache copy (disabled) : 14,850 tps, 21.6% utilization
>
> and for even larger buffers the performance lag increases, along
> with significant CPU load:
>
> - 1 MBytes request, 1 byte response:
> No cache copy (enabled)  : 1,032 tps, 98.96% utilization
> No cache copy (disabled) : 1,081 tps, 74.86% utilization
>
> Though there isn't a lot of performance difference, notice the
> significantly higher CPU utilization with the nocache copy at the
> 1 MB buffer size. Thoughts?

Over the years I have found there can be some run-to-run variability 
with the TCP_RR test - certainly the single byte one.

To have the i's dotted and t's crossed, I would suggest you shoot the 
irqbalanced in the head and make sure that all the IRQs of the 10GbE 
card (which?) are bound to the same CPU, so as you go from netperf run 
to netperf run you do not go from interrupt CPU to interrupt CPU. (1) 
You should also make certain that netperf and netserver are bound to the 
same CPU each time - initially I would suggest the same CPU as is taking 
the interrupts from the NIC. You can do that either with taskset or with 
the netperf global -T option.  Whether you then move on to "same chip, 
same core, different thread" and/or "same chip, different core" and/or 
"different chip" (and/or "different chip, other side of "glue" if this 
is a > 4 socket box with glue) is up to you.
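
Concretely, that is something along these lines (the IRQ number,
interface name and server hostname here are illustrative):

    # see which IRQs the 10GbE card's queues are using
    grep eth2 /proc/interrupts
    # stop irqbalanced so the affinity settings stick
    killall irqbalance
    # bind a queue's IRQ (say 84) to CPU 0 - repeat for each queue IRQ
    echo 1 > /proc/irq/84/smp_affinity
    # pin netperf to that same CPU, either with taskset ...
    taskset -c 0 netperf -H server -t TCP_RR -- -r 1400,1
    # ... or with netperf's own -T option (local,remote CPU)
    netperf -H server -T 0,0 -t TCP_RR -- -r 1400,1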

I would also suggest enabling the confidence interval feature of netperf 
with a global -i 30,3 option.  You can make the interval narrower or 
wider with the global -I option.  The idea would be to make sure the 
interval is narrower than the difference you are seeing.
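
For example (again, hostname illustrative; -I 99,5 asks for 99%
confidence that the mean is within +/- 2.5% of what is reported):

    netperf -H server -i 30,3 -I 99,5 -t TCP_RR -- -r 14000,1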

Netperf will normalize the throughput and CPU utilization to a service 
demand.  Reporting that can help make the overhead differences more 
clear.  It might be interesting to include TCP_STREAM results with the 
test-specific -m option set to 1400, 14000, and 1048576 as well.
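
Something like the following would report service demand alongside
throughput (hostname illustrative; -c/-C enable local and remote CPU
measurement):

    netperf -H server -t TCP_STREAM -c -C -- -m 1400
    netperf -H server -t TCP_STREAM -c -C -- -m 14000
    netperf -H server -t TCP_STREAM -c -C -- -m 1048576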

That there would be greater overhead with a no cache copy seems 
unsurprising to me, particularly if it has a side effect of minimizing 
or eliminating pre-fetching.

happy benchmarking,

rick jones

(1) that is what I like to do anyway because it is easier (IMO) than 
making netperf use the same four-tuple for the data connection each 
time, so it gets hashed/whatnot by the NIC the same way each time.  I 
just send all the queues to the same CPU and am done with it - for 
single-instance testing a la 
http://www.netperf.org/svn/netperf2/trunk/doc/examples/runemomni.sh . 
If I am running aggregate netperf tests I'll either let irqbalanced do 
its thing, or leave it off and spread the IRQs around by hand.

