netdev.vger.kernel.org archive mirror
* "meaningful" spinlock contention when bound to non-intr CPU?
@ 2007-02-01 19:43 Rick Jones
  2007-02-01 19:46 ` Rick Jones
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Rick Jones @ 2007-02-01 19:43 UTC (permalink / raw)
  To: Linux Network Development list

For various nefarious porpoises relating to comparing and contrasting a 
single 10G NIC with N 1G ports and hopefully finding interesting 
processor cache (mis)behaviour in the stack, I got my hands on a pair of 
8 core systems with plenty of RAM and I/O slots.  (rx6600 with 1.6 GHz 
dual-core Itanium2, aka Montecito)

A 2.6.10-rc5 kernel onto each system thanks to pointers from Dan Frazier.

Into each went a quartet of dual-port 1G NICs driven by e1000 
7.3.15-k2-NAPI and I connected them back to back.  I tweaked 
smp_affinity to have each port's interrupts go to a separate core.
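
(For concreteness, a minimal C sketch of that smp_affinity tweak; the
IRQ number and core below are placeholders for whatever /proc/interrupts
shows for a given e1000 port, and the usual way is of course just to
echo the hex CPU mask into /proc/irq/<N>/smp_affinity from a shell.)

/* Sketch: steer a (hypothetical) IRQ to one core by writing a hex CPU
 * mask to /proc/irq/<N>/smp_affinity. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int irq = 59;   /* hypothetical IRQ number for one e1000 port */
    int cpu = 2;    /* core that should take this port's interrupts */
    char path[64];
    FILE *f;

    if (argc > 2) {
        irq = atoi(argv[1]);
        cpu = atoi(argv[2]);
    }
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "%x\n", 1 << cpu);  /* smp_affinity takes a hex CPU bitmask */
    return fclose(f) ? 1 : 0;
}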

Netperf2 configured with --enable-burst.

When I run eight concurrent netperf TCP_RR tests, each doing 24 
concurrent single-byte transactions (test-specific -b 24) with 
TCP_NODELAY set (test-specific -D), and bind each netserver/netperf to 
the same CPU as is taking the interrupts of the NIC handling that 
connection (global -T), things look pretty good.  Decent aggregate 
transactions per second, and nothing in the CPU profiles to suggest 
spinlock contention.

Happiness and joy.  An N CPU system behaving (at this level at least) 
like N, 1 CPU systems.
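
(Roughly what the global -T binding amounts to under the covers on 
Linux is a sched_setaffinity() call in netperf/netserver; a minimal 
sketch, with the CPU number just a placeholder:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Bind the calling process to a single CPU, roughly what netperf's
 * global -T option arranges for the netperf and netserver processes. */
static int bind_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void)
{
    return bind_to_cpu(2) ? 1 : 0;   /* placeholder CPU number */
}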

When I then bind the netperf/netservers to CPU(s) other than the ones 
taking the interrupts from the NIC(s), the aggregate transactions per 
second drops by roughly 40/135, or ~30%.  I was indeed expecting a 
delta - no idea whether one of that size is "to be expected" - but 
decided to go ahead and look at the profiles.

The profiles (either via q-syscollect or caliper) show upwards of 3% of 
the CPU consumed by spinlock contention (i.e. time spent in 
ia64_spinlock_contention).  (I'm guessing some of the rest of the perf 
drop comes from those "interesting" cache behaviours still to be sought.)

With some help from Lee Schermerhorn and Alan Brunelle I got a lockmeter 
kernel going, and it is suggesting that the greatest spinlock contention 
comes from the routines:

SPINLOCKS         HOLD            WAIT
   UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

   7.4%  2.8%  0.1us( 143us)  3.3us( 147us)( 1.4%)  75262432 97.2%  2.8%    0%  lock_sock_nested+0x30
  29.5%  6.6%  0.5us( 148us)  0.9us( 143us)(0.49%)  37622512 93.4%  6.6%    0%  tcp_v4_rcv+0xb30
   3.0%  5.6%  0.1us( 142us)  0.9us( 143us)(0.14%)  13911325 94.4%  5.6%    0%  release_sock+0x120
   9.6% 0.75%  0.1us( 144us)  0.7us( 139us)(0.08%)  75262432 99.2% 0.75%    0%  release_sock+0x30

I suppose it stands to reason that there would be contention associated 
with the socket, since there will be two things going after the socket 
(a netperf/netserver and an interrupt/up-the-stack path), each running 
on a separate CPU.  Some of it looks like it _may_ be inevitable - 
waking up the user, who will then be racing to grab the socket before 
the stack releases it?  (I may have been mis-interpreting some of the 
code I was checking.)

Still, does this look like something worth pursuing?  In a past 
life/OS, when one was able to eliminate one percentage point of 
spinlock contention, two percentage points of improvement ensued.

rick jones



* Re: "meaningful" spinlock contention when bound to non-intr CPU?
  2007-02-01 19:43 "meaningful" spinlock contention when bound to non-intr CPU? Rick Jones
@ 2007-02-01 19:46 ` Rick Jones
  2007-02-02 16:47 ` Jesse Brandeburg
  2007-02-02 19:21 ` Andi Kleen
  2 siblings, 0 replies; 10+ messages in thread
From: Rick Jones @ 2007-02-01 19:46 UTC (permalink / raw)
  To: Linux Network Development list

Rick Jones wrote:
> A 2.6.10-rc5 kernel onto each system thanks to pointers from Dan Frazier.

gaak - 2.6.20-rc5 that is.


* Re: "meaningful" spinlock contention when bound to non-intr CPU?
  2007-02-01 19:43 "meaningful" spinlock contention when bound to non-intr CPU? Rick Jones
  2007-02-01 19:46 ` Rick Jones
@ 2007-02-02 16:47 ` Jesse Brandeburg
  2007-02-02 18:17   ` Rick Jones
  2007-02-02 19:21 ` Andi Kleen
  2 siblings, 1 reply; 10+ messages in thread
From: Jesse Brandeburg @ 2007-02-02 16:47 UTC (permalink / raw)
  To: Rick Jones; +Cc: Linux Network Development list

On 2/1/07, Rick Jones <rick.jones2@hp.com> wrote:
<snip>
> With some help from Lee Schermerhorn and Alan Brunelle I got a lockmeter
> kernel going, and it is suggesting that the greatest spinlock contention
> comes from the routines:
>
> SPINLOCKS         HOLD            WAIT
>    UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
>
>    7.4%  2.8%  0.1us( 143us)  3.3us( 147us)( 1.4%)  75262432 97.2%  2.8%    0%  lock_sock_nested+0x30
>   29.5%  6.6%  0.5us( 148us)  0.9us( 143us)(0.49%)  37622512 93.4%  6.6%    0%  tcp_v4_rcv+0xb30
>    3.0%  5.6%  0.1us( 142us)  0.9us( 143us)(0.14%)  13911325 94.4%  5.6%    0%  release_sock+0x120
>    9.6% 0.75%  0.1us( 144us)  0.7us( 139us)(0.08%)  75262432 99.2% 0.75%    0%  release_sock+0x30
>
> I suppose it stands to reason that there would be contention associated
> with the socket, since there will be two things going after the socket
> (a netperf/netserver and an interrupt/up-the-stack path), each running
> on a separate CPU.  Some of it looks like it _may_ be inevitable -
> waking up the user, who will then be racing to grab the socket before
> the stack releases it?  (I may have been mis-interpreting some of the
> code I was checking.)
>
> Still, does this look like something worth pursuing?  In a past
> life/OS, when one was able to eliminate one percentage point of
> spinlock contention, two percentage points of improvement ensued.

Rick, this looks like good stuff; we're seeing more and more issues
like this as systems become more multi-core and have more interrupts
per NIC (think MSI-X).

Let me know if there is something I can do to help.


* Re: "meaningful" spinlock contention when bound to non-intr CPU?
  2007-02-02 16:47 ` Jesse Brandeburg
@ 2007-02-02 18:17   ` Rick Jones
  0 siblings, 0 replies; 10+ messages in thread
From: Rick Jones @ 2007-02-02 18:17 UTC (permalink / raw)
  To: Jesse Brandeburg; +Cc: Linux Network Development list

>> SPINLOCKS         HOLD            WAIT
>>    UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
>>
>>    7.4%  2.8%  0.1us( 143us)  3.3us( 147us)( 1.4%)  75262432 97.2%  2.8%    0%  lock_sock_nested+0x30
>>   29.5%  6.6%  0.5us( 148us)  0.9us( 143us)(0.49%)  37622512 93.4%  6.6%    0%  tcp_v4_rcv+0xb30
>>    3.0%  5.6%  0.1us( 142us)  0.9us( 143us)(0.14%)  13911325 94.4%  5.6%    0%  release_sock+0x120
>>    9.6% 0.75%  0.1us( 144us)  0.7us( 139us)(0.08%)  75262432 99.2% 0.75%    0%  release_sock+0x30
>> ...
>> Still, does this look like something worth pursuing?  In a past
>> life/OS, when one was able to eliminate one percentage point of
>> spinlock contention, two percentage points of improvement ensued.
> 
> 
> Rick, this looks like good stuff; we're seeing more and more issues
> like this as systems become more multi-core and have more interrupts
> per NIC (think MSI-X).

MSI-X - haven't even gotten to that - discussion of that probably 
overlaps with some "pci" mailing list, right?

> Let me know if there is something I can do to help.

I suppose one good step would be to reproduce the results on some other 
platform.  After that, I need to understand what those routines are 
doing much better than I currently do, particularly from an 
"architecture" perspective - I think it may involve all the prequeue / 
try-to-get-the-TCP-processing-onto-the-user's-stack stuff, but I'm 
_far_ from certain.

rick jones



* Re: "meaningful" spinlock contention when bound to non-intr CPU?
  2007-02-02 19:21 ` Andi Kleen
@ 2007-02-02 18:46   ` Rick Jones
  2007-02-02 19:06     ` Andi Kleen
  0 siblings, 1 reply; 10+ messages in thread
From: Rick Jones @ 2007-02-02 18:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Network Development list

Andi Kleen wrote:
> Rick Jones <rick.jones2@hp.com> writes:
> 
>>Still, does this look like something worth pursuing?  In a past
>>life/OS, when one was able to eliminate one percentage point of
>>spinlock contention, two percentage points of improvement ensued.
> 
> 
> The stack is really designed to go fast with per-CPU local RX processing 
> of packets. This normally works because, when waking up a task, 
> the scheduler tries to move it to that CPU. Since the wakeups are
> on the CPU that processes the incoming packets, it should usually
> end up in the right place.
> 
> The trouble is when your NICs are so fast that a single
> CPU can't keep up, or when you have programs that process many
> different sockets from a single thread.
> 
> The fast NIC case will eventually be fixed by adding proper
> support for MSI-X and connection hashing. Then the NIC can fan 
> out to multiple interrupts and use multiple CPUs to process
> the incoming packets. 

If that is implemented "well" (for some definition of well) then it 
might address the many-sockets-from-one-thread issue too, but if not...

If it is a simple "hash on the headers" then you still have issues with 
a process/thread servicing multiple connections - the hashes of the 
different headers will send things up the stack on different CPUs, and 
you induce the scheduler to flip the process back and forth between them.
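
(A toy illustration of that concern - not any real NIC's hash, just a
stand-in: two connections owned by the same thread, differing only in
source port, can land in different RX queues and hence on different
CPUs.)

#include <stdint.h>
#include <stdio.h>

struct flow {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Placeholder "hash on the headers" -- real hardware would use
 * something like a Toeplitz hash over the same 4-tuple. */
static unsigned int rx_cpu(const struct flow *f, unsigned int ncpus)
{
    uint32_t h = f->saddr ^ f->daddr ^ ((uint32_t)f->sport << 16 | f->dport);

    h ^= h >> 16;
    return h % ncpus;
}

int main(void)
{
    /* same client/server pair, two connections from one thread,
     * differing only in source port */
    struct flow a = { 0x0a000001, 0x0a000002, 40001, 12865 };
    struct flow b = { 0x0a000001, 0x0a000002, 40002, 12865 };

    printf("flow a -> cpu %u\n", rx_cpu(&a, 8));
    printf("flow b -> cpu %u\n", rx_cpu(&b, 8));
    return 0;
}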

The meta question behind all that would seem to be whether the scheduler 
should be telling us where to perform the network processing, or should 
the network processing be telling the scheduler what to do? (eg all my 
old blathering about IPS vs TOPS in HP-UX...)

> Then there is the case of a single process having many 
> sockets from different NICs. This will of course be somewhat slower
> because there will be cross-CPU traffic.

The extreme case I see with the netperf test suggests it will be a 
pretty big hit.  Dragging cachelines from CPU to CPU is evil.  Sometimes 
a necessary evil of course, but still evil.

> However there should
> not be much socket lock contention, because a process handling
> many sockets will hopefully be unlikely to bang on each of
> its many sockets at exactly the same time as the stack
> receives RX packets. This should also eliminate the spinlock
> contention.
> 
> From that theory your test sounds somewhat unrealistic to me. 
> 
> Do you have any evidence you're modelling a real world scenario
> here? I somehow doubt it.

Well, yes and no.  If I drop the "burst" and instead have N times more 
netperfs going, I see the same lock contention situation.  I wasn't 
expecting to - thinking that with N different processes on each CPU 
the likelihood of contention on any one socket would be low - but it 
was there just the same.

That is part of what makes me wonder if there is a race between wakeup 
and release of a lock.


rick


* Re: "meaningful" spinlock contention when bound to non-intr CPU?
  2007-02-02 18:46   ` Rick Jones
@ 2007-02-02 19:06     ` Andi Kleen
  2007-02-02 19:54       ` Rick Jones
  0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2007-02-02 19:06 UTC (permalink / raw)
  To: Rick Jones; +Cc: Linux Network Development list


> 
> The meta question behind all that would seem to be whether the scheduler 
> should be telling us where to perform the network processing, or should 
> the network processing be telling the scheduler what to do? (eg all my 
> old blathering about IPS vs TOPS in HP-UX...)

That's an unsolved problem.  But past experiments suggest that giving
the scheduler more imperatives than just "use CPUs well" is often a net loss.

I suspect it cannot be completely solved in the general case. 

> Well, yes and no.  If I drop the "burst" and instead have N times more 
> netperfs going, I see the same lock contention situation.  I wasn't 
> expecting to - thinking that with N different processes on each CPU 
> the likelihood of contention on any one socket would be low - but it 
> was there just the same.
> 
> That is part of what makes me wonder if there is a race between wakeup 

A race?

> and release of a lock.

You could try with echo 1 > /proc/sys/net/ipv4/tcp_low_latency.
That should change RX locking behaviour significantly.

-Andi


* Re: "meaningful" spinlock contention when bound to non-intr CPU?
  2007-02-01 19:43 "meaningful" spinlock contention when bound to non-intr CPU? Rick Jones
  2007-02-01 19:46 ` Rick Jones
  2007-02-02 16:47 ` Jesse Brandeburg
@ 2007-02-02 19:21 ` Andi Kleen
  2007-02-02 18:46   ` Rick Jones
  2 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2007-02-02 19:21 UTC (permalink / raw)
  To: Rick Jones; +Cc: Linux Network Development list

Rick Jones <rick.jones2@hp.com> writes:
> 
> Still, does this look like something worth pursuing?  In a past
> life/OS, when one was able to eliminate one percentage point of
> spinlock contention, two percentage points of improvement ensued.

The stack is really designed to go fast with per-CPU local RX processing 
of packets. This normally works because, when waking up a task, 
the scheduler tries to move it to that CPU. Since the wakeups are
on the CPU that processes the incoming packets, it should usually
end up in the right place.

The trouble is when your NICs are so fast that a single
CPU can't keep up, or when you have programs that process many
different sockets from a single thread.

The fast NIC case will eventually be fixed by adding proper
support for MSI-X and connection hashing. Then the NIC can fan 
out to multiple interrupts and use multiple CPUs to process
the incoming packets. 

Then there is the case of a single process having many 
sockets from different NICs. This will of course be somewhat slower
because there will be cross-CPU traffic. However there should
not be much socket lock contention, because a process handling
many sockets will hopefully be unlikely to bang on each of
its many sockets at exactly the same time as the stack
receives RX packets. This should also eliminate the spinlock
contention.

From that theory your test sounds somewhat unrealistic to me. 

Do you have any evidence you're modelling a real world scenario
here? I somehow doubt it.

-Andi 


* Re: "meaningful" spinlock contention when bound to non-intr CPU?
  2007-02-02 19:06     ` Andi Kleen
@ 2007-02-02 19:54       ` Rick Jones
  2007-02-02 20:20         ` Andi Kleen
  0 siblings, 1 reply; 10+ messages in thread
From: Rick Jones @ 2007-02-02 19:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Network Development list

Andi Kleen wrote:
>>The meta question behind all that would seem to be whether the scheduler 
>>should be telling us where to perform the network processing, or should 
>>the network processing be telling the scheduler what to do? (eg all my 
>>old blathering about IPS vs TOPS in HP-UX...)
> 
> 
> That's an unsolved problem.  But past experiments suggest that giving
> the scheduler more imperatives than just "use CPUs well" is often a net loss.

I wasn't thinking about giving the scheduler more imperatives really 
(?), just letting "networking" know more about where the threads 
accessing given connections last executed. (e.g. TOPS)

> I suspect it cannot be completely solved in the general case. 

Not unless the NIC can peer into the connection table and see where each 
connection was last accessed by user-space.

>>Well, yes and no.  If I drop the "burst" and instead have N times more 
>>netperfs going, I see the same lock contention situation.  I wasn't 
>>expecting to - thinking that with N different processes on each CPU 
>>the likelihood of contention on any one socket would be low - but it 
>>was there just the same.
>>
>>That is part of what makes me wonder if there is a race between wakeup 
> 
> 
> A race?

Perhaps a poor choice of words on my part - something along the lines of:

hold_lock();
wake_up_someone();
release_lock();

where the someone being awoken can try to grab the lock before the path 
doing the waking manages to release it.
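
(A user-space caricature of that pattern, using pthreads purely to
illustrate the window - the kernel's socket lock and wait queue are of
course not a pthread mutex/condvar.  The second variant moves the
wakeup outside the critical section.)

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int data_ready;

static void producer_wake_inside(void)
{
    pthread_mutex_lock(&lock);
    data_ready = 1;
    pthread_cond_signal(&cond);  /* wakeup issued while holding the lock... */
    /* ...the woken consumer may already be running on another CPU,
     * contending for `lock` until we drop it here */
    pthread_mutex_unlock(&lock);
}

static void producer_wake_after(void)
{
    pthread_mutex_lock(&lock);
    data_ready = 1;
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&cond);  /* wakeup after release: no such window */
}

static void *consumer(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!data_ready)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    return arg;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, consumer, NULL);
    producer_wake_inside();      /* or producer_wake_after() */
    pthread_join(t, NULL);
    return 0;
}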

> 
> 
>>and release of a lock.
> 
> 
> You could try with echo 1 > /proc/sys/net/ipv4/tcp_low_latency.
> That should change RX locking behaviour significantly.

Running the same 8 netperfs with TCP_RR and burst, bound to a different 
CPU than the NIC interrupt, the lockmeter output looks virtually 
unchanged.  Still release_sock, tcp_v4_rcv, and lock_sock_nested at 
their same offsets.

However, if I run the multiple-connections-per-thread code, have each 
thread service 32 concurrent connections, and bind to a CPU other than 
the interrupt CPU, the lock contention does appear to go away.

rick jones


* Re: "meaningful" spinlock contention when bound to non-intr CPU?
  2007-02-02 19:54       ` Rick Jones
@ 2007-02-02 20:20         ` Andi Kleen
  2007-02-02 20:41           ` Rick Jones
  0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2007-02-02 20:20 UTC (permalink / raw)
  To: Rick Jones; +Cc: Linux Network Development list


> Perhaps a poor choice of words on my part - something along the lines of:
> 
> hold_lock();
> wake_up_someone();
> release_lock();
> 
> where the someone being awoken can try to grab the lock before the path 
> doing the waking manages to release it.

Yes, the wakeup happens deep inside the critical section, and if the process
is running on another CPU it could race to the lock.

Hmm, I suppose the wakeup could be moved out, but it would need some 
restructuring of the code. Also, to be safe, the code would still need to 
at least hold a reference count on the sock during the wakeup, and when 
that is released then you have another cache line to bounce, which might 
not be any better than the lock. So it might not actually be worth it.
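
(Very roughly, that restructuring as an RX-side fragment only - the
struct, names, and C11 atomics here are placeholders standing in for
the kernel's sock refcounting, not actual stack code.)

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct fake_sock {
    pthread_mutex_t lock;       /* stand-in for the socket lock */
    pthread_cond_t wq;          /* stand-in for the socket wait queue */
    atomic_int refcnt;
    int rx_pending;
};

static void sock_hold(struct fake_sock *sk)
{
    atomic_fetch_add(&sk->refcnt, 1);
}

static void sock_put(struct fake_sock *sk)
{
    if (atomic_fetch_sub(&sk->refcnt, 1) == 1)
        free(sk);               /* last reference dropped */
}

static void rx_deliver(struct fake_sock *sk)
{
    pthread_mutex_lock(&sk->lock);
    sk->rx_pending = 1;
    sock_hold(sk);              /* pin sk across the wakeup */
    pthread_mutex_unlock(&sk->lock);   /* drop the lock first... */

    pthread_cond_signal(&sk->wq);      /* ...then wake the sleeper */
    sock_put(sk);               /* the refcount touch is itself another
                                 * shared cache line to bounce */
}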

I suppose the socket release could be at least partially protected with
RCU against this case so that could be done without a reference count, but 
it might be tricky to get this right.

Again still not sure it's worth handling this.

-Andi


* Re: "meaningful" spinlock contention when bound to non-intr CPU?
  2007-02-02 20:20         ` Andi Kleen
@ 2007-02-02 20:41           ` Rick Jones
  0 siblings, 0 replies; 10+ messages in thread
From: Rick Jones @ 2007-02-02 20:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Network Development list

> Yes, the wakeup happens deep inside the critical section, and if the process
> is running on another CPU it could race to the lock.
> 
> Hmm, I suppose the wakeup could be moved out, but it would need some 
> restructuring of the code. Also, to be safe, the code would still need
> to at least hold a reference count on the sock during the wakeup, and
> when that is released then you have another cache line to bounce,
> which might not be any better than the lock. So it might not
> actually be worth it.
> 
> I suppose the socket release could be at least partially protected with
> RCU against this case so that could be done without a reference count, but 
> it might be tricky to get this right.
> 
> Again still not sure it's worth handling this.

Based on my experiments thus far I'd have to agree/accept (I wasn't 
certain to begin with - hence the post in the first place :) but I do 
need/want to see what happens with a single stream through a 10G NIC - 
on the receive side, at least, with a 1500 byte MTU.

I was using the burst-mode aggregate RR over the 1G NICs to get the CPU 
util up without needing considerable bandwidth, since the system 
handled 8 TCP_STREAM tests across the 8 NICs without working up a sweat. 
I suppose I could instead chop the MTU on the 1G NICs and use that to 
increase the CPU util on the receive side.
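
(If I go that route, the MTU chop itself is simple enough - e.g. via
the SIOCSIFMTU ioctl; the interface name and MTU value below are just
placeholders, and "ip link set <dev> mtu <N>" does the same thing.)

#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth2", IFNAMSIZ - 1);  /* placeholder NIC */
    ifr.ifr_mtu = 576;          /* smaller MTU -> more packets per KB */
    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0) {
        perror("SIOCSIFMTU");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}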

rick

