* parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock)
       [not found] ` <1191850490.4352.41.camel@localhost>
@ 2007-10-08 14:22 ` Jeff Garzik
  2007-10-08 15:18   ` jamal
  2007-10-08 21:11   ` parallel networking David Miller
  0 siblings, 2 replies; 8+ messages in thread

From: Jeff Garzik @ 2007-10-08 14:22 UTC
To: hadi
Cc: David Miller, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert,
    kaber, shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
    netdev, rdreier, Ingo Molnar, mchan, general, kumarkr, tgraf,
    randy.dunlap, sri, Linux Kernel Mailing List

jamal wrote:
> On Sun, 2007-07-10 at 21:51 -0700, David Miller wrote:
>
>> For these high performance 10Gbit cards it's a load balancing
>> function, really, as all of the transmit queues go out to the same
>> physical port so you could:
>>
>> 1) Load balance on CPU number.
>> 2) Load balance on "flow"
>> 3) Load balance on destination MAC
>>
>> etc. etc. etc.
>
> The brain-block I am having is the parallelization aspect of it.
> Whatever scheme it is - it needs to ensure the scheduler works as
> expected. For example, if it was a strict prio scheduler I would
> expect that whatever goes out is always high priority first, and to
> never allow a low prio packet out at any time there's something high
> prio needing to go out. If I have the two priorities running on two
> cpus, then I can't guarantee that effect.

Any chance the NIC hardware could provide that guarantee?

8139cp, for example, has two TX DMA rings with hardcoded
characteristics: one is a high prio q, and one a low prio q. The logic
is pretty simple: empty the high prio q first (potentially starving the
low prio q, in the worst case).

In terms of overall parallelization, both for TX as well as RX, my gut
feeling is that we want to move towards an MSI-X, multi-core friendly
model where packets are LIKELY to be sent and received by the same set
of [cpus | cores | packages | nodes] as the [userland] processes
dealing with the data.

There are already some primitive NUMA bits in skbuff allocation, but
with modern MSI-X and RX/TX flow hashing we could do a whole lot more,
along the lines of better CPU scheduling decisions, directing flows to
clusters of cpus, and generally doing a better job of maximizing cache
efficiency in a modern multi-threaded environment.

IMO the current model, where each NIC's TX completion and RX processing
are both locked to the same CPU, is outmoded in a multi-core world with
modern NICs. :)

But I readily admit general ignorance about the kernel process
scheduling stuff, so my only idea for a starting point was to see how
far we can go with the concept of "skb affinity" -- a mask in sk_buff
that is a hint about which cpu(s) the NIC should attempt to send and
receive packets on. When going through bonding or netfilter, it is
trivial to 'or' together affinity masks. All the various layers of the
net stack should attempt to honor the skb affinity, where feasible
(requires interaction with the CFS scheduler?).

Or maybe skb affinity is a dumb idea. I wanted to get people thinking
about the bigger picture. Parallelization starts at the user process.

	Jeff
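To make the skb-affinity proposal above concrete, here is a minimal
userspace sketch of the idea. The struct, field, and helper names are
hypothetical - nothing like this exists in the kernel's struct sk_buff -
they only model a per-packet CPU mask that layers such as bonding or
netfilter could OR together as the packet moves down the stack.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical packet descriptor carrying a CPU affinity hint. */
    struct fake_skb {
            uint64_t cpu_affinity;  /* bit N set => CPU N is a preferred CPU */
            /* real packet data would live here */
    };

    /* Each layer contributes its preference without erasing earlier hints. */
    static void skb_affinity_merge(struct fake_skb *skb, uint64_t layer_mask)
    {
            skb->cpu_affinity |= layer_mask;
    }

    /* A driver could pick any set bit, e.g. the lowest-numbered hinted CPU. */
    static int skb_affinity_pick(const struct fake_skb *skb)
    {
            for (int cpu = 0; cpu < 64; cpu++)
                    if (skb->cpu_affinity & (1ULL << cpu))
                            return cpu;
            return 0;       /* no hint: fall back to CPU 0 */
    }

    int main(void)
    {
            struct fake_skb skb = { .cpu_affinity = 0 };

            skb_affinity_merge(&skb, 1ULL << 3);    /* user process runs on CPU 3 */
            skb_affinity_merge(&skb, 1ULL << 5);    /* bonding slave prefers CPU 5 */
            printf("TX CPU hint: %d\n", skb_affinity_pick(&skb));
            return 0;
    }

The open policy question, which the rest of the thread gets into, is who
consumes the hint: the qdisc layer, the driver, or the process scheduler.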
* Re: parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock)
  2007-10-08 14:22 ` parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock) Jeff Garzik
@ 2007-10-08 15:18   ` jamal
  2007-10-08 21:11   ` parallel networking David Miller
  1 sibling, 0 replies; 8+ messages in thread

From: jamal @ 2007-10-08 15:18 UTC
To: Jeff Garzik
Cc: David Miller, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert,
    kaber, shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
    netdev, rdreier, Ingo Molnar, mchan, general, kumarkr, tgraf,
    randy.dunlap, sri, Linux Kernel Mailing List

On Mon, 2007-08-10 at 10:22 -0400, Jeff Garzik wrote:

> Any chance the NIC hardware could provide that guarantee?

If you can get the scheduling/dequeuing to run on one CPU (as we do
today) it should work; alternatively you can totally bypass the qdisc
subsystem and go directly to the hardware for devices that are capable.
That would work, but would require huge changes. My fear is having
mini-scheduler pieces running on multiple cpus, which is what I
understood as being described.

> 8139cp, for example, has two TX DMA rings with hardcoded
> characteristics: one is a high prio q, and one a low prio q. The logic
> is pretty simple: empty the high prio q first (potentially starving the
> low prio q, in the worst case).

Sounds like strict prio scheduling to me, which says "if low prio
starves, so be it".

> In terms of overall parallelization, both for TX as well as RX, my gut
> feeling is that we want to move towards an MSI-X, multi-core friendly
> model where packets are LIKELY to be sent and received by the same set
> of [cpus | cores | packages | nodes] as the [userland] processes
> dealing with the data.

Does putting things on the same core help? But overall I agree with
your views.

> There are already some primitive NUMA bits in skbuff allocation, but
> with modern MSI-X and RX/TX flow hashing we could do a whole lot more,
> along the lines of better CPU scheduling decisions, directing flows to
> clusters of cpus, and generally doing a better job of maximizing cache
> efficiency in a modern multi-threaded environment.

I think I see the receive side with a lot of clarity; I am still foggy
on the transmit path, mostly because of the qos/scheduling issues.

> IMO the current model, where each NIC's TX completion and RX processing
> are both locked to the same CPU, is outmoded in a multi-core world with
> modern NICs. :)

In fact, even with the status quo there's a case that can be made for
not binding interrupts. In my recent experience with batching, due to
the nature of my test app, if I let the interrupts float across
multiple cpus I benefit. My app runs/binds a thread per CPU and so
benefits from having more juice to send more packets per unit of time -
something I wouldn't get if I was always running on one cpu. But when I
do this I found that just because I have bound a thread to cpu3 doesn't
mean that thread will always run on cpu3. If the netif_wakeup happens
on cpu1, the scheduler will put the thread on cpu1 if it is to be run.
It made sense to do that; it just took me a while to digest.

> But I readily admit general ignorance about the kernel process
> scheduling stuff, so my only idea for a starting point was to see how
> far we can go with the concept of "skb affinity" -- a mask in sk_buff
> that is a hint about which cpu(s) the NIC should attempt to send and
> receive packets on. When going through bonding or netfilter, it is
> trivial to 'or' together affinity masks. All the various layers of the
> net stack should attempt to honor the skb affinity, where feasible
> (requires interaction with the CFS scheduler?).

There would be cache benefits if you could free the packet on the same
cpu it was allocated on, so the idea of skb affinity is useful at a
minimum in that sense, if you can pull it off. Assuming the hardware is
capable, even if you just tagged the skb on xmit to say which cpu it
was sent out on, and made sure that's where it is freed, that would be
a good start.

Note: the majority of the packet processing overhead is _still_ memory
subsystem latency; in my tests with batched pktgen, improving the xmit
subsystem meant the overhead of allocating and freeing the packets grew
to something over 80%. So something along the lines of parallelizing
based on a split of alloc/free of skbs, IMO on more cpus than where
xmit/receive run, would see more performance improvements.

> Or maybe skb affinity is a dumb idea. I wanted to get people thinking
> about the bigger picture. Parallelization starts at the user process.

cheers,
jamal
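A self-contained model of the alloc/free locality Jamal describes:
remember which CPU allocated a buffer and hand it back to that CPU's
free list on TX completion, so allocation and freeing stay cache-local.
The types and helpers are invented for illustration; this is not kernel
code, and the kernel's slab and skb recycling paths are far more
involved.

    #include <stdio.h>
    #include <stdlib.h>

    #define NCPUS 4

    struct pkt {
            int alloc_cpu;          /* CPU that allocated this buffer */
            struct pkt *next;
            char data[256];
    };

    /* One free list per CPU; a real version would use per-CPU variables
     * and proper locking or lock-free lists. */
    static struct pkt *free_list[NCPUS];

    static struct pkt *pkt_alloc(int cpu)
    {
            struct pkt *p = free_list[cpu];

            if (p)
                    free_list[cpu] = p->next;
            else
                    p = calloc(1, sizeof(*p));
            if (p)
                    p->alloc_cpu = cpu;
            return p;
    }

    /* TX completion: return the buffer to the CPU that allocated it,
     * regardless of which CPU runs the completion handler. */
    static void pkt_free(struct pkt *p)
    {
            int cpu = p->alloc_cpu;

            p->next = free_list[cpu];
            free_list[cpu] = p;
    }

    int main(void)
    {
            struct pkt *p = pkt_alloc(2);   /* app thread pinned to CPU 2 */

            if (!p)
                    return 1;
            /* ... driver transmits; the completion IRQ may land on CPU 0 ... */
            pkt_free(p);                    /* buffer returns to CPU 2's list */
            printf("freed back to CPU %d's list\n", 2);
            return 0;
    }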
* Re: parallel networking
  2007-10-08 14:22 ` parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock) Jeff Garzik
  2007-10-08 15:18   ` jamal
@ 2007-10-08 21:11   ` David Miller
  2007-10-08 22:30     ` jamal
  2007-10-09  1:53     ` Jeff Garzik
  1 sibling, 2 replies; 8+ messages in thread

From: David Miller @ 2007-10-08 21:11 UTC
To: jeff
Cc: hadi, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
    shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
    netdev, rdreier, mingo, mchan, general, kumarkr, tgraf,
    randy.dunlap, sri, linux-kernel

From: Jeff Garzik <jeff@garzik.org>
Date: Mon, 08 Oct 2007 10:22:28 -0400

> In terms of overall parallelization, both for TX as well as RX, my gut
> feeling is that we want to move towards an MSI-X, multi-core friendly
> model where packets are LIKELY to be sent and received by the same set
> of [cpus | cores | packages | nodes] as the [userland] processes
> dealing with the data.

The problem is that the packet schedulers want global guarantees on
packet ordering, not flow-centric ones.

That is the issue Jamal is concerned about.

The more I think about it, the more inevitable it seems that we really
might need multiple qdiscs, one for each TX queue, to pull this full
parallelization off.

But the semantics of that don't smell so nice either. If the user
attaches a new qdisc to "ethN", does it go to all the TX queues, or
what?

All of the traffic shaping technology deals with the device as a unary
object. It doesn't fit multi-queue at all.
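One way to picture the "one qdisc per TX queue" layout David describes
is sketched below. The structures are invented and heavily simplified -
they are not the kernel's struct net_device or struct Qdisc - and the
ambiguity he raises is resolved here by fanning the requested qdisc out
to every queue, one instance per ring, so queues never share scheduler
state.

    #include <stdlib.h>

    struct fake_qdisc {
            const char *kind;
            /* enqueue/dequeue ops, per-queue packet backlog, ... */
    };

    struct fake_txq {
            struct fake_qdisc *qdisc;   /* each hardware TX ring gets its own */
    };

    struct fake_netdev {
            unsigned int num_tx_queues;
            struct fake_txq *txq;
    };

    /* "Attach a qdisc to ethN": instantiate the same kind on every ring. */
    static int dev_attach_qdisc(struct fake_netdev *dev, const char *kind)
    {
            for (unsigned int i = 0; i < dev->num_tx_queues; i++) {
                    struct fake_qdisc *q = calloc(1, sizeof(*q));

                    if (!q)
                            return -1;
                    q->kind = kind;
                    dev->txq[i].qdisc = q;
            }
            return 0;
    }

    int main(void)
    {
            struct fake_txq rings[4] = { { NULL } };
            struct fake_netdev dev = { .num_tx_queues = 4, .txq = rings };

            return dev_attach_qdisc(&dev, "pfifo_fast");
    }

Note that this sidesteps the global-ordering question rather than
answering it: classful setups that are supposed to shape across the
whole device would still need a single point of coordination.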
* Re: parallel networking
  2007-10-08 21:11 ` parallel networking David Miller
@ 2007-10-08 22:30   ` jamal
  2007-10-08 22:33     ` David Miller
  2007-10-09  1:53   ` Jeff Garzik
  1 sibling, 1 reply; 8+ messages in thread

From: jamal @ 2007-10-08 22:30 UTC
To: David Miller
Cc: jeff, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
    shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
    netdev, rdreier, mingo, mchan, general, kumarkr, tgraf,
    randy.dunlap, sri, linux-kernel

On Mon, 2007-08-10 at 14:11 -0700, David Miller wrote:

> The problem is that the packet schedulers want global guarantees on
> packet ordering, not flow-centric ones.
>
> That is the issue Jamal is concerned about.

Indeed, thank you for giving it better wording.

> The more I think about it, the more inevitable it seems that we really
> might need multiple qdiscs, one for each TX queue, to pull this full
> parallelization off.
>
> But the semantics of that don't smell so nice either. If the user
> attaches a new qdisc to "ethN", does it go to all the TX queues, or
> what?
>
> All of the traffic shaping technology deals with the device as a unary
> object. It doesn't fit multi-queue at all.

If you let only one CPU at a time access the "xmit path", you solve all
the reordering. If you want to be more fine-grained, you make the
serialization point as low as possible in the stack - perhaps in the
driver. But I think even what we have today, with only one cpu entering
the dequeue/scheduler region, is _for starters_ not bad actually ;->

What I am finding (and I can tell you I have been trying hard ;->) is
that a sufficiently fast cpu doesn't sit in the dequeue area for "too
long" (and batching reduces the time spent there further). Very quickly
there are no more packets for it to dequeue from the qdisc, or the
driver is stopped, and it has to get out of there. If you don't have
any interrupt tied to a specific cpu then you can have many cpus enter
and leave that region all the time.

cheers,
jamal
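A toy model of the "only one CPU in the dequeue/scheduler region at a
time" behaviour Jamal relies on. It is loosely modelled on the idea of a
per-device "running" flag; the names and structure are invented for
illustration, not lifted from the kernel.

    #include <stdatomic.h>
    #include <stdbool.h>

    struct txpath {
            atomic_flag running;    /* set while some CPU owns the dequeue loop */
            /* qdisc state, device ring, ... */
    };

    /* Pop one packet from the qdisc and hand it to the driver; returns false
     * when the qdisc is empty or the device ring is full (stub here). */
    static bool txpath_dequeue_one(struct txpath *tp)
    {
            (void)tp;
            return false;
    }

    /* Every CPU that enqueued a packet calls this, but only the CPU that wins
     * the flag actually runs the dequeue loop; the others return immediately,
     * trusting the owner to drain their packets too. */
    static void txpath_run(struct txpath *tp)
    {
            if (atomic_flag_test_and_set(&tp->running))
                    return;         /* another CPU is already dequeuing */

            while (txpath_dequeue_one(tp))
                    ;

            atomic_flag_clear(&tp->running);
            /* A real implementation must re-check for work queued between the
             * loop ending and the flag being cleared, or packets can stall. */
    }

    int main(void)
    {
            struct txpath tp = { .running = ATOMIC_FLAG_INIT };

            txpath_run(&tp);
            return 0;
    }

Ordering is preserved because only the flag owner ever pulls packets off
the qdisc; the cost is that the owning CPU does other CPUs' transmit
work, which is exactly the serialization the rest of the thread debates.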
* Re: parallel networking
  2007-10-08 22:30 ` jamal
@ 2007-10-08 22:33   ` David Miller
  2007-10-08 22:35     ` Waskiewicz Jr, Peter P
  2007-10-08 23:42     ` jamal
  0 siblings, 2 replies; 8+ messages in thread

From: David Miller @ 2007-10-08 22:33 UTC
To: hadi
Cc: jeff, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
    shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
    netdev, rdreier, mingo, mchan, general, kumarkr, tgraf,
    randy.dunlap, sri, linux-kernel

From: jamal <hadi@cyberus.ca>
Date: Mon, 08 Oct 2007 18:30:18 -0400

> Very quickly there are no more packets for it to dequeue from the
> qdisc, or the driver is stopped, and it has to get out of there. If
> you don't have any interrupt tied to a specific cpu then you can have
> many cpus enter and leave that region all the time.

With the lock shuttling back and forth between those cpus, which is
what we're trying to avoid.

Multiply whatever effect you think you might be able to measure due to
that on your 2 or 4 way system up to 64 cpus or so for the machines I
am using. This is where machines are going, and it is going to become
the norm.
* RE: parallel networking
  2007-10-08 22:33 ` David Miller
@ 2007-10-08 22:35   ` Waskiewicz Jr, Peter P
  2007-10-08 23:42   ` jamal
  1 sibling, 0 replies; 8+ messages in thread

From: Waskiewicz Jr, Peter P @ 2007-10-08 22:35 UTC
To: David Miller, hadi
Cc: jeff, krkumar2, johnpol, herbert, kaber, shemminger, jagana,
    Robert.Olsson, rick.jones2, xma, gaagaan, netdev, rdreier, mingo,
    mchan, general, kumarkr, tgraf, randy.dunlap, sri, linux-kernel

> Multiply whatever effect you think you might be able to measure due to
> that on your 2 or 4 way system up to 64 cpus or so for the machines I
> am using. This is where machines are going, and it is going to become
> the norm.

That, along with speeds going to 10 GbE with multiple Tx/Rx queues (and
40 and 100 GbE under discussion now), is where multiple CPUs hitting
the driver are needed to push line rate without cratering the entire
machine.

-PJ Waskiewicz
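To put rough numbers on that line-rate concern, here is a back-of-envelope
packets-per-second calculation. It assumes standard Ethernet per-frame
overhead (8-byte preamble plus 12-byte inter-frame gap, with the FCS
already counted inside the frame) and is only meant to show the order of
magnitude a single port can demand from the stack.

    #include <stdio.h>

    /* Packets per second for a given link speed and on-wire frame size. */
    static double pps(double gbits_per_sec, unsigned int frame_bytes)
    {
            double wire_bytes = frame_bytes + 8 + 12;   /* preamble + IFG */

            return gbits_per_sec * 1e9 / (wire_bytes * 8);
    }

    int main(void)
    {
            printf("10 GbE, 1518-byte frames: %.0f pkts/s\n", pps(10, 1518));
            printf("10 GbE,   64-byte frames: %.0f pkts/s\n", pps(10, 64));
            printf("100 GbE,  64-byte frames: %.0f pkts/s\n", pps(100, 64));
            return 0;
    }

Even at full-size frames a 10 GbE port is near a million packets per
second, and at minimum-size frames it is over ten million, which is why
spreading TX/RX work across CPUs matters.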
* Re: parallel networking
  2007-10-08 22:33 ` David Miller
  2007-10-08 22:35   ` Waskiewicz Jr, Peter P
@ 2007-10-08 23:42   ` jamal
  1 sibling, 0 replies; 8+ messages in thread

From: jamal @ 2007-10-08 23:42 UTC
To: David Miller
Cc: jeff, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
    shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
    netdev, rdreier, mingo, mchan, general, kumarkr, tgraf,
    randy.dunlap, sri, linux-kernel

On Mon, 2007-08-10 at 15:33 -0700, David Miller wrote:

> Multiply whatever effect you think you might be able to measure due to
> that on your 2 or 4 way system up to 64 cpus or so for the machines I
> am using. This is where machines are going, and it is going to become
> the norm.

Yes, I keep forgetting that ;-> I need to train my brain to remember
that.

cheers,
jamal
* Re: parallel networking
  2007-10-08 21:11 ` parallel networking David Miller
  2007-10-08 22:30   ` jamal
@ 2007-10-09  1:53   ` Jeff Garzik
  1 sibling, 0 replies; 8+ messages in thread

From: Jeff Garzik @ 2007-10-09 1:53 UTC
To: David Miller
Cc: hadi, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
    shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
    netdev, rdreier, mingo, mchan, general, kumarkr, tgraf,
    randy.dunlap, sri, linux-kernel

David Miller wrote:
> From: Jeff Garzik <jeff@garzik.org>
> Date: Mon, 08 Oct 2007 10:22:28 -0400
>
>> In terms of overall parallelization, both for TX as well as RX, my gut
>> feeling is that we want to move towards an MSI-X, multi-core friendly
>> model where packets are LIKELY to be sent and received by the same set
>> of [cpus | cores | packages | nodes] as the [userland] processes
>> dealing with the data.
>
> The problem is that the packet schedulers want global guarantees on
> packet ordering, not flow-centric ones.
>
> That is the issue Jamal is concerned about.

Oh, absolutely.

I think, fundamentally, any amount of cross-flow resource management
done in software is an obstacle to concurrency. That's not a value
judgement, just a statement of fact. "Traffic cops" are intentional
bottlenecks we add to the process, to enable features like priority
flows, filtering, or even simple socket fairness guarantees. Each of
those bottlenecks serves a valid purpose, but at the end of the day,
it's still a bottleneck.

So, improving concurrency may require turning off useful features that
nonetheless hurt concurrency.

> The more I think about it, the more inevitable it seems that we really
> might need multiple qdiscs, one for each TX queue, to pull this full
> parallelization off.
>
> But the semantics of that don't smell so nice either. If the user
> attaches a new qdisc to "ethN", does it go to all the TX queues, or
> what?
>
> All of the traffic shaping technology deals with the device as a unary
> object. It doesn't fit multi-queue at all.

Well, the easy solutions to networking concurrency are

* use virtualization to carve up the machine into chunks
* use multiple net devices

Since new NIC hardware is actively trying to be friendly to
multi-channel/virt scenarios, either of these is reasonably
straightforward given the current state of the Linux net stack. Using
multiple net devices is especially attractive because it works very
well with the existing packet scheduling. Both unfortunately impose a
burden on the developer and admin, to force their apps to distribute
flows across multiple [VMs | net devs].

The third alternative is to use a single net device with SMP-friendly
packet scheduling. Here you run into the problems you described
("device as a unary object", etc.) with the current infrastructure.
With multiple TX rings, consider that we are pushing the packet
scheduling from software to hardware... which implies

* hardware-specific packet scheduling
* some TC/shaping features not being available, because the hardware
  doesn't support them

	Jeff
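The multi-ring direction Jeff sketches still needs one software decision
per packet: which TX queue a flow maps to, so packets within a flow stay
ordered on a single ring while different flows can be pushed from
different CPUs in parallel. Below is an illustrative flow-hash-to-queue
mapping; the hash and names are made up and do not correspond to any
particular driver's or the stack's actual queue selection code.

    #include <stdint.h>
    #include <stdio.h>

    struct flow_key {
            uint32_t saddr, daddr;  /* IPv4 source/destination address */
            uint16_t sport, dport;  /* transport ports */
    };

    /* Any reasonable mixing function works; this one is just for show. */
    static uint32_t flow_hash(const struct flow_key *k)
    {
            uint32_t h = k->saddr ^ k->daddr;

            h ^= ((uint32_t)k->sport << 16) | k->dport;
            h *= 0x9e3779b1u;       /* golden-ratio multiply to spread bits */
            return h >> 16;
    }

    /* Same flow -> same queue (ordering); different flows spread out. */
    static unsigned int select_txq(const struct flow_key *k, unsigned int nqueues)
    {
            return flow_hash(k) % nqueues;
    }

    int main(void)
    {
            struct flow_key k = { 0x0a000001, 0x0a000002, 12345, 80 };

            printf("flow -> TX queue %u of 8\n", select_txq(&k, 8));
            return 0;
    }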
Thread overview: 8+ messages
[not found] <1190674298.4264.24.camel@localhost>
[not found] ` <D5C1322C3E673F459512FB59E0DDC32903A51462@orsmsx414.amr.corp.intel.com>
[not found] ` <1190677099.4264.37.camel@localhost>
[not found] ` <20071007.215124.85709188.davem@davemloft.net>
[not found] ` <1191850490.4352.41.camel@localhost>
2007-10-08 14:22 ` parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock) Jeff Garzik
2007-10-08 15:18 ` jamal
2007-10-08 21:11 ` parallel networking David Miller
2007-10-08 22:30 ` jamal
2007-10-08 22:33 ` David Miller
2007-10-08 22:35 ` Waskiewicz Jr, Peter P
2007-10-08 23:42 ` jamal
2007-10-09 1:53 ` Jeff Garzik