[PATCH 0/2]: Remote softirq invocation infrastructure.

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2]: Remote softirq invocation infrastructure.
@ 2008-09-20  6:48 David Miller
  2008-09-20 15:29 ` Daniel Walker
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: David Miller @ 2008-09-20  6:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, jens.axboe, steffen.klassert

Jens Axboe has written some hacks for the block layer that allow
queueing softirq work to remote cpus.  In the context of the block
layer he used this facility to trigger the softirq block I/O
completion on the same cpu where the I/O was submitted.

I want to make use of a similar facility for the networking, so we
should make this thing generic.

It depends upon the generic SMP call function infrastructure, which
Jens wrote specifically to do these remote softirq hacks.

For each softirq there is a per-cpu list head which is where the work
is queued up.

If the platform doesn't support the generic SMP call function bits,
the work is queued onto the local cpu.

The first patch adds a NR_SOFTIRQS so that we can size these arrays
by the actual number of softirqs instead of the magic number "32"
which is what is used now.

The second patch adds the infrastructure and provides intefaces to
invoke softirqs on remove cpus.

Jen's, as stated, has block layer uses for this.  I intend to use this
for receive side flow seperation on non-multiqueue network cards.  And
Steffen Klassert has a set of IPSEC parallelization changes that can
very likely make use of this.

These patches are against current 2.6.27-rcX

I would suggest that if nobody has any problems with this, we put it
into a GIT tree on kernel.org and any subsystem that wants to use it
can just pull that tree into their GIT tree.  This way, it doesn't
matter which tree Linus pulls in first, he'll get this stuff properly
regardless of ordering.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20  6:48 [PATCH 0/2]: Remote softirq invocation infrastructure David Miller
@ 2008-09-20 15:29 ` Daniel Walker
  2008-09-20 15:45   ` Arjan van de Ven
  2008-09-22 21:22 ` Chris Friesen
  2008-09-24  7:42 ` David Miller
  2 siblings, 1 reply; 26+ messages in thread
From: Daniel Walker @ 2008-09-20 15:29 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, netdev, jens.axboe, steffen.klassert

On Fri, 2008-09-19 at 23:48 -0700, David Miller wrote:
> Jens Axboe has written some hacks for the block layer that allow
> queueing softirq work to remote cpus.  In the context of the block
> layer he used this facility to trigger the softirq block I/O
> completion on the same cpu where the I/O was submitted.
> 

> Jen's, as stated, has block layer uses for this.  I intend to use this
> for receive side flow seperation on non-multiqueue network cards.  And
> Steffen Klassert has a set of IPSEC parallelization changes that can
> very likely make use of this.

What's the benefit that you (or Jens) sees from migrating softirqs from
specific cpu's to others?

Daniel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20 15:29 ` Daniel Walker
@ 2008-09-20 15:45   ` Arjan van de Ven
  2008-09-20 16:02     ` Daniel Walker
  0 siblings, 1 reply; 26+ messages in thread
From: Arjan van de Ven @ 2008-09-20 15:45 UTC (permalink / raw)
  To: Daniel Walker
  Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert

On Sat, 20 Sep 2008 08:29:21 -0700
> 
> > Jen's, as stated, has block layer uses for this.  I intend to use
> > this for receive side flow seperation on non-multiqueue network
> > cards.  And Steffen Klassert has a set of IPSEC parallelization
> > changes that can very likely make use of this.
> 
> What's the benefit that you (or Jens) sees from migrating softirqs
> from specific cpu's to others?

it means you do all the processing on the CPU that submitted the IO in
the first place, and likely still has the various metadata pieces in
its CPU cache (or at least you know you won't need to bounce them over)

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20 15:45   ` Arjan van de Ven
@ 2008-09-20 16:02     ` Daniel Walker
  2008-09-20 16:19       ` Arjan van de Ven
  2008-09-20 20:00       ` David Miller
  0 siblings, 2 replies; 26+ messages in thread
From: Daniel Walker @ 2008-09-20 16:02 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert

On Sat, 2008-09-20 at 08:45 -0700, Arjan van de Ven wrote:
> On Sat, 20 Sep 2008 08:29:21 -0700
> > 
> > > Jen's, as stated, has block layer uses for this.  I intend to use
> > > this for receive side flow seperation on non-multiqueue network
> > > cards.  And Steffen Klassert has a set of IPSEC parallelization
> > > changes that can very likely make use of this.
> > 
> > What's the benefit that you (or Jens) sees from migrating softirqs
> > from specific cpu's to others?
> 
> it means you do all the processing on the CPU that submitted the IO in
> the first place, and likely still has the various metadata pieces in
> its CPU cache (or at least you know you won't need to bounce them over)


In the case of networking and block I would think a lot of the softirq
activity is asserted from userspace.. Maybe the scheduler shouldn't be
migrating these tasks, or could take this softirq activity into
account ..

Daniel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20 16:02     ` Daniel Walker
@ 2008-09-20 16:19       ` Arjan van de Ven
  2008-09-20 17:40         ` Daniel Walker
  2008-09-20 20:00       ` David Miller
  1 sibling, 1 reply; 26+ messages in thread
From: Arjan van de Ven @ 2008-09-20 16:19 UTC (permalink / raw)
  To: Daniel Walker
  Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert

On Sat, 20 Sep 2008 09:02:09 -0700
Daniel Walker <dwalker@mvista.com> wrote:

> On Sat, 2008-09-20 at 08:45 -0700, Arjan van de Ven wrote:
> > On Sat, 20 Sep 2008 08:29:21 -0700
> > > 
> > > > Jen's, as stated, has block layer uses for this.  I intend to
> > > > use this for receive side flow seperation on non-multiqueue
> > > > network cards.  And Steffen Klassert has a set of IPSEC
> > > > parallelization changes that can very likely make use of this.
> > > 
> > > What's the benefit that you (or Jens) sees from migrating softirqs
> > > from specific cpu's to others?
> > 
> > it means you do all the processing on the CPU that submitted the IO
> > in the first place, and likely still has the various metadata
> > pieces in its CPU cache (or at least you know you won't need to
> > bounce them over)
> 
> 
> In the case of networking and block I would think a lot of the softirq
> activity is asserted from userspace.. Maybe the scheduler shouldn't be
> migrating these tasks, or could take this softirq activity into
> account ..

well a lot of it comes from completion interrupts.

and moving userspace isn't a good option; think of the case of 1 nic
but 4 apache processes doing the work...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20 16:19       ` Arjan van de Ven
@ 2008-09-20 17:40         ` Daniel Walker
  2008-09-20 18:09           ` Arjan van de Ven
  2008-09-20 19:59           ` David Miller
  0 siblings, 2 replies; 26+ messages in thread
From: Daniel Walker @ 2008-09-20 17:40 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert

On Sat, 2008-09-20 at 09:19 -0700, Arjan van de Ven wrote:
> On Sat, 20 Sep 2008 09:02:09 -0700
> Daniel Walker <dwalker@mvista.com> wrote:
> 
> > On Sat, 2008-09-20 at 08:45 -0700, Arjan van de Ven wrote:
> > > On Sat, 20 Sep 2008 08:29:21 -0700
> > > > 
> > > > > Jen's, as stated, has block layer uses for this.  I intend to
> > > > > use this for receive side flow seperation on non-multiqueue
> > > > > network cards.  And Steffen Klassert has a set of IPSEC
> > > > > parallelization changes that can very likely make use of this.
> > > > 
> > > > What's the benefit that you (or Jens) sees from migrating softirqs
> > > > from specific cpu's to others?
> > > 
> > > it means you do all the processing on the CPU that submitted the IO
> > > in the first place, and likely still has the various metadata
> > > pieces in its CPU cache (or at least you know you won't need to
> > > bounce them over)
> > 
> > 
> > In the case of networking and block I would think a lot of the softirq
> > activity is asserted from userspace.. Maybe the scheduler shouldn't be
> > migrating these tasks, or could take this softirq activity into
> > account ..
> 
> well a lot of it comes from completion interrupts.

Yeah, partly I would think.

> and moving userspace isn't a good option; think of the case of 1 nic
> but 4 apache processes doing the work...
> 

One nic, so one interrupt ? I guess we're talking about an SMP machine? 
It seems case dependent .. If you send a lot, or receive a lot.. BUT
it's all speculation on my part..

Dave didn't supply the users of his code, or what kind of improvement
was seen, or the case in which it would be needed. I think Dave knowns
his subsystem, but the code on the surface looks like an end run around
some other problem area..

Daniel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20 17:40         ` Daniel Walker
@ 2008-09-20 18:09           ` Arjan van de Ven
  2008-09-20 18:52             ` Daniel Walker
  2008-09-20 19:59           ` David Miller
  1 sibling, 1 reply; 26+ messages in thread
From: Arjan van de Ven @ 2008-09-20 18:09 UTC (permalink / raw)
  To: Daniel Walker
  Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert

On Sat, 20 Sep 2008 10:40:04 -0700
Daniel Walker <dwalker@mvista.com> wrote:

> On Sat, 2008-09-20 at 09:19 -0700, Arjan van de Ven wrote:
> > On Sat, 20 Sep 2008 09:02:09 -0700
> > Daniel Walker <dwalker@mvista.com> wrote:
> > 
> > > On Sat, 2008-09-20 at 08:45 -0700, Arjan van de Ven wrote:
> > > > On Sat, 20 Sep 2008 08:29:21 -0700
> > > > > 
> > > > > > Jen's, as stated, has block layer uses for this.  I intend
> > > > > > to use this for receive side flow seperation on
> > > > > > non-multiqueue network cards.  And Steffen Klassert has a
> > > > > > set of IPSEC parallelization changes that can very likely
> > > > > > make use of this.
> > > > > 
> > > > > What's the benefit that you (or Jens) sees from migrating
> > > > > softirqs from specific cpu's to others?
> > > > 
> > > > it means you do all the processing on the CPU that submitted
> > > > the IO in the first place, and likely still has the various
> > > > metadata pieces in its CPU cache (or at least you know you
> > > > won't need to bounce them over)
> > > 
> > > 
> > > In the case of networking and block I would think a lot of the
> > > softirq activity is asserted from userspace.. Maybe the scheduler
> > > shouldn't be migrating these tasks, or could take this softirq
> > > activity into account ..
> > 
> > well a lot of it comes from completion interrupts.
> 
> Yeah, partly I would think.

completions trigger the next send as well (for both block and net) so
it's quite common
> 
> > and moving userspace isn't a good option; think of the case of 1 nic
> > but 4 apache processes doing the work...
> > 
> 
> One nic, so one interrupt ? I guess we're talking about an SMP
> machine?

or multicore

doing IPI's for this on an UP machine is a rather boring exercise

> 
> Dave didn't supply the users of his code, or what kind of improvement
> was seen, or the case in which it would be needed. I think Dave knowns
> his subsystem, but the code on the surface looks like an end run
> around some other problem area..

it's very fundamental, and has been talked about at various conferences
as well.

the basic problem is that the submitter of the IO (be it block or net)
creates a ton of metadata state on submit, and ideally the completion
processing happens on the same CPU, for two reasons
1) to use the state in the cache
2) for the case where you touch userland data/structures, we assume the
scheduler kept affinity

it's a Moses-to-the-Mountain problem, except we have four Moses' but
only one Mountain. 

Or in CS terms: we move the work to the CPU where the userland is
rather than moving the userland to the IRQ CPU, since there is usually
only one IRQ but many userlands and many cpu cores.

(for the UP case this is all very irrelevant obviously)

I assume Dave will pipe in if he disagrees with me ;-)


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20 18:09           ` Arjan van de Ven
@ 2008-09-20 18:52             ` Daniel Walker
  2008-09-20 20:04               ` David Miller
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Walker @ 2008-09-20 18:52 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert

On Sat, 2008-09-20 at 11:09 -0700, Arjan van de Ven wrote:


> > 
> > Dave didn't supply the users of his code, or what kind of improvement
> > was seen, or the case in which it would be needed. I think Dave knowns
> > his subsystem, but the code on the surface looks like an end run
> > around some other problem area..
> 
> it's very fundamental, and has been talked about at various conferences
> as well.

At least you understand that not everyone goes to conferences..

> the basic problem is that the submitter of the IO (be it block or net)
> creates a ton of metadata state on submit, and ideally the completion
> processing happens on the same CPU, for two reasons
> 1) to use the state in the cache
> 2) for the case where you touch userland data/structures, we assume the
> scheduler kept affinity
> 
> it's a Moses-to-the-Mountain problem, except we have four Moses' but
> only one Mountain. 
> 
> Or in CS terms: we move the work to the CPU where the userland is
> rather than moving the userland to the IRQ CPU, since there is usually
> only one IRQ but many userlands and many cpu cores.

There must be some kind of trade off here .. There's a fairly good
performance gain from have the softirq asserted and run on the same cpu
since it runs in interrupt context right after the interrupt.

If you move the softirq to another cpu then you have to re-assert and
either wait for ksoftirqd to handle it or wait for an interrupt on the
new cpu .. Neither is very predictable..

All that vs. bouncing data around the caches.. To what degree has all
that been handled or thought about?

Daniel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20 18:52             ` Daniel Walker
@ 2008-09-20 20:04               ` David Miller
  0 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2008-09-20 20:04 UTC (permalink / raw)
  To: dwalker; +Cc: arjan, linux-kernel, netdev, jens.axboe, steffen.klassert

From: Daniel Walker <dwalker@mvista.com>
Date: Sat, 20 Sep 2008 11:52:31 -0700

> All that vs. bouncing data around the caches.. To what degree has all
> that been handled or thought about?

Give concrete things to discuss or just be quiet.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20 17:40         ` Daniel Walker
  2008-09-20 18:09           ` Arjan van de Ven
@ 2008-09-20 19:59           ` David Miller
  2008-09-21  6:05             ` Herbert Xu
  1 sibling, 1 reply; 26+ messages in thread
From: David Miller @ 2008-09-20 19:59 UTC (permalink / raw)
  To: dwalker; +Cc: arjan, linux-kernel, netdev, jens.axboe, steffen.klassert

From: Daniel Walker <dwalker@mvista.com>
Date: Sat, 20 Sep 2008 10:40:04 -0700

> Dave didn't supply the users of his code, or what kind of improvement
> was seen, or the case in which it would be needed. I think Dave knowns
> his subsystem, but the code on the surface looks like an end run around
> some other problem area..

I posted an example use case on netdev a few days ago, and
the block layer example is in Jen's block layer tree.

It's for networking cards that don't do flow seperation on
receive using multiple RX queues and MSI-X interrupts.  It's
also for things like IPSEC where the per-packet cpu usage
is so huge (to do the crypto) that it makes sense to even
split up the work to multiple cpus within the same flow.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20 19:59           ` David Miller
@ 2008-09-21  6:05             ` Herbert Xu
  2008-09-21  6:57               ` David Miller
                                 ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Herbert Xu @ 2008-09-21  6:05 UTC (permalink / raw)
  To: David Miller
  Cc: dwalker, arjan, linux-kernel, netdev, jens.axboe,
	steffen.klassert

David Miller <davem@davemloft.net> wrote:
>
> receive using multiple RX queues and MSI-X interrupts.  It's
> also for things like IPSEC where the per-packet cpu usage
> is so huge (to do the crypto) that it makes sense to even
> split up the work to multiple cpus within the same flow.

Unfortunately doing this with IPsec is going to be non-trivial
since we still want to maintain packet ordering inside IPsec
and you don't get the inner flow information until you decrypt
the packet.

So if we want to process IPsec packets in parallel it's best to
implement that from within the crypto API where we can queue the
result in order to ensure proper ordering.

Of course, we need to balance any effort spent on this with the
likelihood that hardware improvements will soon make this obsolete
(for IPsec anyway).

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-21  6:05             ` Herbert Xu
@ 2008-09-21  6:57               ` David Miller
  2008-09-22 10:36                 ` Ilpo Järvinen
  2008-09-21  9:13               ` James Courtier-Dutton
  2008-09-21  9:46               ` Steffen Klassert
  2 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2008-09-21  6:57 UTC (permalink / raw)
  To: herbert; +Cc: dwalker, arjan, linux-kernel, netdev, jens.axboe,
	steffen.klassert

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 21 Sep 2008 15:05:45 +0900

> Unfortunately doing this with IPsec is going to be non-trivial
> since we still want to maintain packet ordering inside IPsec
> and you don't get the inner flow information until you decrypt
> the packet.

Steffen has mechanisms by which to deal with this in his patches.

> So if we want to process IPsec packets in parallel it's best to
> implement that from within the crypto API where we can queue the
> result in order to ensure proper ordering.

That's another option, of course.  And crypto could use remote
softirqs even for that :-)

> Of course, we need to balance any effort spent on this with the
> likelihood that hardware improvements will soon make this obsolete
> (for IPsec anyway).

True, but old hardware will always exist.

A lot of very reasonable machines out there will benefit from software
RX flow seperation.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-21  6:57               ` David Miller
@ 2008-09-22 10:36                 ` Ilpo Järvinen
  2008-09-24  4:54                   ` Herbert Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Ilpo Järvinen @ 2008-09-22 10:36 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, dwalker, arjan, LKML, Netdev, jens.axboe,
	steffen.klassert

On Sat, 20 Sep 2008, David Miller wrote:

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Sun, 21 Sep 2008 15:05:45 +0900
> 
> > Of course, we need to balance any effort spent on this with the
> > likelihood that hardware improvements will soon make this obsolete
> > (for IPsec anyway).
> 
> True, but old hardware will always exist.

...Also, producing buggy hardware will not suddently just vanish either 
("can you please turn of ipsec offloading and see if you can still
reproduce" :-))...

-- 
 i.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-22 10:36                 ` Ilpo Järvinen
@ 2008-09-24  4:54                   ` Herbert Xu
  0 siblings, 0 replies; 26+ messages in thread
From: Herbert Xu @ 2008-09-24  4:54 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David Miller, dwalker, arjan, LKML, Netdev, jens.axboe,
	steffen.klassert

On Mon, Sep 22, 2008 at 01:36:24PM +0300, Ilpo Järvinen wrote:
> On Sat, 20 Sep 2008, David Miller wrote:
> 
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Sun, 21 Sep 2008 15:05:45 +0900
> > 
> > > Of course, we need to balance any effort spent on this with the
> > > likelihood that hardware improvements will soon make this obsolete
> > > (for IPsec anyway).
> > 
> > True, but old hardware will always exist.
> 
> ...Also, producing buggy hardware will not suddently just vanish either 
> ("can you please turn of ipsec offloading and see if you can still
> reproduce" :-))...

That's fine.  If your AES hardware is buggy you just fall back
to using the software version.

In fact if this was done through the crypto API it would even
happen automatically.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-21  6:05             ` Herbert Xu
  2008-09-21  6:57               ` David Miller
@ 2008-09-21  9:13               ` James Courtier-Dutton
  2008-09-21  9:17                 ` David Miller
  2008-09-21  9:46               ` Steffen Klassert
  2 siblings, 1 reply; 26+ messages in thread
From: James Courtier-Dutton @ 2008-09-21  9:13 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, dwalker, arjan, linux-kernel, netdev, jens.axboe,
	steffen.klassert

Herbert Xu wrote:
> David Miller <davem@davemloft.net> wrote:
>> receive using multiple RX queues and MSI-X interrupts.  It's
>> also for things like IPSEC where the per-packet cpu usage
>> is so huge (to do the crypto) that it makes sense to even
>> split up the work to multiple cpus within the same flow.
> 
> Unfortunately doing this with IPsec is going to be non-trivial
> since we still want to maintain packet ordering inside IPsec
> and you don't get the inner flow information until you decrypt
> the packet.
> 

Why do you have to preserve packet ordering?
TCP/IP does not preserve packet ordering across the network.
IPSEC uses a sliding window for anti-relay detection precisely because
it has to be able to handle out-of-order packets.

Sharing the sliding window between CPUs might be interesting!

James

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-21  9:13               ` James Courtier-Dutton
@ 2008-09-21  9:17                 ` David Miller
  0 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2008-09-21  9:17 UTC (permalink / raw)
  To: James
  Cc: herbert, dwalker, arjan, linux-kernel, netdev, jens.axboe,
	steffen.klassert

From: James Courtier-Dutton <James@superbug.co.uk>
Date: Sun, 21 Sep 2008 10:13:47 +0100

> Why do you have to preserve packet ordering?
> TCP/IP does not preserve packet ordering across the network.

Yes, but we should preserve per-flow ordering as much as
possible within the local system for optimal performance.

Things fall apart completely, even with TCP, once you reorder more
than 2 or 3 packets deep.

> Sharing the sliding window between CPUs might be interesting!

Again, Steffen's patches take care of this issue.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-21  6:05             ` Herbert Xu
  2008-09-21  6:57               ` David Miller
  2008-09-21  9:13               ` James Courtier-Dutton
@ 2008-09-21  9:46               ` Steffen Klassert
  2008-09-22  8:23                 ` Herbert Xu
  2 siblings, 1 reply; 26+ messages in thread
From: Steffen Klassert @ 2008-09-21  9:46 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, dwalker, arjan, linux-kernel, netdev, jens.axboe

On Sun, Sep 21, 2008 at 03:05:45PM +0900, Herbert Xu wrote:
> David Miller <davem@davemloft.net> wrote:
> >
> > receive using multiple RX queues and MSI-X interrupts.  It's
> > also for things like IPSEC where the per-packet cpu usage
> > is so huge (to do the crypto) that it makes sense to even
> > split up the work to multiple cpus within the same flow.
> 
> Unfortunately doing this with IPsec is going to be non-trivial
> since we still want to maintain packet ordering inside IPsec
> and you don't get the inner flow information until you decrypt
> the packet.

It's non-trivial but possible. I have a test implementation that
runs the whole IP layer in parallel. The basic idea to keep track
of the packet ordering is to give the packets sequence numbers
befor we run in parallel. Befor we push the packets to the upper
layers or to the neighboring subsystem I have a mechanism that
brings them back to the right order. 

With my test environment (two quad core boxes) I get with IPSEC
aes192-sha1 and one tcp stream a throughput of about 600 Mbit/s
compared to about 200 Mbit/s without the parallel processing. 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-21  9:46               ` Steffen Klassert
@ 2008-09-22  8:23                 ` Herbert Xu
  2008-09-22 13:54                   ` Steffen Klassert
  0 siblings, 1 reply; 26+ messages in thread
From: Herbert Xu @ 2008-09-22  8:23 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David Miller, dwalker, arjan, linux-kernel, netdev, jens.axboe

On Sun, Sep 21, 2008 at 11:46:28AM +0200, Steffen Klassert wrote:
>
> It's non-trivial but possible. I have a test implementation that
> runs the whole IP layer in parallel. The basic idea to keep track
> of the packet ordering is to give the packets sequence numbers
> befor we run in parallel. Befor we push the packets to the upper
> layers or to the neighboring subsystem I have a mechanism that
> brings them back to the right order. 

Yes that should do the trick.

> With my test environment (two quad core boxes) I get with IPSEC
> aes192-sha1 and one tcp stream a throughput of about 600 Mbit/s
> compared to about 200 Mbit/s without the parallel processing. 

Yes this would definitely help IPsec.  However, I'm not so sure
of its benefit to routing and other parts of networking.  That's
why I'd rather have this sort of hack stay in the crypto system
where it's isolated rather than having it proliferate throughout
the network stack.

When the time comes to weed out this because all CPUs that matter
have encryption in hardware then it'll be much easier to delete a
crypto algorithm as opposed to removing parts of the network
infrastructure :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-22  8:23                 ` Herbert Xu
@ 2008-09-22 13:54                   ` Steffen Klassert
  0 siblings, 0 replies; 26+ messages in thread
From: Steffen Klassert @ 2008-09-22 13:54 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, dwalker, arjan, linux-kernel, netdev, jens.axboe

On Mon, Sep 22, 2008 at 04:23:09PM +0800, Herbert Xu wrote:
> 
> > With my test environment (two quad core boxes) I get with IPSEC
> > aes192-sha1 and one tcp stream a throughput of about 600 Mbit/s
> > compared to about 200 Mbit/s without the parallel processing. 
> 
> Yes this would definitely help IPsec.  However, I'm not so sure
> of its benefit to routing and other parts of networking.  That's
> why I'd rather have this sort of hack stay in the crypto system
> where it's isolated rather than having it proliferate throughout
> the network stack.

The crypto benefits the most of course, but routing and xfrm lookups
could benefit on bigger networks too. However, the method to bring
the packets back to order is quite generic and could be used
even in the crypto system. The important thing for me is that we
can run in parallel even if we have just one flow.

> 
> When the time comes to weed out this because all CPUs that matter
> have encryption in hardware then it'll be much easier to delete a
> crypto algorithm as opposed to removing parts of the network
> infrastructure :)
> 

Yes, if you think about how to remove it I agree here.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20 16:02     ` Daniel Walker
  2008-09-20 16:19       ` Arjan van de Ven
@ 2008-09-20 20:00       ` David Miller
  1 sibling, 0 replies; 26+ messages in thread
From: David Miller @ 2008-09-20 20:00 UTC (permalink / raw)
  To: dwalker; +Cc: arjan, linux-kernel, netdev, jens.axboe, steffen.klassert

From: Daniel Walker <dwalker@mvista.com>
Date: Sat, 20 Sep 2008 09:02:09 -0700

> In the case of networking and block I would think a lot of the softirq
> activity is asserted from userspace.. Maybe the scheduler shouldn't be
> migrating these tasks, or could take this softirq activity into
> account ..

Absolutely wrong.

On a per-flow basis you want to push the work down as far
as possible down to individual cpus.  Why do you think the
hardware folks are devoting silicon to RX multiqueue facilities
that spread the RX work amongst available cpus using MSI-X?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20  6:48 [PATCH 0/2]: Remote softirq invocation infrastructure David Miller
  2008-09-20 15:29 ` Daniel Walker
@ 2008-09-22 21:22 ` Chris Friesen
  2008-09-22 22:12   ` David Miller
  2008-09-24  7:42 ` David Miller
  2 siblings, 1 reply; 26+ messages in thread
From: Chris Friesen @ 2008-09-22 21:22 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, netdev, jens.axboe, steffen.klassert

David Miller wrote:
> Jens Axboe has written some hacks for the block layer that allow
> queueing softirq work to remote cpus.  In the context of the block
> layer he used this facility to trigger the softirq block I/O
> completion on the same cpu where the I/O was submitted.

<snip>

> I intend to use this
> for receive side flow seperation on non-multiqueue network cards.

I'm not sure this belongs in this particular thread but I was
interested in how you're planning on doing this?  Is there going to be a
way for userspace to specify which traffic flows they'd like to direct
to particular cpus, or will the kernel try to figure it out on the fly?

We have application guys that would like very much to be able to nail 
specific apps to specific cores and have the kernel send all their 
packets to those cores for processing.

Chris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-22 21:22 ` Chris Friesen
@ 2008-09-22 22:12   ` David Miller
  2008-09-23 17:03     ` Chris Friesen
  0 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2008-09-22 22:12 UTC (permalink / raw)
  To: cfriesen; +Cc: linux-kernel, netdev, jens.axboe, steffen.klassert

From: "Chris Friesen" <cfriesen@nortel.com>
Date: Mon, 22 Sep 2008 15:22:36 -0600

> I'm not sure this belongs in this particular thread but I was
> interested in how you're planning on doing this?

Something like this patch which I posted last week on
netdev.

net: Do software flow seperation on receive.

Push netif_receive_skb() work to remote cpus via flow
hashing and remove softirqs.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/interrupt.h |    1 +
 include/linux/netdevice.h |    2 -
 include/linux/skbuff.h    |    3 +
 net/core/dev.c            |  273 +++++++++++++++++++++++++--------------------
 4 files changed, 157 insertions(+), 122 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 806b38f..223e68f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -247,6 +247,7 @@ enum
 	TIMER_SOFTIRQ,
 	NET_TX_SOFTIRQ,
 	NET_RX_SOFTIRQ,
+	NET_RECEIVE_SOFTIRQ,
 	BLOCK_SOFTIRQ,
 	TASKLET_SOFTIRQ,
 	SCHED_SOFTIRQ,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 488c56e..a044caa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -965,11 +965,9 @@ static inline int unregister_gifconf(unsigned int family)
 struct softnet_data
 {
 	struct Qdisc		*output_queue;
-	struct sk_buff_head	input_pkt_queue;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
-	struct napi_struct	backlog;
 #ifdef CONFIG_NET_DMA
 	struct dma_chan		*net_dma;
 #endif
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9099237..e36bc86 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -18,6 +18,7 @@
 #include <linux/compiler.h>
 #include <linux/time.h>
 #include <linux/cache.h>
+#include <linux/smp.h>
 
 #include <asm/atomic.h>
 #include <asm/types.h>
@@ -255,6 +256,8 @@ struct sk_buff {
 	struct sk_buff		*next;
 	struct sk_buff		*prev;
 
+	struct call_single_data	csd;
+
 	struct sock		*sk;
 	ktime_t			tstamp;
 	struct net_device	*dev;
diff --git a/net/core/dev.c b/net/core/dev.c
index e719ed2..09827c7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1660,8 +1660,8 @@ out_kfree_skb:
 	return 0;
 }
 
-static u32 simple_tx_hashrnd;
-static int simple_tx_hashrnd_initialized = 0;
+static u32 simple_hashrnd;
+static int simple_hashrnd_initialized = 0;
 
 static u16 simple_tx_hash(struct net_device *dev, struct sk_buff *skb)
 {
@@ -1669,9 +1669,9 @@ static u16 simple_tx_hash(struct net_device *dev, struct sk_buff *skb)
 	u32 hash, ihl;
 	u8 ip_proto;
 
-	if (unlikely(!simple_tx_hashrnd_initialized)) {
-		get_random_bytes(&simple_tx_hashrnd, 4);
-		simple_tx_hashrnd_initialized = 1;
+	if (unlikely(!simple_hashrnd_initialized)) {
+		get_random_bytes(&simple_hashrnd, 4);
+		simple_hashrnd_initialized = 1;
 	}
 
 	switch (skb->protocol) {
@@ -1708,7 +1708,7 @@ static u16 simple_tx_hash(struct net_device *dev, struct sk_buff *skb)
 		break;
 	}
 
-	hash = jhash_3words(addr1, addr2, ports, simple_tx_hashrnd);
+	hash = jhash_3words(addr1, addr2, ports, simple_hashrnd);
 
 	return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
 }
@@ -1878,75 +1878,6 @@ int weight_p __read_mostly = 64;            /* old backlog weight */
 DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
 
-/**
- *	netif_rx	-	post buffer to the network code
- *	@skb: buffer to post
- *
- *	This function receives a packet from a device driver and queues it for
- *	the upper (protocol) levels to process.  It always succeeds. The buffer
- *	may be dropped during processing for congestion control or by the
- *	protocol layers.
- *
- *	return values:
- *	NET_RX_SUCCESS	(no congestion)
- *	NET_RX_DROP     (packet was dropped)
- *
- */
-
-int netif_rx(struct sk_buff *skb)
-{
-	struct softnet_data *queue;
-	unsigned long flags;
-
-	/* if netpoll wants it, pretend we never saw it */
-	if (netpoll_rx(skb))
-		return NET_RX_DROP;
-
-	if (!skb->tstamp.tv64)
-		net_timestamp(skb);
-
-	/*
-	 * The code is rearranged so that the path is the most
-	 * short when CPU is congested, but is still operating.
-	 */
-	local_irq_save(flags);
-	queue = &__get_cpu_var(softnet_data);
-
-	__get_cpu_var(netdev_rx_stat).total++;
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
-enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
-			return NET_RX_SUCCESS;
-		}
-
-		napi_schedule(&queue->backlog);
-		goto enqueue;
-	}
-
-	__get_cpu_var(netdev_rx_stat).dropped++;
-	local_irq_restore(flags);
-
-	kfree_skb(skb);
-	return NET_RX_DROP;
-}
-
-int netif_rx_ni(struct sk_buff *skb)
-{
-	int err;
-
-	preempt_disable();
-	err = netif_rx(skb);
-	if (local_softirq_pending())
-		do_softirq();
-	preempt_enable();
-
-	return err;
-}
-
-EXPORT_SYMBOL(netif_rx_ni);
-
 static void net_tx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = &__get_cpu_var(softnet_data);
@@ -2177,7 +2108,7 @@ void netif_nit_deliver(struct sk_buff *skb)
  *	NET_RX_SUCCESS: no congestion
  *	NET_RX_DROP: packet was dropped
  */
-int netif_receive_skb(struct sk_buff *skb)
+static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
 	struct net_device *orig_dev;
@@ -2185,10 +2116,6 @@ int netif_receive_skb(struct sk_buff *skb)
 	int ret = NET_RX_DROP;
 	__be16 type;
 
-	/* if we've gotten here through NAPI, check netpoll */
-	if (netpoll_receive_skb(skb))
-		return NET_RX_DROP;
-
 	if (!skb->tstamp.tv64)
 		net_timestamp(skb);
 
@@ -2275,45 +2202,152 @@ out:
 	return ret;
 }
 
-/* Network device is going away, flush any packets still pending  */
-static void flush_backlog(void *arg)
+static void net_receive_action(struct softirq_action *h)
 {
-	struct net_device *dev = arg;
-	struct softnet_data *queue = &__get_cpu_var(softnet_data);
-	struct sk_buff *skb, *tmp;
+	struct list_head *cpu_list, local_list;
 
-	skb_queue_walk_safe(&queue->input_pkt_queue, skb, tmp)
-		if (skb->dev == dev) {
-			__skb_unlink(skb, &queue->input_pkt_queue);
-			kfree_skb(skb);
-		}
+	local_irq_disable();
+	cpu_list = &__get_cpu_var(softirq_work_list[NET_RECEIVE_SOFTIRQ]);
+	list_replace_init(cpu_list, &local_list);
+	local_irq_enable();
+
+	while (!list_empty(&local_list)) {
+		struct sk_buff *skb;
+
+		skb = list_entry(local_list.next, struct sk_buff, csd.list);
+		list_del_init(&skb->csd.list);
+		__netif_receive_skb(skb);
+	}
 }
 
-static int process_backlog(struct napi_struct *napi, int quota)
+static u16 *rxflow_cpu_map;
+static int rxflow_num_cpus;
+
+/* skb->data points at the network header, but that is the only thing
+ * we can rely upon.
+ */
+static u16 simple_rx_hash(struct sk_buff *skb)
 {
-	int work = 0;
-	struct softnet_data *queue = &__get_cpu_var(softnet_data);
-	unsigned long start_time = jiffies;
+	u32 addr1, addr2, ports;
+	struct ipv6hdr *ip6;
+	struct iphdr *ip;
+	u32 hash, ihl;
+	u8 ip_proto;
 
-	napi->weight = weight_p;
-	do {
-		struct sk_buff *skb;
+	if (unlikely(!simple_hashrnd_initialized)) {
+		get_random_bytes(&simple_hashrnd, 4);
+		simple_hashrnd_initialized = 1;
+	}
 
-		local_irq_disable();
-		skb = __skb_dequeue(&queue->input_pkt_queue);
-		if (!skb) {
-			__napi_complete(napi);
-			local_irq_enable();
-			break;
-		}
-		local_irq_enable();
+	switch (skb->protocol) {
+	case __constant_htons(ETH_P_IP):
+		if (!pskb_may_pull(skb, sizeof(*ip)))
+			return 0;
 
-		netif_receive_skb(skb);
-	} while (++work < quota && jiffies == start_time);
+		ip = (struct iphdr *) skb->data;
+		ip_proto = ip->protocol;
+		addr1 = ip->saddr;
+		addr2 = ip->daddr;
+		ihl = ip->ihl;
+		break;
+	case __constant_htons(ETH_P_IPV6):
+		if (!pskb_may_pull(skb, sizeof(*ip6)))
+			return 0;
+
+		ip6 = (struct ipv6hdr *) skb->data;
+		ip_proto = ip6->nexthdr;
+		addr1 = ip6->saddr.s6_addr32[3];
+		addr2 = ip6->daddr.s6_addr32[3];
+		ihl = (40 >> 2);
+		break;
+	default:
+		return 0;
+	}
+
+	ports = 0;
+	switch (ip_proto) {
+	case IPPROTO_TCP:
+	case IPPROTO_UDP:
+	case IPPROTO_DCCP:
+	case IPPROTO_ESP:
+	case IPPROTO_AH:
+	case IPPROTO_SCTP:
+	case IPPROTO_UDPLITE:
+		if (pskb_may_pull(skb, (ihl * 4) + 4))
+			ports = *((u32 *) (skb->data + (ihl * 4)));
+		break;
 
-	return work;
+	default:
+		break;
+	}
+
+	hash = jhash_3words(addr1, addr2, ports, simple_hashrnd);
+
+	return (u16) (((u64) hash * rxflow_num_cpus) >> 32);
 }
 
+/* Since we are already in softirq context via NAPI, it makes no
+ * sense to reschedule a softirq locally, so we optimize that case.
+ */
+int netif_receive_skb(struct sk_buff *skb)
+{
+	int target_cpu, this_cpu, do_direct;
+	unsigned long flags;
+
+	/* If we've gotten here through NAPI, check netpoll.  This part
+	 * has to be synchronous and not get pushed to remote softirq
+	 * receive packet processing.
+	 */
+	if (netpoll_receive_skb(skb))
+		return NET_RX_DROP;
+
+	target_cpu = rxflow_cpu_map[simple_rx_hash(skb)];
+
+	local_irq_save(flags);
+	this_cpu = smp_processor_id();
+	do_direct = 0;
+	if (target_cpu != this_cpu)
+		__send_remote_softirq(&skb->csd, target_cpu, this_cpu, NET_RECEIVE_SOFTIRQ);
+	else
+		do_direct = 1;
+
+	local_irq_restore(flags);
+
+	if (do_direct)
+		return __netif_receive_skb(skb);
+
+	return NET_RX_SUCCESS;
+}
+
+int netif_rx(struct sk_buff *skb)
+{
+	int target_cpu;
+
+	/* if netpoll wants it, pretend we never saw it */
+	if (netpoll_rx(skb))
+		return NET_RX_DROP;
+
+	target_cpu = rxflow_cpu_map[simple_rx_hash(skb)];
+	send_remote_softirq(&skb->csd, target_cpu, NET_RECEIVE_SOFTIRQ);
+
+	return NET_RX_SUCCESS;
+}
+
+int netif_rx_ni(struct sk_buff *skb)
+{
+	int err;
+
+	preempt_disable();
+	err = netif_rx(skb);
+	if (local_softirq_pending())
+		do_softirq();
+	preempt_enable();
+
+	return err;
+}
+
+EXPORT_SYMBOL(netif_rx_ni);
+
 /**
  * __napi_schedule - schedule for receive
  * @n: entry to schedule
@@ -4182,8 +4216,6 @@ void netdev_run_todo(void)
 
 		dev->reg_state = NETREG_UNREGISTERED;
 
-		on_each_cpu(flush_backlog, dev, 1);
-
 		netdev_wait_allrefs(dev);
 
 		/* paranoia */
@@ -4489,7 +4521,6 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 {
 	struct sk_buff **list_skb;
 	struct Qdisc **list_net;
-	struct sk_buff *skb;
 	unsigned int cpu, oldcpu = (unsigned long)ocpu;
 	struct softnet_data *sd, *oldsd;
 
@@ -4520,10 +4551,6 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	raise_softirq_irqoff(NET_TX_SOFTIRQ);
 	local_irq_enable();
 
-	/* Process offline CPU's input_pkt_queue */
-	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
-		netif_rx(skb);
-
 	return NOTIFY_OK;
 }
 
@@ -4793,7 +4820,7 @@ static struct pernet_operations __net_initdata default_device_ops = {
  */
 static int __init net_dev_init(void)
 {
-	int i, rc = -ENOMEM;
+	int i, index, rc = -ENOMEM;
 
 	BUG_ON(!dev_boot_phase);
 
@@ -4813,6 +4840,15 @@ static int __init net_dev_init(void)
 	if (register_pernet_device(&default_device_ops))
 		goto out;
 
+	rxflow_cpu_map = kzalloc(sizeof(u16) * num_possible_cpus(), GFP_KERNEL);
+	if (!rxflow_cpu_map)
+		goto out;
+	rxflow_num_cpus = num_online_cpus();
+
+	index = 0;
+	for_each_online_cpu(i)
+		rxflow_cpu_map[index++] = i;
+
 	/*
 	 *	Initialise the packet receive queues.
 	 */
@@ -4821,12 +4857,8 @@ static int __init net_dev_init(void)
 		struct softnet_data *queue;
 
 		queue = &per_cpu(softnet_data, i);
-		skb_queue_head_init(&queue->input_pkt_queue);
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);
-
-		queue->backlog.poll = process_backlog;
-		queue->backlog.weight = weight_p;
 	}
 
 	netdev_dma_register();
@@ -4835,6 +4867,7 @@ static int __init net_dev_init(void)
 
 	open_softirq(NET_TX_SOFTIRQ, net_tx_action);
 	open_softirq(NET_RX_SOFTIRQ, net_rx_action);
+	open_softirq(NET_RECEIVE_SOFTIRQ, net_receive_action);
 
 	hotcpu_notifier(dev_cpu_callback, 0);
 	dst_init();
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-22 22:12   ` David Miller
@ 2008-09-23 17:03     ` Chris Friesen
  2008-09-23 21:10       ` Tom Herbert
  2008-09-23 21:51       ` David Miller
  0 siblings, 2 replies; 26+ messages in thread
From: Chris Friesen @ 2008-09-23 17:03 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, netdev, jens.axboe, steffen.klassert

David Miller wrote:
> From: "Chris Friesen" <cfriesen@nortel.com>
> Date: Mon, 22 Sep 2008 15:22:36 -0600
> 
>> I'm not sure this belongs in this particular thread but I was
>> interested in how you're planning on doing this?
> 
> Something like this patch which I posted last week on
> netdev.

That patch basically just picks an arbitrary cpu for each flow.  This 
would spread the load out across cpus, but it doesn't allow any input 
from userspace.

We have a current application where there are 16 cores and 16 threads. 
They would really like to be able to pin one thread to each core and 
tell the kernel what packets they're interested in so that the kernel 
can process those packets on that core to gain the maximum caching 
benefit as well as reduce reordering issues.  In our case the hardware 
supports filtering for multiqueues, so we could pass this information 
down to the hardware to avoid software filtering.

Either way, it requires some way for userspace to indicate interest in a 
particular flow.  Has anyone given any thought to what an API like this 
would look like?

I suppose we could automatically look at bound network sockets owned by 
tasks that are affined to single cpus.  This would simplify userspace 
but would reduce flexibility for things like packet sockets with socket 
filters applied.

Chris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-23 17:03     ` Chris Friesen
@ 2008-09-23 21:10       ` Tom Herbert
  2008-09-23 21:51       ` David Miller
  1 sibling, 0 replies; 26+ messages in thread
From: Tom Herbert @ 2008-09-23 21:10 UTC (permalink / raw)
  To: Chris Friesen
  Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert

>
> That patch basically just picks an arbitrary cpu for each flow.  This would
> spread the load out across cpus, but it doesn't allow any input from
> userspace.
>

We've been running softRSS for a while
(http://marc.info/?l=linux-netdev&m=120475045519940&w=2) which I
believe has very similar functionality to this patch.  From this work
we found some nice ways to improve scaling that might be applicable:

- When routing packets to CPU based on hash, sending to another CPU
sharing L2 or L3 cache is best performance.
- We added a simple functionality to route packets to the CPU on which
the application last did a read for the socket.  This seems to be a
win for cache locality.
- We added a lookup table that maps the Toeplitz hash to the receiving
CPU where the application is running.  This is for those devices that
provide the Toeplitz hash in the receive descriptor.  This is a win
since the CPU receiving the interrupt doesn't need to take any cache
misses on the packet itself.
- In our (preliminary) 10G testing we found that routing packets in
software with the the above trick actually allows higher PPS and
better CPU utilization than using hardware RSS.  Also, using both the
software routing and hardware RSS yields the best results.

Tom

> We have a current application where there are 16 cores and 16 threads. They
> would really like to be able to pin one thread to each core and tell the
> kernel what packets they're interested in so that the kernel can process
> those packets on that core to gain the maximum caching benefit as well as
> reduce reordering issues.  In our case the hardware supports filtering for
> multiqueues, so we could pass this information down to the hardware to avoid
> software filtering.
>
> Either way, it requires some way for userspace to indicate interest in a
> particular flow.  Has anyone given any thought to what an API like this
> would look like?
>
> I suppose we could automatically look at bound network sockets owned by
> tasks that are affined to single cpus.  This would simplify userspace but
> would reduce flexibility for things like packet sockets with socket filters
> applied.
>
> Chris
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-23 17:03     ` Chris Friesen
  2008-09-23 21:10       ` Tom Herbert
@ 2008-09-23 21:51       ` David Miller
  1 sibling, 0 replies; 26+ messages in thread
From: David Miller @ 2008-09-23 21:51 UTC (permalink / raw)
  To: cfriesen; +Cc: linux-kernel, netdev, jens.axboe, steffen.klassert

From: "Chris Friesen" <cfriesen@nortel.com>
Date: Tue, 23 Sep 2008 11:03:48 -0600

> That patch basically just picks an arbitrary cpu for each flow.
> This would spread the load out across cpus, but it doesn't allow any
> input from userspace.

With hardware RX flow seperation, the same exact thing happens.

> We have a current application where there are 16 cores and 16
> threads. They would really like to be able to pin one thread to each
> core and tell the kernel what packets they're interested in so that
> the kernel can process those packets on that core to gain the
> maximum caching benefit as well as reduce reordering issues.  In our
> case the hardware supports filtering for multiqueues, so we could
> pass this information down to the hardware to avoid software
> filtering.
>
> Either way, it requires some way for userspace to indicate interest
> in a particular flow.  Has anyone given any thought to what an API
> like this would look like?

Many cards cannot configure this, but yes we should allow an interface to configure
RX flow seperation preferences, and we do plan on adding that at some point.

It's probably be an ethtool operation of some sort.  We already have a minimalistic
RX flow hashing configuration knob, see ETHTOOL_GRXFH and ETHTOOL_SRXFH.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/2]: Remote softirq invocation infrastructure.
  2008-09-20  6:48 [PATCH 0/2]: Remote softirq invocation infrastructure David Miller
  2008-09-20 15:29 ` Daniel Walker
  2008-09-22 21:22 ` Chris Friesen
@ 2008-09-24  7:42 ` David Miller
  2 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2008-09-24  7:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, jens.axboe, steffen.klassert

From: David Miller <davem@davemloft.net>
Date: Fri, 19 Sep 2008 23:48:24 -0700 (PDT)

> Jens Axboe has written some hacks for the block layer that allow
> queueing softirq work to remote cpus.  In the context of the block
> layer he used this facility to trigger the softirq block I/O
> completion on the same cpu where the I/O was submitted.

As a followup to this, I've refreshed my patches and put them
in a tree cloned from Linus's current GIT tree:

	master.kernel.org:/pub/scm/linux/kernel/git/davem/softirq-2.6.git

I made minor touchups to the second patch, such as adding a few more
descriptive comments, and adding the missing export of the softirq_work
list array.

Updated version below for reference:

softirq: Add support for triggering softirq work on softirqs.

This is basically a genericization of Jens Axboe's block layer
remote softirq changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 include/linux/interrupt.h |   21 +++++++
 include/linux/smp.h       |    4 +-
 kernel/softirq.c          |  129 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 153 insertions(+), 1 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index fdd7b90..0a7a14b 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -11,6 +11,8 @@
 #include <linux/hardirq.h>
 #include <linux/sched.h>
 #include <linux/irqflags.h>
+#include <linux/smp.h>
+#include <linux/percpu.h>
 #include <asm/atomic.h>
 #include <asm/ptrace.h>
 #include <asm/system.h>
@@ -272,6 +274,25 @@ extern void softirq_init(void);
 extern void raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq(unsigned int nr);
 
+/* This is the worklist that queues up per-cpu softirq work.
+ *
+ * send_remote_sendirq() adds work to these lists, and
+ * the softirq handler itself dequeues from them.  The queues
+ * are protected by disabling local cpu interrupts and they must
+ * only be accessed by the local cpu that they are for.
+ */
+DECLARE_PER_CPU(struct list_head [NR_SOFTIRQ], softirq_work_list);
+
+/* Try to send a softirq to a remote cpu.  If this cannot be done, the
+ * work will be queued to the local cpu.
+ */
+extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softirq);
+
+/* Like send_remote_softirq(), but the caller must disable local cpu interrupts
+ * and compute the current cpu, passed in as 'this_cpu'.
+ */
+extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
+				  int this_cpu, int softirq);
 
 /* Tasklets --- multithreaded analogue of BHs.
 
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 66484d4..2e4d58b 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -7,6 +7,7 @@
  */
 
 #include <linux/errno.h>
+#include <linux/types.h>
 #include <linux/list.h>
 #include <linux/cpumask.h>
 
@@ -16,7 +17,8 @@ struct call_single_data {
 	struct list_head list;
 	void (*func) (void *info);
 	void *info;
-	unsigned int flags;
+	u16 flags;
+	u16 priv;
 };
 
 #ifdef CONFIG_SMP
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 27642a2..77aba5e 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -6,6 +6,8 @@
  *	Distribute under GPLv2.
  *
  *	Rewritten. Old one was good in 2.2, but in 2.3 it was immoral. --ANK (990903)
+ *
+ *	Remote softirq infrastructure is by Jens Axboe.
  */
 
 #include <linux/module.h>
@@ -463,17 +465,144 @@ void tasklet_kill(struct tasklet_struct *t)
 
 EXPORT_SYMBOL(tasklet_kill);
 
+DEFINE_PER_CPU(struct list_head [NR_SOFTIRQ], softirq_work_list);
+EXPORT_PER_CPU_SYMBOL(softirq_work_list);
+
+static void __local_trigger(struct call_single_data *cp, int softirq)
+{
+	struct list_head *head = &__get_cpu_var(softirq_work_list[softirq]);
+
+	list_add_tail(&cp->list, head);
+
+	/* Trigger the softirq only if the list was previously empty.  */
+	if (head->next == &cp->list)
+		raise_softirq_irqoff(softirq);
+}
+
+#ifdef CONFIG_USE_GENERIC_SMP_HELPERS
+static void remote_softirq_receive(void *data)
+{
+	struct call_single_data *cp = data;
+	unsigned long flags;
+	int softirq;
+
+	softirq = cp->priv;
+
+	local_irq_save(flags);
+	__local_trigger(cp, softirq);
+	local_irq_restore(flags);
+}
+
+static int __try_remote_softirq(struct call_single_data *cp, int cpu, int softirq)
+{
+	if (cpu_online(cpu)) {
+		cp->func = remote_softirq_receive;
+		cp->info = cp;
+		cp->flags = 0;
+		cp->priv = softirq;
+
+		__smp_call_function_single(cpu, cp);
+		return 0;
+	}
+	return 1;
+}
+#else /* CONFIG_USE_GENERIC_SMP_HELPERS */
+static int __try_remote_softirq(struct call_single_data *cp, int cpu, int softirq)
+{
+	return 1;
+}
+#endif
+
+/**
+ * __send_remote_softirq - try to schedule softirq work on a remote cpu
+ * @cp: private SMP call function data area
+ * @cpu: the remote cpu
+ * @this_cpu: the currently executing cpu
+ * @softirq: the softirq for the work
+ *
+ * Attempt to schedule softirq work on a remote cpu.  If this cannot be
+ * done, the work is instead queued up on the local cpu.
+ *
+ * Interrupts must be disabled.
+ */
+void __send_remote_softirq(struct call_single_data *cp, int cpu, int this_cpu, int softirq)
+{
+	if (cpu == this_cpu || __try_remote_softirq(cp, cpu, softirq))
+		__local_trigger(cp, softirq);
+}
+EXPORT_SYMBOL(__send_remote_softirq);
+
+/**
+ * send_remote_softirq - try to schedule softirq work on a remote cpu
+ * @cp: private SMP call function data area
+ * @cpu: the remote cpu
+ * @softirq: the softirq for the work
+ *
+ * Like __send_remote_softirq except that disabling interrupts and
+ * computing the current cpu is done for the caller.
+ */
+void send_remote_softirq(struct call_single_data *cp, int cpu, int softirq)
+{
+	unsigned long flags;
+	int this_cpu;
+
+	local_irq_save(flags);
+	this_cpu = smp_processor_id();
+	__send_remote_softirq(cp, cpu, this_cpu, softirq);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(send_remote_softirq);
+
+static int __cpuinit remote_softirq_cpu_notify(struct notifier_block *self,
+					       unsigned long action, void *hcpu)
+{
+	/*
+	 * If a CPU goes away, splice its entries to the current CPU
+	 * and trigger a run of the softirq
+	 */
+	if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
+		int cpu = (unsigned long) hcpu;
+		int i;
+
+		local_irq_disable();
+		for (i = 0; i < NR_SOFTIRQ; i++) {
+			struct list_head *head = &per_cpu(softirq_work_list[i], cpu);
+			struct list_head *local_head;
+
+			if (list_empty(head))
+				continue;
+
+			local_head = &__get_cpu_var(softirq_work_list[i]);
+			list_splice_init(head, local_head);
+			raise_softirq_irqoff(i);
+		}
+		local_irq_enable();
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata remote_softirq_cpu_notifier = {
+	.notifier_call	= remote_softirq_cpu_notify,
+};
+
 void __init softirq_init(void)
 {
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
+		int i;
+
 		per_cpu(tasklet_vec, cpu).tail =
 			&per_cpu(tasklet_vec, cpu).head;
 		per_cpu(tasklet_hi_vec, cpu).tail =
 			&per_cpu(tasklet_hi_vec, cpu).head;
+		for (i = 0; i < NR_SOFTIRQ; i++)
+			INIT_LIST_HEAD(&per_cpu(softirq_work_list[i], cpu));
 	}
 
+	register_hotcpu_notifier(&remote_softirq_cpu_notifier);
+
 	open_softirq(TASKLET_SOFTIRQ, tasklet_action);
 	open_softirq(HI_SOFTIRQ, tasklet_hi_action);
 }
-- 
1.5.6.5



^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2008-09-24  7:42 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-09-20  6:48 [PATCH 0/2]: Remote softirq invocation infrastructure David Miller
2008-09-20 15:29 ` Daniel Walker
2008-09-20 15:45   ` Arjan van de Ven
2008-09-20 16:02     ` Daniel Walker
2008-09-20 16:19       ` Arjan van de Ven
2008-09-20 17:40         ` Daniel Walker
2008-09-20 18:09           ` Arjan van de Ven
2008-09-20 18:52             ` Daniel Walker
2008-09-20 20:04               ` David Miller
2008-09-20 19:59           ` David Miller
2008-09-21  6:05             ` Herbert Xu
2008-09-21  6:57               ` David Miller
2008-09-22 10:36                 ` Ilpo Järvinen
2008-09-24  4:54                   ` Herbert Xu
2008-09-21  9:13               ` James Courtier-Dutton
2008-09-21  9:17                 ` David Miller
2008-09-21  9:46               ` Steffen Klassert
2008-09-22  8:23                 ` Herbert Xu
2008-09-22 13:54                   ` Steffen Klassert
2008-09-20 20:00       ` David Miller
2008-09-22 21:22 ` Chris Friesen
2008-09-22 22:12   ` David Miller
2008-09-23 17:03     ` Chris Friesen
2008-09-23 21:10       ` Tom Herbert
2008-09-23 21:51       ` David Miller
2008-09-24  7:42 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).