* [PATCH 0/2]: Remote softirq invocation infrastructure.
@ 2008-09-20 6:48 David Miller
2008-09-20 15:29 ` Daniel Walker
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: David Miller @ 2008-09-20 6:48 UTC (permalink / raw)
To: linux-kernel; +Cc: netdev, jens.axboe, steffen.klassert
Jens Axboe has written some hacks for the block layer that allow
queueing softirq work to remote cpus. In the context of the block
layer he used this facility to trigger the softirq block I/O
completion on the same cpu where the I/O was submitted.
I want to make use of a similar facility for networking, so we should
make this infrastructure generic.
It depends upon the generic SMP call function infrastructure, which
Jens wrote specifically to do these remote softirq hacks.
For each softirq there is a per-cpu list head which is where the work
is queued up.
If the platform doesn't support the generic SMP call function bits,
the work is queued onto the local cpu.
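To give an idea of the consumer side, the softirq handler simply
splices that per-cpu list and walks it, roughly like the sketch below
(MY_SOFTIRQ, struct my_work and process_work() are placeholder names
for illustration; the real code is in the patches):

        static void my_softirq_action(struct softirq_action *h)
        {
                struct list_head *cpu_list, local_list;

                /* Atomically grab everything queued for this softirq
                 * on this cpu.
                 */
                local_irq_disable();
                cpu_list = &__get_cpu_var(softirq_work_list[MY_SOFTIRQ]);
                list_replace_init(cpu_list, &local_list);
                local_irq_enable();

                while (!list_empty(&local_list)) {
                        struct my_work *w;

                        w = list_entry(local_list.next, struct my_work,
                                       csd.list);
                        list_del_init(&w->csd.list);
                        process_work(w);
                }
        }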
The first patch adds an NR_SOFTIRQS value so that we can size these
arrays by the actual number of softirqs instead of the magic number
"32" which is used now.
The second patch adds the infrastructure and provides interfaces to
invoke softirqs on remote cpus.
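From the producer side, you embed a call_single_data in whatever
object you want handled remotely and call the new helper; something
like this (struct my_work and MY_SOFTIRQ are again just placeholders):

        struct my_work {
                struct call_single_data csd;  /* list linkage + IPI data */
                /* ... per-work payload ... */
        };

        /* Ask for 'w' to be processed by MY_SOFTIRQ on 'cpu'.  If that
         * cpu is offline, or the architecture lacks the generic SMP
         * call function bits, the work is queued on the local cpu
         * instead.
         */
        send_remote_softirq(&w->csd, cpu, MY_SOFTIRQ);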
Jens, as stated, has block layer uses for this. I intend to use this
for receive side flow separation on non-multiqueue network cards, and
Steffen Klassert has a set of IPSEC parallelization changes that can
very likely make use of this.
These patches are against the current 2.6.27-rcX.
I would suggest that if nobody has any problems with this, we put it
into a GIT tree on kernel.org and any subsystem that wants to use it
can just pull that tree into their GIT tree. This way it doesn't
matter which tree Linus pulls in first; he'll get this infrastructure
correctly regardless of ordering.
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 6:48 [PATCH 0/2]: Remote softirq invocation infrastructure David Miller @ 2008-09-20 15:29 ` Daniel Walker 2008-09-20 15:45 ` Arjan van de Ven 2008-09-22 21:22 ` Chris Friesen 2008-09-24 7:42 ` David Miller 2 siblings, 1 reply; 26+ messages in thread From: Daniel Walker @ 2008-09-20 15:29 UTC (permalink / raw) To: David Miller; +Cc: linux-kernel, netdev, jens.axboe, steffen.klassert On Fri, 2008-09-19 at 23:48 -0700, David Miller wrote: > Jens Axboe has written some hacks for the block layer that allow > queueing softirq work to remote cpus. In the context of the block > layer he used this facility to trigger the softirq block I/O > completion on the same cpu where the I/O was submitted. > > Jen's, as stated, has block layer uses for this. I intend to use this > for receive side flow seperation on non-multiqueue network cards. And > Steffen Klassert has a set of IPSEC parallelization changes that can > very likely make use of this. What's the benefit that you (or Jens) sees from migrating softirqs from specific cpu's to others? Daniel ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 15:29 ` Daniel Walker @ 2008-09-20 15:45 ` Arjan van de Ven 2008-09-20 16:02 ` Daniel Walker 0 siblings, 1 reply; 26+ messages in thread From: Arjan van de Ven @ 2008-09-20 15:45 UTC (permalink / raw) To: Daniel Walker Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert On Sat, 20 Sep 2008 08:29:21 -0700 > > > Jen's, as stated, has block layer uses for this. I intend to use > > this for receive side flow seperation on non-multiqueue network > > cards. And Steffen Klassert has a set of IPSEC parallelization > > changes that can very likely make use of this. > > What's the benefit that you (or Jens) sees from migrating softirqs > from specific cpu's to others? it means you do all the processing on the CPU that submitted the IO in the first place, and likely still has the various metadata pieces in its CPU cache (or at least you know you won't need to bounce them over) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 15:45 ` Arjan van de Ven @ 2008-09-20 16:02 ` Daniel Walker 2008-09-20 16:19 ` Arjan van de Ven 2008-09-20 20:00 ` David Miller 0 siblings, 2 replies; 26+ messages in thread From: Daniel Walker @ 2008-09-20 16:02 UTC (permalink / raw) To: Arjan van de Ven Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert On Sat, 2008-09-20 at 08:45 -0700, Arjan van de Ven wrote: > On Sat, 20 Sep 2008 08:29:21 -0700 > > > > > Jen's, as stated, has block layer uses for this. I intend to use > > > this for receive side flow seperation on non-multiqueue network > > > cards. And Steffen Klassert has a set of IPSEC parallelization > > > changes that can very likely make use of this. > > > > What's the benefit that you (or Jens) sees from migrating softirqs > > from specific cpu's to others? > > it means you do all the processing on the CPU that submitted the IO in > the first place, and likely still has the various metadata pieces in > its CPU cache (or at least you know you won't need to bounce them over) In the case of networking and block I would think a lot of the softirq activity is asserted from userspace.. Maybe the scheduler shouldn't be migrating these tasks, or could take this softirq activity into account .. Daniel ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 16:02 ` Daniel Walker @ 2008-09-20 16:19 ` Arjan van de Ven 2008-09-20 17:40 ` Daniel Walker 2008-09-20 20:00 ` David Miller 1 sibling, 1 reply; 26+ messages in thread From: Arjan van de Ven @ 2008-09-20 16:19 UTC (permalink / raw) To: Daniel Walker Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert On Sat, 20 Sep 2008 09:02:09 -0700 Daniel Walker <dwalker@mvista.com> wrote: > On Sat, 2008-09-20 at 08:45 -0700, Arjan van de Ven wrote: > > On Sat, 20 Sep 2008 08:29:21 -0700 > > > > > > > Jen's, as stated, has block layer uses for this. I intend to > > > > use this for receive side flow seperation on non-multiqueue > > > > network cards. And Steffen Klassert has a set of IPSEC > > > > parallelization changes that can very likely make use of this. > > > > > > What's the benefit that you (or Jens) sees from migrating softirqs > > > from specific cpu's to others? > > > > it means you do all the processing on the CPU that submitted the IO > > in the first place, and likely still has the various metadata > > pieces in its CPU cache (or at least you know you won't need to > > bounce them over) > > > In the case of networking and block I would think a lot of the softirq > activity is asserted from userspace.. Maybe the scheduler shouldn't be > migrating these tasks, or could take this softirq activity into > account .. well a lot of it comes from completion interrupts. and moving userspace isn't a good option; think of the case of 1 nic but 4 apache processes doing the work... -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 16:19 ` Arjan van de Ven @ 2008-09-20 17:40 ` Daniel Walker 2008-09-20 18:09 ` Arjan van de Ven 2008-09-20 19:59 ` David Miller 0 siblings, 2 replies; 26+ messages in thread From: Daniel Walker @ 2008-09-20 17:40 UTC (permalink / raw) To: Arjan van de Ven Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert On Sat, 2008-09-20 at 09:19 -0700, Arjan van de Ven wrote: > On Sat, 20 Sep 2008 09:02:09 -0700 > Daniel Walker <dwalker@mvista.com> wrote: > > > On Sat, 2008-09-20 at 08:45 -0700, Arjan van de Ven wrote: > > > On Sat, 20 Sep 2008 08:29:21 -0700 > > > > > > > > > Jen's, as stated, has block layer uses for this. I intend to > > > > > use this for receive side flow seperation on non-multiqueue > > > > > network cards. And Steffen Klassert has a set of IPSEC > > > > > parallelization changes that can very likely make use of this. > > > > > > > > What's the benefit that you (or Jens) sees from migrating softirqs > > > > from specific cpu's to others? > > > > > > it means you do all the processing on the CPU that submitted the IO > > > in the first place, and likely still has the various metadata > > > pieces in its CPU cache (or at least you know you won't need to > > > bounce them over) > > > > > > In the case of networking and block I would think a lot of the softirq > > activity is asserted from userspace.. Maybe the scheduler shouldn't be > > migrating these tasks, or could take this softirq activity into > > account .. > > well a lot of it comes from completion interrupts. Yeah, partly I would think. > and moving userspace isn't a good option; think of the case of 1 nic > but 4 apache processes doing the work... > One nic, so one interrupt ? I guess we're talking about an SMP machine? It seems case dependent .. If you send a lot, or receive a lot.. BUT it's all speculation on my part.. Dave didn't supply the users of his code, or what kind of improvement was seen, or the case in which it would be needed. I think Dave knowns his subsystem, but the code on the surface looks like an end run around some other problem area.. Daniel ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 17:40 ` Daniel Walker @ 2008-09-20 18:09 ` Arjan van de Ven 2008-09-20 18:52 ` Daniel Walker 2008-09-20 19:59 ` David Miller 1 sibling, 1 reply; 26+ messages in thread From: Arjan van de Ven @ 2008-09-20 18:09 UTC (permalink / raw) To: Daniel Walker Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert On Sat, 20 Sep 2008 10:40:04 -0700 Daniel Walker <dwalker@mvista.com> wrote: > On Sat, 2008-09-20 at 09:19 -0700, Arjan van de Ven wrote: > > On Sat, 20 Sep 2008 09:02:09 -0700 > > Daniel Walker <dwalker@mvista.com> wrote: > > > > > On Sat, 2008-09-20 at 08:45 -0700, Arjan van de Ven wrote: > > > > On Sat, 20 Sep 2008 08:29:21 -0700 > > > > > > > > > > > Jen's, as stated, has block layer uses for this. I intend > > > > > > to use this for receive side flow seperation on > > > > > > non-multiqueue network cards. And Steffen Klassert has a > > > > > > set of IPSEC parallelization changes that can very likely > > > > > > make use of this. > > > > > > > > > > What's the benefit that you (or Jens) sees from migrating > > > > > softirqs from specific cpu's to others? > > > > > > > > it means you do all the processing on the CPU that submitted > > > > the IO in the first place, and likely still has the various > > > > metadata pieces in its CPU cache (or at least you know you > > > > won't need to bounce them over) > > > > > > > > > In the case of networking and block I would think a lot of the > > > softirq activity is asserted from userspace.. Maybe the scheduler > > > shouldn't be migrating these tasks, or could take this softirq > > > activity into account .. > > > > well a lot of it comes from completion interrupts. > > Yeah, partly I would think. completions trigger the next send as well (for both block and net) so it's quite common > > > and moving userspace isn't a good option; think of the case of 1 nic > > but 4 apache processes doing the work... > > > > One nic, so one interrupt ? I guess we're talking about an SMP > machine? or multicore doing IPI's for this on an UP machine is a rather boring exercise > > Dave didn't supply the users of his code, or what kind of improvement > was seen, or the case in which it would be needed. I think Dave knowns > his subsystem, but the code on the surface looks like an end run > around some other problem area.. it's very fundamental, and has been talked about at various conferences as well. the basic problem is that the submitter of the IO (be it block or net) creates a ton of metadata state on submit, and ideally the completion processing happens on the same CPU, for two reasons 1) to use the state in the cache 2) for the case where you touch userland data/structures, we assume the scheduler kept affinity it's a Moses-to-the-Mountain problem, except we have four Moses' but only one Mountain. Or in CS terms: we move the work to the CPU where the userland is rather than moving the userland to the IRQ CPU, since there is usually only one IRQ but many userlands and many cpu cores. (for the UP case this is all very irrelevant obviously) I assume Dave will pipe in if he disagrees with me ;-) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 18:09 ` Arjan van de Ven @ 2008-09-20 18:52 ` Daniel Walker 2008-09-20 20:04 ` David Miller 0 siblings, 1 reply; 26+ messages in thread From: Daniel Walker @ 2008-09-20 18:52 UTC (permalink / raw) To: Arjan van de Ven Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert On Sat, 2008-09-20 at 11:09 -0700, Arjan van de Ven wrote: > > > > Dave didn't supply the users of his code, or what kind of improvement > > was seen, or the case in which it would be needed. I think Dave knowns > > his subsystem, but the code on the surface looks like an end run > > around some other problem area.. > > it's very fundamental, and has been talked about at various conferences > as well. At least you understand that not everyone goes to conferences.. > the basic problem is that the submitter of the IO (be it block or net) > creates a ton of metadata state on submit, and ideally the completion > processing happens on the same CPU, for two reasons > 1) to use the state in the cache > 2) for the case where you touch userland data/structures, we assume the > scheduler kept affinity > > it's a Moses-to-the-Mountain problem, except we have four Moses' but > only one Mountain. > > Or in CS terms: we move the work to the CPU where the userland is > rather than moving the userland to the IRQ CPU, since there is usually > only one IRQ but many userlands and many cpu cores. There must be some kind of trade off here .. There's a fairly good performance gain from have the softirq asserted and run on the same cpu since it runs in interrupt context right after the interrupt. If you move the softirq to another cpu then you have to re-assert and either wait for ksoftirqd to handle it or wait for an interrupt on the new cpu .. Neither is very predictable.. All that vs. bouncing data around the caches.. To what degree has all that been handled or thought about? Daniel ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 18:52 ` Daniel Walker @ 2008-09-20 20:04 ` David Miller 0 siblings, 0 replies; 26+ messages in thread From: David Miller @ 2008-09-20 20:04 UTC (permalink / raw) To: dwalker; +Cc: arjan, linux-kernel, netdev, jens.axboe, steffen.klassert From: Daniel Walker <dwalker@mvista.com> Date: Sat, 20 Sep 2008 11:52:31 -0700 > All that vs. bouncing data around the caches.. To what degree has all > that been handled or thought about? Give concrete things to discuss or just be quiet. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 17:40 ` Daniel Walker 2008-09-20 18:09 ` Arjan van de Ven @ 2008-09-20 19:59 ` David Miller 2008-09-21 6:05 ` Herbert Xu 1 sibling, 1 reply; 26+ messages in thread From: David Miller @ 2008-09-20 19:59 UTC (permalink / raw) To: dwalker; +Cc: arjan, linux-kernel, netdev, jens.axboe, steffen.klassert From: Daniel Walker <dwalker@mvista.com> Date: Sat, 20 Sep 2008 10:40:04 -0700 > Dave didn't supply the users of his code, or what kind of improvement > was seen, or the case in which it would be needed. I think Dave knowns > his subsystem, but the code on the surface looks like an end run around > some other problem area.. I posted an example use case on netdev a few days ago, and the block layer example is in Jen's block layer tree. It's for networking cards that don't do flow seperation on receive using multiple RX queues and MSI-X interrupts. It's also for things like IPSEC where the per-packet cpu usage is so huge (to do the crypto) that it makes sense to even split up the work to multiple cpus within the same flow. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 19:59 ` David Miller @ 2008-09-21 6:05 ` Herbert Xu 2008-09-21 6:57 ` David Miller ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Herbert Xu @ 2008-09-21 6:05 UTC (permalink / raw) To: David Miller Cc: dwalker, arjan, linux-kernel, netdev, jens.axboe, steffen.klassert David Miller <davem@davemloft.net> wrote: > > receive using multiple RX queues and MSI-X interrupts. It's > also for things like IPSEC where the per-packet cpu usage > is so huge (to do the crypto) that it makes sense to even > split up the work to multiple cpus within the same flow. Unfortunately doing this with IPsec is going to be non-trivial since we still want to maintain packet ordering inside IPsec and you don't get the inner flow information until you decrypt the packet. So if we want to process IPsec packets in parallel it's best to implement that from within the crypto API where we can queue the result in order to ensure proper ordering. Of course, we need to balance any effort spent on this with the likelihood that hardware improvements will soon make this obsolete (for IPsec anyway). Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-21 6:05 ` Herbert Xu @ 2008-09-21 6:57 ` David Miller 2008-09-22 10:36 ` Ilpo Järvinen 2008-09-21 9:13 ` James Courtier-Dutton 2008-09-21 9:46 ` Steffen Klassert 2 siblings, 1 reply; 26+ messages in thread From: David Miller @ 2008-09-21 6:57 UTC (permalink / raw) To: herbert; +Cc: dwalker, arjan, linux-kernel, netdev, jens.axboe, steffen.klassert From: Herbert Xu <herbert@gondor.apana.org.au> Date: Sun, 21 Sep 2008 15:05:45 +0900 > Unfortunately doing this with IPsec is going to be non-trivial > since we still want to maintain packet ordering inside IPsec > and you don't get the inner flow information until you decrypt > the packet. Steffen has mechanisms by which to deal with this in his patches. > So if we want to process IPsec packets in parallel it's best to > implement that from within the crypto API where we can queue the > result in order to ensure proper ordering. That's another option, of course. And crypto could use remote softirqs even for that :-) > Of course, we need to balance any effort spent on this with the > likelihood that hardware improvements will soon make this obsolete > (for IPsec anyway). True, but old hardware will always exist. A lot of very reasonable machines out there will benefit from software RX flow seperation. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-21 6:57 ` David Miller @ 2008-09-22 10:36 ` Ilpo Järvinen 2008-09-24 4:54 ` Herbert Xu 0 siblings, 1 reply; 26+ messages in thread From: Ilpo Järvinen @ 2008-09-22 10:36 UTC (permalink / raw) To: David Miller Cc: Herbert Xu, dwalker, arjan, LKML, Netdev, jens.axboe, steffen.klassert On Sat, 20 Sep 2008, David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Sun, 21 Sep 2008 15:05:45 +0900 > > > Of course, we need to balance any effort spent on this with the > > likelihood that hardware improvements will soon make this obsolete > > (for IPsec anyway). > > True, but old hardware will always exist. ...Also, producing buggy hardware will not suddently just vanish either ("can you please turn of ipsec offloading and see if you can still reproduce" :-))... -- i. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-22 10:36 ` Ilpo Järvinen @ 2008-09-24 4:54 ` Herbert Xu 0 siblings, 0 replies; 26+ messages in thread From: Herbert Xu @ 2008-09-24 4:54 UTC (permalink / raw) To: Ilpo Järvinen Cc: David Miller, dwalker, arjan, LKML, Netdev, jens.axboe, steffen.klassert On Mon, Sep 22, 2008 at 01:36:24PM +0300, Ilpo Järvinen wrote: > On Sat, 20 Sep 2008, David Miller wrote: > > > From: Herbert Xu <herbert@gondor.apana.org.au> > > Date: Sun, 21 Sep 2008 15:05:45 +0900 > > > > > Of course, we need to balance any effort spent on this with the > > > likelihood that hardware improvements will soon make this obsolete > > > (for IPsec anyway). > > > > True, but old hardware will always exist. > > ...Also, producing buggy hardware will not suddently just vanish either > ("can you please turn of ipsec offloading and see if you can still > reproduce" :-))... That's fine. If your AES hardware is buggy you just fall back to using the software version. In fact if this was done through the crypto API it would even happen automatically. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-21 6:05 ` Herbert Xu 2008-09-21 6:57 ` David Miller @ 2008-09-21 9:13 ` James Courtier-Dutton 2008-09-21 9:17 ` David Miller 2008-09-21 9:46 ` Steffen Klassert 2 siblings, 1 reply; 26+ messages in thread From: James Courtier-Dutton @ 2008-09-21 9:13 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, dwalker, arjan, linux-kernel, netdev, jens.axboe, steffen.klassert Herbert Xu wrote: > David Miller <davem@davemloft.net> wrote: >> receive using multiple RX queues and MSI-X interrupts. It's >> also for things like IPSEC where the per-packet cpu usage >> is so huge (to do the crypto) that it makes sense to even >> split up the work to multiple cpus within the same flow. > > Unfortunately doing this with IPsec is going to be non-trivial > since we still want to maintain packet ordering inside IPsec > and you don't get the inner flow information until you decrypt > the packet. > Why do you have to preserve packet ordering? TCP/IP does not preserve packet ordering across the network. IPSEC uses a sliding window for anti-relay detection precisely because it has to be able to handle out-of-order packets. Sharing the sliding window between CPUs might be interesting! James ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-21 9:13 ` James Courtier-Dutton @ 2008-09-21 9:17 ` David Miller 0 siblings, 0 replies; 26+ messages in thread From: David Miller @ 2008-09-21 9:17 UTC (permalink / raw) To: James Cc: herbert, dwalker, arjan, linux-kernel, netdev, jens.axboe, steffen.klassert From: James Courtier-Dutton <James@superbug.co.uk> Date: Sun, 21 Sep 2008 10:13:47 +0100 > Why do you have to preserve packet ordering? > TCP/IP does not preserve packet ordering across the network. Yes, but we should preserve per-flow ordering as much as possible within the local system for optimal performance. Things fall apart completely, even with TCP, once you reorder more than 2 or 3 packets deep. > Sharing the sliding window between CPUs might be interesting! Again, Steffen's patches take care of this issue. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-21 6:05 ` Herbert Xu 2008-09-21 6:57 ` David Miller 2008-09-21 9:13 ` James Courtier-Dutton @ 2008-09-21 9:46 ` Steffen Klassert 2008-09-22 8:23 ` Herbert Xu 2 siblings, 1 reply; 26+ messages in thread From: Steffen Klassert @ 2008-09-21 9:46 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, dwalker, arjan, linux-kernel, netdev, jens.axboe On Sun, Sep 21, 2008 at 03:05:45PM +0900, Herbert Xu wrote: > David Miller <davem@davemloft.net> wrote: > > > > receive using multiple RX queues and MSI-X interrupts. It's > > also for things like IPSEC where the per-packet cpu usage > > is so huge (to do the crypto) that it makes sense to even > > split up the work to multiple cpus within the same flow. > > Unfortunately doing this with IPsec is going to be non-trivial > since we still want to maintain packet ordering inside IPsec > and you don't get the inner flow information until you decrypt > the packet. It's non-trivial but possible. I have a test implementation that runs the whole IP layer in parallel. The basic idea to keep track of the packet ordering is to give the packets sequence numbers befor we run in parallel. Befor we push the packets to the upper layers or to the neighboring subsystem I have a mechanism that brings them back to the right order. With my test environment (two quad core boxes) I get with IPSEC aes192-sha1 and one tcp stream a throughput of about 600 Mbit/s compared to about 200 Mbit/s without the parallel processing. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-21 9:46 ` Steffen Klassert @ 2008-09-22 8:23 ` Herbert Xu 2008-09-22 13:54 ` Steffen Klassert 0 siblings, 1 reply; 26+ messages in thread From: Herbert Xu @ 2008-09-22 8:23 UTC (permalink / raw) To: Steffen Klassert Cc: David Miller, dwalker, arjan, linux-kernel, netdev, jens.axboe On Sun, Sep 21, 2008 at 11:46:28AM +0200, Steffen Klassert wrote: > > It's non-trivial but possible. I have a test implementation that > runs the whole IP layer in parallel. The basic idea to keep track > of the packet ordering is to give the packets sequence numbers > befor we run in parallel. Befor we push the packets to the upper > layers or to the neighboring subsystem I have a mechanism that > brings them back to the right order. Yes that should do the trick. > With my test environment (two quad core boxes) I get with IPSEC > aes192-sha1 and one tcp stream a throughput of about 600 Mbit/s > compared to about 200 Mbit/s without the parallel processing. Yes this would definitely help IPsec. However, I'm not so sure of its benefit to routing and other parts of networking. That's why I'd rather have this sort of hack stay in the crypto system where it's isolated rather than having it proliferate throughout the network stack. When the time comes to weed out this because all CPUs that matter have encryption in hardware then it'll be much easier to delete a crypto algorithm as opposed to removing parts of the network infrastructure :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-22 8:23 ` Herbert Xu @ 2008-09-22 13:54 ` Steffen Klassert 0 siblings, 0 replies; 26+ messages in thread From: Steffen Klassert @ 2008-09-22 13:54 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, dwalker, arjan, linux-kernel, netdev, jens.axboe On Mon, Sep 22, 2008 at 04:23:09PM +0800, Herbert Xu wrote: > > > With my test environment (two quad core boxes) I get with IPSEC > > aes192-sha1 and one tcp stream a throughput of about 600 Mbit/s > > compared to about 200 Mbit/s without the parallel processing. > > Yes this would definitely help IPsec. However, I'm not so sure > of its benefit to routing and other parts of networking. That's > why I'd rather have this sort of hack stay in the crypto system > where it's isolated rather than having it proliferate throughout > the network stack. The crypto benefits the most of course, but routing and xfrm lookups could benefit on bigger networks too. However, the method to bring the packets back to order is quite generic and could be used even in the crypto system. The important thing for me is that we can run in parallel even if we have just one flow. > > When the time comes to weed out this because all CPUs that matter > have encryption in hardware then it'll be much easier to delete a > crypto algorithm as opposed to removing parts of the network > infrastructure :) > Yes, if you think about how to remove it I agree here. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 16:02 ` Daniel Walker 2008-09-20 16:19 ` Arjan van de Ven @ 2008-09-20 20:00 ` David Miller 1 sibling, 0 replies; 26+ messages in thread From: David Miller @ 2008-09-20 20:00 UTC (permalink / raw) To: dwalker; +Cc: arjan, linux-kernel, netdev, jens.axboe, steffen.klassert From: Daniel Walker <dwalker@mvista.com> Date: Sat, 20 Sep 2008 09:02:09 -0700 > In the case of networking and block I would think a lot of the softirq > activity is asserted from userspace.. Maybe the scheduler shouldn't be > migrating these tasks, or could take this softirq activity into > account .. Absolutely wrong. On a per-flow basis you want to push the work down as far as possible down to individual cpus. Why do you think the hardware folks are devoting silicon to RX multiqueue facilities that spread the RX work amongst available cpus using MSI-X? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 6:48 [PATCH 0/2]: Remote softirq invocation infrastructure David Miller 2008-09-20 15:29 ` Daniel Walker @ 2008-09-22 21:22 ` Chris Friesen 2008-09-22 22:12 ` David Miller 2008-09-24 7:42 ` David Miller 2 siblings, 1 reply; 26+ messages in thread From: Chris Friesen @ 2008-09-22 21:22 UTC (permalink / raw) To: David Miller; +Cc: linux-kernel, netdev, jens.axboe, steffen.klassert David Miller wrote: > Jens Axboe has written some hacks for the block layer that allow > queueing softirq work to remote cpus. In the context of the block > layer he used this facility to trigger the softirq block I/O > completion on the same cpu where the I/O was submitted. <snip> > I intend to use this > for receive side flow seperation on non-multiqueue network cards. I'm not sure this belongs in this particular thread but I was interested in how you're planning on doing this? Is there going to be a way for userspace to specify which traffic flows they'd like to direct to particular cpus, or will the kernel try to figure it out on the fly? We have application guys that would like very much to be able to nail specific apps to specific cores and have the kernel send all their packets to those cores for processing. Chris ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-22 21:22 ` Chris Friesen @ 2008-09-22 22:12 ` David Miller 2008-09-23 17:03 ` Chris Friesen 0 siblings, 1 reply; 26+ messages in thread From: David Miller @ 2008-09-22 22:12 UTC (permalink / raw) To: cfriesen; +Cc: linux-kernel, netdev, jens.axboe, steffen.klassert From: "Chris Friesen" <cfriesen@nortel.com> Date: Mon, 22 Sep 2008 15:22:36 -0600 > I'm not sure this belongs in this particular thread but I was > interested in how you're planning on doing this? Something like this patch which I posted last week on netdev. net: Do software flow seperation on receive. Push netif_receive_skb() work to remote cpus via flow hashing and remove softirqs. Signed-off-by: David S. Miller <davem@davemloft.net> --- include/linux/interrupt.h | 1 + include/linux/netdevice.h | 2 - include/linux/skbuff.h | 3 + net/core/dev.c | 273 +++++++++++++++++++++++++-------------------- 4 files changed, 157 insertions(+), 122 deletions(-) diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 806b38f..223e68f 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -247,6 +247,7 @@ enum TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ, + NET_RECEIVE_SOFTIRQ, BLOCK_SOFTIRQ, TASKLET_SOFTIRQ, SCHED_SOFTIRQ, diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 488c56e..a044caa 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -965,11 +965,9 @@ static inline int unregister_gifconf(unsigned int family) struct softnet_data { struct Qdisc *output_queue; - struct sk_buff_head input_pkt_queue; struct list_head poll_list; struct sk_buff *completion_queue; - struct napi_struct backlog; #ifdef CONFIG_NET_DMA struct dma_chan *net_dma; #endif diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 9099237..e36bc86 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -18,6 +18,7 @@ #include <linux/compiler.h> #include <linux/time.h> #include <linux/cache.h> +#include <linux/smp.h> #include <asm/atomic.h> #include <asm/types.h> @@ -255,6 +256,8 @@ struct sk_buff { struct sk_buff *next; struct sk_buff *prev; + struct call_single_data csd; + struct sock *sk; ktime_t tstamp; struct net_device *dev; diff --git a/net/core/dev.c b/net/core/dev.c index e719ed2..09827c7 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1660,8 +1660,8 @@ out_kfree_skb: return 0; } -static u32 simple_tx_hashrnd; -static int simple_tx_hashrnd_initialized = 0; +static u32 simple_hashrnd; +static int simple_hashrnd_initialized = 0; static u16 simple_tx_hash(struct net_device *dev, struct sk_buff *skb) { @@ -1669,9 +1669,9 @@ static u16 simple_tx_hash(struct net_device *dev, struct sk_buff *skb) u32 hash, ihl; u8 ip_proto; - if (unlikely(!simple_tx_hashrnd_initialized)) { - get_random_bytes(&simple_tx_hashrnd, 4); - simple_tx_hashrnd_initialized = 1; + if (unlikely(!simple_hashrnd_initialized)) { + get_random_bytes(&simple_hashrnd, 4); + simple_hashrnd_initialized = 1; } switch (skb->protocol) { @@ -1708,7 +1708,7 @@ static u16 simple_tx_hash(struct net_device *dev, struct sk_buff *skb) break; } - hash = jhash_3words(addr1, addr2, ports, simple_tx_hashrnd); + hash = jhash_3words(addr1, addr2, ports, simple_hashrnd); return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32); } @@ -1878,75 +1878,6 @@ int weight_p __read_mostly = 64; /* old backlog weight */ DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, }; -/** - * netif_rx - post buffer to the network code - * @skb: buffer to post 
- * - * This function receives a packet from a device driver and queues it for - * the upper (protocol) levels to process. It always succeeds. The buffer - * may be dropped during processing for congestion control or by the - * protocol layers. - * - * return values: - * NET_RX_SUCCESS (no congestion) - * NET_RX_DROP (packet was dropped) - * - */ - -int netif_rx(struct sk_buff *skb) -{ - struct softnet_data *queue; - unsigned long flags; - - /* if netpoll wants it, pretend we never saw it */ - if (netpoll_rx(skb)) - return NET_RX_DROP; - - if (!skb->tstamp.tv64) - net_timestamp(skb); - - /* - * The code is rearranged so that the path is the most - * short when CPU is congested, but is still operating. - */ - local_irq_save(flags); - queue = &__get_cpu_var(softnet_data); - - __get_cpu_var(netdev_rx_stat).total++; - if (queue->input_pkt_queue.qlen <= netdev_max_backlog) { - if (queue->input_pkt_queue.qlen) { -enqueue: - __skb_queue_tail(&queue->input_pkt_queue, skb); - local_irq_restore(flags); - return NET_RX_SUCCESS; - } - - napi_schedule(&queue->backlog); - goto enqueue; - } - - __get_cpu_var(netdev_rx_stat).dropped++; - local_irq_restore(flags); - - kfree_skb(skb); - return NET_RX_DROP; -} - -int netif_rx_ni(struct sk_buff *skb) -{ - int err; - - preempt_disable(); - err = netif_rx(skb); - if (local_softirq_pending()) - do_softirq(); - preempt_enable(); - - return err; -} - -EXPORT_SYMBOL(netif_rx_ni); - static void net_tx_action(struct softirq_action *h) { struct softnet_data *sd = &__get_cpu_var(softnet_data); @@ -2177,7 +2108,7 @@ void netif_nit_deliver(struct sk_buff *skb) * NET_RX_SUCCESS: no congestion * NET_RX_DROP: packet was dropped */ -int netif_receive_skb(struct sk_buff *skb) +static int __netif_receive_skb(struct sk_buff *skb) { struct packet_type *ptype, *pt_prev; struct net_device *orig_dev; @@ -2185,10 +2116,6 @@ int netif_receive_skb(struct sk_buff *skb) int ret = NET_RX_DROP; __be16 type; - /* if we've gotten here through NAPI, check netpoll */ - if (netpoll_receive_skb(skb)) - return NET_RX_DROP; - if (!skb->tstamp.tv64) net_timestamp(skb); @@ -2275,45 +2202,152 @@ out: return ret; } -/* Network device is going away, flush any packets still pending */ -static void flush_backlog(void *arg) +static void net_receive_action(struct softirq_action *h) { - struct net_device *dev = arg; - struct softnet_data *queue = &__get_cpu_var(softnet_data); - struct sk_buff *skb, *tmp; + struct list_head *cpu_list, local_list; - skb_queue_walk_safe(&queue->input_pkt_queue, skb, tmp) - if (skb->dev == dev) { - __skb_unlink(skb, &queue->input_pkt_queue); - kfree_skb(skb); - } + local_irq_disable(); + cpu_list = &__get_cpu_var(softirq_work_list[NET_RECEIVE_SOFTIRQ]); + list_replace_init(cpu_list, &local_list); + local_irq_enable(); + + while (!list_empty(&local_list)) { + struct sk_buff *skb; + + skb = list_entry(local_list.next, struct sk_buff, csd.list); + list_del_init(&skb->csd.list); + __netif_receive_skb(skb); + } } -static int process_backlog(struct napi_struct *napi, int quota) +static u16 *rxflow_cpu_map; +static int rxflow_num_cpus; + +/* skb->data points at the network header, but that is the only thing + * we can rely upon. 
+ */ +static u16 simple_rx_hash(struct sk_buff *skb) { - int work = 0; - struct softnet_data *queue = &__get_cpu_var(softnet_data); - unsigned long start_time = jiffies; + u32 addr1, addr2, ports; + struct ipv6hdr *ip6; + struct iphdr *ip; + u32 hash, ihl; + u8 ip_proto; - napi->weight = weight_p; - do { - struct sk_buff *skb; + if (unlikely(!simple_hashrnd_initialized)) { + get_random_bytes(&simple_hashrnd, 4); + simple_hashrnd_initialized = 1; + } - local_irq_disable(); - skb = __skb_dequeue(&queue->input_pkt_queue); - if (!skb) { - __napi_complete(napi); - local_irq_enable(); - break; - } - local_irq_enable(); + switch (skb->protocol) { + case __constant_htons(ETH_P_IP): + if (!pskb_may_pull(skb, sizeof(*ip))) + return 0; - netif_receive_skb(skb); - } while (++work < quota && jiffies == start_time); + ip = (struct iphdr *) skb->data; + ip_proto = ip->protocol; + addr1 = ip->saddr; + addr2 = ip->daddr; + ihl = ip->ihl; + break; + case __constant_htons(ETH_P_IPV6): + if (!pskb_may_pull(skb, sizeof(*ip6))) + return 0; + + ip6 = (struct ipv6hdr *) skb->data; + ip_proto = ip6->nexthdr; + addr1 = ip6->saddr.s6_addr32[3]; + addr2 = ip6->daddr.s6_addr32[3]; + ihl = (40 >> 2); + break; + default: + return 0; + } + + ports = 0; + switch (ip_proto) { + case IPPROTO_TCP: + case IPPROTO_UDP: + case IPPROTO_DCCP: + case IPPROTO_ESP: + case IPPROTO_AH: + case IPPROTO_SCTP: + case IPPROTO_UDPLITE: + if (pskb_may_pull(skb, (ihl * 4) + 4)) + ports = *((u32 *) (skb->data + (ihl * 4))); + break; - return work; + default: + break; + } + + hash = jhash_3words(addr1, addr2, ports, simple_hashrnd); + + return (u16) (((u64) hash * rxflow_num_cpus) >> 32); } +/* Since we are already in softirq context via NAPI, it makes no + * sense to reschedule a softirq locally, so we optimize that case. + */ +int netif_receive_skb(struct sk_buff *skb) +{ + int target_cpu, this_cpu, do_direct; + unsigned long flags; + + /* If we've gotten here through NAPI, check netpoll. This part + * has to be synchronous and not get pushed to remote softirq + * receive packet processing. 
+ */ + if (netpoll_receive_skb(skb)) + return NET_RX_DROP; + + target_cpu = rxflow_cpu_map[simple_rx_hash(skb)]; + + local_irq_save(flags); + this_cpu = smp_processor_id(); + do_direct = 0; + if (target_cpu != this_cpu) + __send_remote_softirq(&skb->csd, target_cpu, this_cpu, NET_RECEIVE_SOFTIRQ); + else + do_direct = 1; + + local_irq_restore(flags); + + if (do_direct) + return __netif_receive_skb(skb); + + return NET_RX_SUCCESS; +} + +int netif_rx(struct sk_buff *skb) +{ + int target_cpu; + + /* if netpoll wants it, pretend we never saw it */ + if (netpoll_rx(skb)) + return NET_RX_DROP; + + target_cpu = rxflow_cpu_map[simple_rx_hash(skb)]; + send_remote_softirq(&skb->csd, target_cpu, NET_RECEIVE_SOFTIRQ); + + return NET_RX_SUCCESS; +} + +int netif_rx_ni(struct sk_buff *skb) +{ + int err; + + preempt_disable(); + err = netif_rx(skb); + if (local_softirq_pending()) + do_softirq(); + preempt_enable(); + + return err; +} + +EXPORT_SYMBOL(netif_rx_ni); + /** * __napi_schedule - schedule for receive * @n: entry to schedule @@ -4182,8 +4216,6 @@ void netdev_run_todo(void) dev->reg_state = NETREG_UNREGISTERED; - on_each_cpu(flush_backlog, dev, 1); - netdev_wait_allrefs(dev); /* paranoia */ @@ -4489,7 +4521,6 @@ static int dev_cpu_callback(struct notifier_block *nfb, { struct sk_buff **list_skb; struct Qdisc **list_net; - struct sk_buff *skb; unsigned int cpu, oldcpu = (unsigned long)ocpu; struct softnet_data *sd, *oldsd; @@ -4520,10 +4551,6 @@ static int dev_cpu_callback(struct notifier_block *nfb, raise_softirq_irqoff(NET_TX_SOFTIRQ); local_irq_enable(); - /* Process offline CPU's input_pkt_queue */ - while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) - netif_rx(skb); - return NOTIFY_OK; } @@ -4793,7 +4820,7 @@ static struct pernet_operations __net_initdata default_device_ops = { */ static int __init net_dev_init(void) { - int i, rc = -ENOMEM; + int i, index, rc = -ENOMEM; BUG_ON(!dev_boot_phase); @@ -4813,6 +4840,15 @@ static int __init net_dev_init(void) if (register_pernet_device(&default_device_ops)) goto out; + rxflow_cpu_map = kzalloc(sizeof(u16) * num_possible_cpus(), GFP_KERNEL); + if (!rxflow_cpu_map) + goto out; + rxflow_num_cpus = num_online_cpus(); + + index = 0; + for_each_online_cpu(i) + rxflow_cpu_map[index++] = i; + /* * Initialise the packet receive queues. */ @@ -4821,12 +4857,8 @@ static int __init net_dev_init(void) struct softnet_data *queue; queue = &per_cpu(softnet_data, i); - skb_queue_head_init(&queue->input_pkt_queue); queue->completion_queue = NULL; INIT_LIST_HEAD(&queue->poll_list); - - queue->backlog.poll = process_backlog; - queue->backlog.weight = weight_p; } netdev_dma_register(); @@ -4835,6 +4867,7 @@ static int __init net_dev_init(void) open_softirq(NET_TX_SOFTIRQ, net_tx_action); open_softirq(NET_RX_SOFTIRQ, net_rx_action); + open_softirq(NET_RECEIVE_SOFTIRQ, net_receive_action); hotcpu_notifier(dev_cpu_callback, 0); dst_init(); -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-22 22:12 ` David Miller @ 2008-09-23 17:03 ` Chris Friesen 2008-09-23 21:10 ` Tom Herbert 2008-09-23 21:51 ` David Miller 0 siblings, 2 replies; 26+ messages in thread From: Chris Friesen @ 2008-09-23 17:03 UTC (permalink / raw) To: David Miller; +Cc: linux-kernel, netdev, jens.axboe, steffen.klassert David Miller wrote: > From: "Chris Friesen" <cfriesen@nortel.com> > Date: Mon, 22 Sep 2008 15:22:36 -0600 > >> I'm not sure this belongs in this particular thread but I was >> interested in how you're planning on doing this? > > Something like this patch which I posted last week on > netdev. That patch basically just picks an arbitrary cpu for each flow. This would spread the load out across cpus, but it doesn't allow any input from userspace. We have a current application where there are 16 cores and 16 threads. They would really like to be able to pin one thread to each core and tell the kernel what packets they're interested in so that the kernel can process those packets on that core to gain the maximum caching benefit as well as reduce reordering issues. In our case the hardware supports filtering for multiqueues, so we could pass this information down to the hardware to avoid software filtering. Either way, it requires some way for userspace to indicate interest in a particular flow. Has anyone given any thought to what an API like this would look like? I suppose we could automatically look at bound network sockets owned by tasks that are affined to single cpus. This would simplify userspace but would reduce flexibility for things like packet sockets with socket filters applied. Chris ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-23 17:03 ` Chris Friesen @ 2008-09-23 21:10 ` Tom Herbert 2008-09-23 21:51 ` David Miller 1 sibling, 0 replies; 26+ messages in thread From: Tom Herbert @ 2008-09-23 21:10 UTC (permalink / raw) To: Chris Friesen Cc: David Miller, linux-kernel, netdev, jens.axboe, steffen.klassert > > That patch basically just picks an arbitrary cpu for each flow. This would > spread the load out across cpus, but it doesn't allow any input from > userspace. > We've been running softRSS for a while (http://marc.info/?l=linux-netdev&m=120475045519940&w=2) which I believe has very similar functionality to this patch. From this work we found some nice ways to improve scaling that might be applicable: - When routing packets to CPU based on hash, sending to another CPU sharing L2 or L3 cache is best performance. - We added a simple functionality to route packets to the CPU on which the application last did a read for the socket. This seems to be a win for cache locality. - We added a lookup table that maps the Toeplitz hash to the receiving CPU where the application is running. This is for those devices that provide the Toeplitz hash in the receive descriptor. This is a win since the CPU receiving the interrupt doesn't need to take any cache misses on the packet itself. - In our (preliminary) 10G testing we found that routing packets in software with the the above trick actually allows higher PPS and better CPU utilization than using hardware RSS. Also, using both the software routing and hardware RSS yields the best results. Tom > We have a current application where there are 16 cores and 16 threads. They > would really like to be able to pin one thread to each core and tell the > kernel what packets they're interested in so that the kernel can process > those packets on that core to gain the maximum caching benefit as well as > reduce reordering issues. In our case the hardware supports filtering for > multiqueues, so we could pass this information down to the hardware to avoid > software filtering. > > Either way, it requires some way for userspace to indicate interest in a > particular flow. Has anyone given any thought to what an API like this > would look like? > > I suppose we could automatically look at bound network sockets owned by > tasks that are affined to single cpus. This would simplify userspace but > would reduce flexibility for things like packet sockets with socket filters > applied. > > Chris > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-23 17:03 ` Chris Friesen 2008-09-23 21:10 ` Tom Herbert @ 2008-09-23 21:51 ` David Miller 1 sibling, 0 replies; 26+ messages in thread From: David Miller @ 2008-09-23 21:51 UTC (permalink / raw) To: cfriesen; +Cc: linux-kernel, netdev, jens.axboe, steffen.klassert From: "Chris Friesen" <cfriesen@nortel.com> Date: Tue, 23 Sep 2008 11:03:48 -0600 > That patch basically just picks an arbitrary cpu for each flow. > This would spread the load out across cpus, but it doesn't allow any > input from userspace. With hardware RX flow seperation, the same exact thing happens. > We have a current application where there are 16 cores and 16 > threads. They would really like to be able to pin one thread to each > core and tell the kernel what packets they're interested in so that > the kernel can process those packets on that core to gain the > maximum caching benefit as well as reduce reordering issues. In our > case the hardware supports filtering for multiqueues, so we could > pass this information down to the hardware to avoid software > filtering. > > Either way, it requires some way for userspace to indicate interest > in a particular flow. Has anyone given any thought to what an API > like this would look like? Many cards cannot configure this, but yes we should allow an interface to configure RX flow seperation preferences, and we do plan on adding that at some point. It's probably be an ethtool operation of some sort. We already have a minimalistic RX flow hashing configuration knob, see ETHTOOL_GRXFH and ETHTOOL_SRXFH. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2]: Remote softirq invocation infrastructure. 2008-09-20 6:48 [PATCH 0/2]: Remote softirq invocation infrastructure David Miller 2008-09-20 15:29 ` Daniel Walker 2008-09-22 21:22 ` Chris Friesen @ 2008-09-24 7:42 ` David Miller 2 siblings, 0 replies; 26+ messages in thread From: David Miller @ 2008-09-24 7:42 UTC (permalink / raw) To: linux-kernel; +Cc: netdev, jens.axboe, steffen.klassert From: David Miller <davem@davemloft.net> Date: Fri, 19 Sep 2008 23:48:24 -0700 (PDT) > Jens Axboe has written some hacks for the block layer that allow > queueing softirq work to remote cpus. In the context of the block > layer he used this facility to trigger the softirq block I/O > completion on the same cpu where the I/O was submitted. As a followup to this, I've refreshed my patches and put them in a tree cloned from Linus's current GIT tree: master.kernel.org:/pub/scm/linux/kernel/git/davem/softirq-2.6.git I made minor touchups to the second patch, such as adding a few more descriptive comments, and adding the missing export of the softirq_work list array. Updated version below for reference: softirq: Add support for triggering softirq work on softirqs. This is basically a genericization of Jens Axboe's block layer remote softirq changes. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> --- include/linux/interrupt.h | 21 +++++++ include/linux/smp.h | 4 +- kernel/softirq.c | 129 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 153 insertions(+), 1 deletions(-) diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index fdd7b90..0a7a14b 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -11,6 +11,8 @@ #include <linux/hardirq.h> #include <linux/sched.h> #include <linux/irqflags.h> +#include <linux/smp.h> +#include <linux/percpu.h> #include <asm/atomic.h> #include <asm/ptrace.h> #include <asm/system.h> @@ -272,6 +274,25 @@ extern void softirq_init(void); extern void raise_softirq_irqoff(unsigned int nr); extern void raise_softirq(unsigned int nr); +/* This is the worklist that queues up per-cpu softirq work. + * + * send_remote_sendirq() adds work to these lists, and + * the softirq handler itself dequeues from them. The queues + * are protected by disabling local cpu interrupts and they must + * only be accessed by the local cpu that they are for. + */ +DECLARE_PER_CPU(struct list_head [NR_SOFTIRQ], softirq_work_list); + +/* Try to send a softirq to a remote cpu. If this cannot be done, the + * work will be queued to the local cpu. + */ +extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softirq); + +/* Like send_remote_softirq(), but the caller must disable local cpu interrupts + * and compute the current cpu, passed in as 'this_cpu'. + */ +extern void __send_remote_softirq(struct call_single_data *cp, int cpu, + int this_cpu, int softirq); /* Tasklets --- multithreaded analogue of BHs. 
diff --git a/include/linux/smp.h b/include/linux/smp.h index 66484d4..2e4d58b 100644 --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -7,6 +7,7 @@ */ #include <linux/errno.h> +#include <linux/types.h> #include <linux/list.h> #include <linux/cpumask.h> @@ -16,7 +17,8 @@ struct call_single_data { struct list_head list; void (*func) (void *info); void *info; - unsigned int flags; + u16 flags; + u16 priv; }; #ifdef CONFIG_SMP diff --git a/kernel/softirq.c b/kernel/softirq.c index 27642a2..77aba5e 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -6,6 +6,8 @@ * Distribute under GPLv2. * * Rewritten. Old one was good in 2.2, but in 2.3 it was immoral. --ANK (990903) + * + * Remote softirq infrastructure is by Jens Axboe. */ #include <linux/module.h> @@ -463,17 +465,144 @@ void tasklet_kill(struct tasklet_struct *t) EXPORT_SYMBOL(tasklet_kill); +DEFINE_PER_CPU(struct list_head [NR_SOFTIRQ], softirq_work_list); +EXPORT_PER_CPU_SYMBOL(softirq_work_list); + +static void __local_trigger(struct call_single_data *cp, int softirq) +{ + struct list_head *head = &__get_cpu_var(softirq_work_list[softirq]); + + list_add_tail(&cp->list, head); + + /* Trigger the softirq only if the list was previously empty. */ + if (head->next == &cp->list) + raise_softirq_irqoff(softirq); +} + +#ifdef CONFIG_USE_GENERIC_SMP_HELPERS +static void remote_softirq_receive(void *data) +{ + struct call_single_data *cp = data; + unsigned long flags; + int softirq; + + softirq = cp->priv; + + local_irq_save(flags); + __local_trigger(cp, softirq); + local_irq_restore(flags); +} + +static int __try_remote_softirq(struct call_single_data *cp, int cpu, int softirq) +{ + if (cpu_online(cpu)) { + cp->func = remote_softirq_receive; + cp->info = cp; + cp->flags = 0; + cp->priv = softirq; + + __smp_call_function_single(cpu, cp); + return 0; + } + return 1; +} +#else /* CONFIG_USE_GENERIC_SMP_HELPERS */ +static int __try_remote_softirq(struct call_single_data *cp, int cpu, int softirq) +{ + return 1; +} +#endif + +/** + * __send_remote_softirq - try to schedule softirq work on a remote cpu + * @cp: private SMP call function data area + * @cpu: the remote cpu + * @this_cpu: the currently executing cpu + * @softirq: the softirq for the work + * + * Attempt to schedule softirq work on a remote cpu. If this cannot be + * done, the work is instead queued up on the local cpu. + * + * Interrupts must be disabled. + */ +void __send_remote_softirq(struct call_single_data *cp, int cpu, int this_cpu, int softirq) +{ + if (cpu == this_cpu || __try_remote_softirq(cp, cpu, softirq)) + __local_trigger(cp, softirq); +} +EXPORT_SYMBOL(__send_remote_softirq); + +/** + * send_remote_softirq - try to schedule softirq work on a remote cpu + * @cp: private SMP call function data area + * @cpu: the remote cpu + * @softirq: the softirq for the work + * + * Like __send_remote_softirq except that disabling interrupts and + * computing the current cpu is done for the caller. 
+ */ +void send_remote_softirq(struct call_single_data *cp, int cpu, int softirq) +{ + unsigned long flags; + int this_cpu; + + local_irq_save(flags); + this_cpu = smp_processor_id(); + __send_remote_softirq(cp, cpu, this_cpu, softirq); + local_irq_restore(flags); +} +EXPORT_SYMBOL(send_remote_softirq); + +static int __cpuinit remote_softirq_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + /* + * If a CPU goes away, splice its entries to the current CPU + * and trigger a run of the softirq + */ + if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) { + int cpu = (unsigned long) hcpu; + int i; + + local_irq_disable(); + for (i = 0; i < NR_SOFTIRQ; i++) { + struct list_head *head = &per_cpu(softirq_work_list[i], cpu); + struct list_head *local_head; + + if (list_empty(head)) + continue; + + local_head = &__get_cpu_var(softirq_work_list[i]); + list_splice_init(head, local_head); + raise_softirq_irqoff(i); + } + local_irq_enable(); + } + + return NOTIFY_OK; +} + +static struct notifier_block __cpuinitdata remote_softirq_cpu_notifier = { + .notifier_call = remote_softirq_cpu_notify, +}; + void __init softirq_init(void) { int cpu; for_each_possible_cpu(cpu) { + int i; + per_cpu(tasklet_vec, cpu).tail = &per_cpu(tasklet_vec, cpu).head; per_cpu(tasklet_hi_vec, cpu).tail = &per_cpu(tasklet_hi_vec, cpu).head; + for (i = 0; i < NR_SOFTIRQ; i++) + INIT_LIST_HEAD(&per_cpu(softirq_work_list[i], cpu)); } + register_hotcpu_notifier(&remote_softirq_cpu_notifier); + open_softirq(TASKLET_SOFTIRQ, tasklet_action); open_softirq(HI_SOFTIRQ, tasklet_hi_action); } -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 26+ messages in thread