netdev.vger.kernel.org archive mirror
* root_lock vs. device's TX lock
@ 2011-11-17 16:10 Tom Herbert
  2011-11-17 16:34 ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Tom Herbert @ 2011-11-17 16:10 UTC (permalink / raw)
  To: Linux Netdev List

From sch_direct_xmit:

        /* And release qdisc */
        spin_unlock(root_lock);

        HARD_TX_LOCK(dev, txq, smp_processor_id());
        if (!netif_tx_queue_frozen_or_stopped(txq))
                ret = dev_hard_start_xmit(skb, dev, txq);

        HARD_TX_UNLOCK(dev, txq);

        spin_lock(root_lock);

This is a lot of lock manipulation, basically switching from one lock
to another and possibly thrashing, just to send a packet.  I am
thinking that if there is a 1-1 correspondence between qdisc and
device queue, then we could actually use the device's lock as the
root lock for the qdisc.  In that case, we would not need to touch
any locks from sch_direct_xmit (just hold the root lock, which is
already the device lock, for the duration).
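
Concretely, if root_lock and the txq lock were one and the same
spinlock, the excerpt above could collapse to something like the
following (untested sketch, 3.1-era names, requeue/NETDEV_TX_BUSY
handling omitted):

        /* root_lock is the device TX queue lock in this scheme and is
         * already held by the caller, so there is no unlock/relock
         * dance: just transmit while still holding it.
         */
        if (!netif_tx_queue_frozen_or_stopped(txq))
                ret = dev_hard_start_xmit(skb, dev, txq);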

Is there any reason why this couldn't work?


* Re: root_lock vs. device's TX lock
  2011-11-17 16:10 root_lock vs. device's TX lock Tom Herbert
@ 2011-11-17 16:34 ` Eric Dumazet
  2011-11-17 17:26   ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2011-11-17 16:34 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List

On Thursday, 17 November 2011 at 08:10 -0800, Tom Herbert wrote:
> From sch_direct_xmit:
> 
>         /* And release qdisc */
>         spin_unlock(root_lock);
> 
>         HARD_TX_LOCK(dev, txq, smp_processor_id());
>         if (!netif_tx_queue_frozen_or_stopped(txq))
>                 ret = dev_hard_start_xmit(skb, dev, txq);
> 
>         HARD_TX_UNLOCK(dev, txq);
> 
>         spin_lock(root_lock);
> 
> This is a lot of lock manipulation, basically switching from one lock
> to another and possibly thrashing, just to send a packet.  I am
> thinking that if there is a 1-1 correspondence between qdisc and
> device queue, then we could actually use the device's lock as the
> root lock for the qdisc.  In that case, we would not need to touch
> any locks from sch_direct_xmit (just hold the root lock, which is
> already the device lock, for the duration).
> 
> Is there any reason why this couldn't work?

But we have to dirty part of Qdisc anyway ?
(state, bstats, q, ...)


* Re: root_lock vs. device's TX lock
  2011-11-17 16:34 ` Eric Dumazet
@ 2011-11-17 17:26   ` Eric Dumazet
  2011-11-17 18:19     ` Dave Taht
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2011-11-17 17:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List

On Thursday, 17 November 2011 at 17:34 +0100, Eric Dumazet wrote:
> On Thursday, 17 November 2011 at 08:10 -0800, Tom Herbert wrote:
> > From sch_direct_xmit:
> > 
> >         /* And release qdisc */
> >         spin_unlock(root_lock);
> > 
> >         HARD_TX_LOCK(dev, txq, smp_processor_id());
> >         if (!netif_tx_queue_frozen_or_stopped(txq))
> >                 ret = dev_hard_start_xmit(skb, dev, txq);
> > 
> >         HARD_TX_UNLOCK(dev, txq);
> > 
> >         spin_lock(root_lock);
> > 
> > This is a lot of lock manipulation, basically switching from one lock
> > to another and possibly thrashing, just to send a packet.  I am
> > thinking that if there is a 1-1 correspondence between qdisc and
> > device queue, then we could actually use the device's lock as the
> > root lock for the qdisc.  In that case, we would not need to touch
> > any locks from sch_direct_xmit (just hold the root lock, which is
> > already the device lock, for the duration).
> > 
> > Is there any reason why this couldn't work?
> 
> But we have to dirty part of Qdisc anyway ?
> (state, bstats, q, ...)
> 

Also we want to permit other cpus to enqueue packets to Qdisc while our
cpu is busy in network driver ndo_start_xmit()

For complex Qdisc / tc setups (potentially touching a lot of cache
lines), we could eventually add a small ring buffer so that the cpu
doing the ndo_start_xmit() also queues the packets into Qdisc.

This ringbuffer could use a lockless algo. (we currently use the
secondary 'busylock' to serialize other cpus, but each cpu calls qdisc
enqueue itself.)
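
Purely to show the shape of the idea (hypothetical code, nothing like
this exists in the tree, and memory ordering / the fallback path are
only waved at): producers reserve a slot and publish their skb, and
only the cpu owning __QDISC___STATE_RUNNING drains the ring into
q->enqueue() while it already holds the qdisc lock.

#define XMIT_RING_SIZE 64       /* power of two, purely illustrative */

struct xmit_ring {
        struct sk_buff  *slot[XMIT_RING_SIZE];
        unsigned long   head;   /* advanced by producers (any cpu) */
        unsigned long   tail;   /* advanced only by the draining cpu */
};

/* Producer side: called instead of spinning on the qdisc lock. */
static bool xmit_ring_produce(struct xmit_ring *r, struct sk_buff *skb)
{
        unsigned long head = ACCESS_ONCE(r->head);

        if (head - ACCESS_ONCE(r->tail) >= XMIT_RING_SIZE)
                return false;   /* full: fall back to the qdisc lock */
        if (cmpxchg(&r->head, head, head + 1) != head)
                return false;   /* lost the race: caller falls back too */
        smp_wmb();              /* skb stores visible before publishing it */
        r->slot[head & (XMIT_RING_SIZE - 1)] = skb;
        return true;
}

/* Consumer side: only the cpu owning __QDISC___STATE_RUNNING, with the
 * qdisc lock held, so calling q->enqueue() directly is safe here.
 */
static void xmit_ring_drain(struct xmit_ring *r, struct Qdisc *q)
{
        while (r->tail != ACCESS_ONCE(r->head)) {
                unsigned long idx = r->tail & (XMIT_RING_SIZE - 1);
                struct sk_buff *skb = ACCESS_ONCE(r->slot[idx]);

                if (!skb)
                        break;  /* reserved but not yet published */
                r->slot[idx] = NULL;
                smp_wmb();      /* clear the slot before re-exposing it */
                r->tail++;
                q->enqueue(skb, q);     /* return code handling omitted */
        }
}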


* Re: root_lock vs. device's TX lock
  2011-11-17 17:26   ` Eric Dumazet
@ 2011-11-17 18:19     ` Dave Taht
  2011-11-17 20:55       ` Andy Fleming
  2011-11-21 21:04       ` David Miller
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Taht @ 2011-11-17 18:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, Linux Netdev List

On Thu, Nov 17, 2011 at 6:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thursday, 17 November 2011 at 17:34 +0100, Eric Dumazet wrote:
>> On Thursday, 17 November 2011 at 08:10 -0800, Tom Herbert wrote:
>> > From sch_direct_xmit:
>> >
>> >         /* And release qdisc */
>> >         spin_unlock(root_lock);
>> >
>> >         HARD_TX_LOCK(dev, txq, smp_processor_id());
>> >         if (!netif_tx_queue_frozen_or_stopped(txq))
>> >                 ret = dev_hard_start_xmit(skb, dev, txq);
>> >
>> >         HARD_TX_UNLOCK(dev, txq);
>> >
>> >         spin_lock(root_lock);
>> >
>> > This is a lot of lock manipulation, basically switching from one lock
>> > to another and possibly thrashing, just to send a packet.  I am
>> > thinking that if there is a 1-1 correspondence between qdisc and
>> > device queue, then we could actually use the device's lock as the
>> > root lock for the qdisc.  In that case, we would not need to touch
>> > any locks from sch_direct_xmit (just hold the root lock, which is
>> > already the device lock, for the duration).
>> >
>> > Is there any reason why this couldn't work?
>>
>> But we have to dirty part of Qdisc anyway ?
>> (state, bstats, q, ...)
>>
>
> Also we want to permit other cpus to enqueue packets to Qdisc while our
> cpu is busy in network driver ndo_start_xmit()
>
> For complex Qdisc / tc setups (potentially touching a lot of cache
> lines), we could eventually add a small ring buffer so that the cpu
> doing the ndo_start_xmit() also queues the packets into Qdisc.
>
> This ringbuffer could use a lockless algo. (we currently use the
> secondary 'busylock' to serialize other cpus, but each cpu calls qdisc
> enqueue itself.)

I was thinking ringbuffering might also help in adding a 'grouper'
abstraction to the dequeuing side.



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net


* Re: root_lock vs. device's TX lock
  2011-11-17 18:19     ` Dave Taht
@ 2011-11-17 20:55       ` Andy Fleming
  2011-11-18  0:35         ` Tom Herbert
  2011-11-21 21:04       ` David Miller
  1 sibling, 1 reply; 8+ messages in thread
From: Andy Fleming @ 2011-11-17 20:55 UTC (permalink / raw)
  To: Dave Taht; +Cc: Eric Dumazet, Tom Herbert, Linux Netdev List

On Thu, Nov 17, 2011 at 12:19 PM, Dave Taht <dave.taht@gmail.com> wrote:
> On Thu, Nov 17, 2011 at 6:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Thursday, 17 November 2011 at 17:34 +0100, Eric Dumazet wrote:
>>> On Thursday, 17 November 2011 at 08:10 -0800, Tom Herbert wrote:
>>> > From sch_direct_xmit:
>>> >
>>> >         /* And release qdisc */
>>> >         spin_unlock(root_lock);
>>> >
>>> >         HARD_TX_LOCK(dev, txq, smp_processor_id());
>>> >         if (!netif_tx_queue_frozen_or_stopped(txq))
>>> >                 ret = dev_hard_start_xmit(skb, dev, txq);
>>> >
>>> >         HARD_TX_UNLOCK(dev, txq);
>>> >
>>> >         spin_lock(root_lock);
>>> >
>>> > This is a lot of lock manipulation, basically switching from one lock
>>> > to another and possibly thrashing, just to send a packet.  I am
>>> > thinking that if there is a 1-1 correspondence between qdisc and
>>> > device queue, then we could actually use the device's lock as the
>>> > root lock for the qdisc.  In that case, we would not need to touch
>>> > any locks from sch_direct_xmit (just hold the root lock, which is
>>> > already the device lock, for the duration).
>>> >
>>> > Is there any reason why this couldn't work?
>>>
>>> But we have to dirty part of Qdisc anyway ?
>>> (state, bstats, q, ...)
>>>
>>
>> Also we want to permit other cpus to enqueue packets to Qdisc while our
>> cpu is busy in network driver ndo_start_xmit()
>>
>> For complex Qdisc / tc setups (potentially touching a lot of cache
>> lines), we could eventually add a small ring buffer so that the cpu
>> doing the ndo_start_xmit() also queues the packets into Qdisc.
>>
>> This ringbuffer could use a lockless algo. (we currently use the
>> secondary 'busylock' to serialize other cpus, but each cpu calls qdisc
>> enqueue itself.)
>
> I was thinking ringbuffering might also help in adding a 'grouper'
> abstraction to the dequeuing side.

Actually, I'm interested in circumventing *both* locks. Our SoC has
some quite-versatile queueing infrastructure, such that (for many
queueing setups) we can do all of the queueing in hardware, using
per-cpu access portals. By hacking around the qdisc lock, and using a
tx queue per core, we were able to achieve a significant speedup.

Right now, it's a big hack, but I'd like to find a way that we could
provide support for this. My initial thought was to craft a qdisc
implementation for our hardware, but the root_lock means doing so
would not yield any performance gain.

Andy


* Re: root_lock vs. device's TX lock
  2011-11-17 20:55       ` Andy Fleming
@ 2011-11-18  0:35         ` Tom Herbert
  2011-11-18  5:02           ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Tom Herbert @ 2011-11-18  0:35 UTC (permalink / raw)
  To: Andy Fleming; +Cc: Dave Taht, Eric Dumazet, Linux Netdev List

> Actually, I'm interested in circumventing *both* locks. Our SoC has
> some quite-versatile queueing infrastructure, such that (for many
> queueing setups) we can do all of the queueing in hardware, using
> per-cpu access portals. By hacking around the qdisc lock, and using a
> tx queue per core, we were able to achieve a significant speedup.
>
This was actually one of the motivations for my question.  If we have
one TX queue per core and use a trivial mq-aware qdisc, for instance,
the locking becomes mostly overhead.  I don't mind taking a lock once
per TX, but right now we're taking three! (root lock twice, and device
lock once).
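
Spelling out the three acquisitions for one packet on the non-empty
path (pseudo-trace of the 3.1-ish code; busylock and error handling
left out):

__dev_xmit_skb():
        spin_lock(root_lock);            /* #1: enqueue + qdisc_run_begin() */
        __qdisc_run() -> qdisc_restart() -> sch_direct_xmit():
                spin_unlock(root_lock);
                HARD_TX_LOCK(dev, txq, smp_processor_id());   /* #2: xmit */
                dev_hard_start_xmit(skb, dev, txq);
                HARD_TX_UNLOCK(dev, txq);
                spin_lock(root_lock);    /* #3: recheck qdisc/queue state */
        spin_unlock(root_lock);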

Even without one TX queue per core, I think the overhead savings may
still be there.  Eric, I realize that a point of dropping the root
lock in sch_direct_xmit is to allow queuing to the qdisc and the
device xmit to proceed in parallel, but if you're using a trivial
qdisc then the time in the qdisc may be << the time in the device
xmit, so the locking overhead could outweigh the gains in parallelism.
At the very least, this benefit is hugely variable depending on the
qdisc used.

Tom


* Re: root_lock vs. device's TX lock
  2011-11-18  0:35         ` Tom Herbert
@ 2011-11-18  5:02           ` Eric Dumazet
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2011-11-18  5:02 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Andy Fleming, Dave Taht, Linux Netdev List

On Thursday, 17 November 2011 at 16:35 -0800, Tom Herbert wrote:
> > Actually, I'm interested in circumventing *both* locks. Our SoC has
> > some quite-versatile queueing infrastructure, such that (for many
> > queueing setups) we can do all of the queueing in hardware, using
> > per-cpu access portals. By hacking around the qdisc lock, and using a
> > tx queue per core, we were able to achieve a significant speedup.

If packet reordering is not a concern (or is taken into account by the
hardware)...  A task sending a TCP flow can migrate from cpu1 to cpu2...

> This was actually one of the motivations for my question.  If we have
> one TX queue per core and use a trivial mq-aware qdisc, for instance,
> the locking becomes mostly overhead.  I don't mind taking a lock once
> per TX, but right now we're taking three! (root lock twice, and device
> lock once).
> 


> Even without one TX queue per core, I think the overhead savings may
> still be there.  Eric, I realize that a point of dropping the root
> lock in sch_direct_xmit is to allow queuing to the qdisc and the
> device xmit to proceed in parallel, but if you're using a trivial
> qdisc then the time in the qdisc may be << the time in the device
> xmit, so the locking overhead could outweigh the gains in parallelism.
> At the very least, this benefit is hugely variable depending on the
> qdisc used.

Andy speaks more of a way to bypass the qdisc (direct to device),
while I thought Tom wanted to optimize htb/cbq/... complex qdisc
handling.

Note we have the LLTX thing to avoid taking dev->lock for some devices.

We could have a qdisc->fast_enqueue() method for some (qdisc, device)
combinations, able to short-circuit __dev_xmit_skb() and do their own
thing (using RCU or another synchronization method, per-cpu
byte/packet stats, ...).
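
Something like this, just to show the shape (purely hypothetical, no
such Qdisc_ops member exists today):

static inline int dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                               struct net_device *dev,
                               struct netdev_queue *txq)
{
        /* Hypothetical new Qdisc_ops hook: a (qdisc, device) pair that
         * does its own synchronization (RCU, per-cpu stats, HW queues)
         * bypasses the root_lock handling entirely.
         */
        if (q->ops->fast_enqueue)
                return q->ops->fast_enqueue(skb, q, txq);

        return __dev_xmit_skb(skb, q, dev, txq);   /* existing locked path */
}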

But if the device is really that smart, why even use a qdisc in the
first place?

If a qdisc is not needed: we take only one lock (the dev lock) to
transmit a packet. If the device is LLTX, no lock is taken at all.

If a qdisc is needed: we need to take the qdisc lock to enqueue the
packet.  Then, _if_ we own the __QDISC___STATE_RUNNING flag, we enter
the loop that dequeues and xmits packets (__qdisc_run() /
sch_direct_xmit() doing the insane lock flips).

It's a tough choice: do we want to avoid false sharing in the device
itself, or in the qdisc?

The current handling was designed so that one cpu feeds the device
while the other cpus feed the qdisc.


* Re: root_lock vs. device's TX lock
  2011-11-17 18:19     ` Dave Taht
  2011-11-17 20:55       ` Andy Fleming
@ 2011-11-21 21:04       ` David Miller
  1 sibling, 0 replies; 8+ messages in thread
From: David Miller @ 2011-11-21 21:04 UTC (permalink / raw)
  To: dave.taht; +Cc: eric.dumazet, therbert, netdev

From: Dave Taht <dave.taht@gmail.com>
Date: Thu, 17 Nov 2011 19:19:58 +0100

> On Thu, Nov 17, 2011 at 6:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> For complex Qdisc / tc setups (potentially touching a lot of cache
>> lines), we could eventually add a small ring buffer so that the cpu
>> doing the ndo_start_xmit() also queues the packets into Qdisc.
>>
>> This ringbuffer could use a lockless algo. (we currently use the
>> secondary 'busylock' to serialize other cpus, but each cpu calls qdisc
>> enqueue itself.)
> 
> I was thinking ringbuffering might also help in adding a 'grouper'
> abstraction to the dequeuing side.

I fully support the idea of batching between these different levels of
the TX path, but keep in mind that this is only really going to work
well for simple qdiscs.


end of thread

Thread overview: 8+ messages
2011-11-17 16:10 root_lock vs. device's TX lock Tom Herbert
2011-11-17 16:34 ` Eric Dumazet
2011-11-17 17:26   ` Eric Dumazet
2011-11-17 18:19     ` Dave Taht
2011-11-17 20:55       ` Andy Fleming
2011-11-18  0:35         ` Tom Herbert
2011-11-18  5:02           ` Eric Dumazet
2011-11-21 21:04       ` David Miller
