netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: John Fastabend <john.fastabend@gmail.com>
To: Paolo Abeni <pabeni@redhat.com>, Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
	Jiri Pirko <jiri@resnulli.us>, David Miller <davem@davemloft.net>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>
Subject: Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast
Date: Wed, 18 Apr 2018 09:44:28 -0700	[thread overview]
Message-ID: <36a89ed1-d6ff-ddad-c736-4e68909d61c4@gmail.com> (raw)
In-Reply-To: <1524036512.2599.4.camel@redhat.com>

On 04/18/2018 12:28 AM, Paolo Abeni wrote:
> Hi,
> 
> let me revive this old thread...
> 
> On Mon, 2018-03-26 at 11:16 -0700, John Fastabend wrote:
>> On 03/26/2018 10:30 AM, Cong Wang wrote:
>>> On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
>>> <john.fastabend@gmail.com> wrote:
>>>> After the qdisc lock was dropped in pfifo_fast we allow multiple
>>>> enqueue threads and dequeue threads to run in parallel. On the
>>>> enqueue side the skb bit ooo_okay is used to ensure all related
>>>> skbs are enqueued in-order. On the dequeue side though there is
>>>> no similar logic. What we observe is with fewer queues than CPUs
>>>> it is possible to re-order packets when two instances of
>>>> __qdisc_run() are running in parallel. Each thread will dequeue
>>>> a skb and then whichever thread calls the ndo op first will
>>>> be sent on the wire. This doesn't typically happen because
>>>> qdisc_run() is usually triggered by the same core that did the
>>>> enqueue. However, drivers will trigger __netif_schedule()
>>>> when queues are transitioning from stopped to awake using the
>>>> netif_tx_wake_* APIs. When this happens netif_schedule() calls
>>>> qdisc_run() on the same CPU that did the netif_tx_wake_* which
>>>> is usually done in the interrupt completion context. This CPU
>>>> is selected with the irq affinity which is unrelated to the
>>>> enqueue operations.
>>>
>>> Interesting. Why this is unique to pfifo_fast? For me it could
>>> happen to other qdisc's too, when we release the qdisc root
>>> lock in sch_direct_xmit(), another CPU could dequeue from
>>> the same qdisc and transmit the skb in parallel too?
>>>
>>
>> Agreed, my guess is it never happens because the timing is
>> tighter in the lock case. Or if it is happening its infrequent
>> enough that no one noticed the OOO packets.
> 
> I think the above could not happend due to the qdisc seqlock - which is
> not acquired by NOLOCK qdiscs.
> 

Yep, seems to be the case.

>> For net-next we probably could clean this up. I was just
>> going for something simple in net that didn't penalize all
>> qdiscs as Eric noted. This patch doesn't make it any worse
>> at least. And we have been living with the above race for
>> years.
> 
> I've benchmarked this patch is some different scenario, and in my
> testing it introduces a measurable regression in uncontended/lightly
> contended scenarios. The measured peak negative delta is with a pktgen
> thread using "xmit_mode queue_xmit":
> 
> before: 27674032 pps
> after: 23809052 pps

Yeah more atomic ops :/

> 
> I spend some time searching a way to improve this, without success.
> 
> John, did you had any chance to look at this again?
> 

If we have a multiple cores pulling from the same skb list and
feeding the same txq this happens. One problem is even if the
normal dev_queue_xmit path is aligned drivers call netif_schedule
from interrupt context and that happens on an arbitrary a cpu. When
the arbitrary cpu runs the netif_schedule logic it will dequeue
from the skb list using the cpu it was scheduled on.

The lockless case is not _really_ lockless after this patch we
have managed to pull apart the enqueue and dequeue serialization
though.

Thanks for bringing this up. I'll think about it for a bit maybe
there is something we can do here. There is a set of conditions
that if met we can run without the lock. Possibly ONETXQUEUE and
aligned cpu_map is sufficient. We could detect this case and drop
the locking. For existing systems and high Gbps NICs I think (feel
free to correct me) assuming a core per cpu is OK. At some point
though we probably need to revisit this assumption.

.John

> Thanks,
> 
> Paolo
> 

  reply	other threads:[~2018-04-18 16:44 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-25  5:25 [net PATCH v2] net: sched, fix OOO packets with pfifo_fast John Fastabend
2018-03-26 16:36 ` David Miller
2018-03-26 17:10   ` John Fastabend
2018-03-26 17:30 ` Cong Wang
2018-03-26 18:16   ` John Fastabend
2018-04-18  7:28     ` Paolo Abeni
2018-04-18 16:44       ` John Fastabend [this message]
2018-04-19  8:00         ` Paolo Abeni
2018-05-08 16:17         ` Paolo Abeni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=36a89ed1-d6ff-ddad-c736-4e68909d61c4@gmail.com \
    --to=john.fastabend@gmail.com \
    --cc=davem@davemloft.net \
    --cc=eric.dumazet@gmail.com \
    --cc=jiri@resnulli.us \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).