From: John Fastabend <john.fastabend@gmail.com>
To: Paolo Abeni <pabeni@redhat.com>, Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
Jiri Pirko <jiri@resnulli.us>, David Miller <davem@davemloft.net>,
Linux Kernel Network Developers <netdev@vger.kernel.org>
Subject: Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast
Date: Wed, 18 Apr 2018 09:44:28 -0700 [thread overview]
Message-ID: <36a89ed1-d6ff-ddad-c736-4e68909d61c4@gmail.com> (raw)
In-Reply-To: <1524036512.2599.4.camel@redhat.com>
On 04/18/2018 12:28 AM, Paolo Abeni wrote:
> Hi,
>
> let me revive this old thread...
>
> On Mon, 2018-03-26 at 11:16 -0700, John Fastabend wrote:
>> On 03/26/2018 10:30 AM, Cong Wang wrote:
>>> On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
>>> <john.fastabend@gmail.com> wrote:
>>>> After the qdisc lock was dropped in pfifo_fast we allow multiple
>>>> enqueue threads and dequeue threads to run in parallel. On the
>>>> enqueue side the skb bit ooo_okay is used to ensure all related
>>>> skbs are enqueued in-order. On the dequeue side though there is
>>>> no similar logic. What we observe is with fewer queues than CPUs
>>>> it is possible to re-order packets when two instances of
>>>> __qdisc_run() are running in parallel. Each thread will dequeue
>>>> a skb and then whichever thread calls the ndo op first will
>>>> be sent on the wire. This doesn't typically happen because
>>>> qdisc_run() is usually triggered by the same core that did the
>>>> enqueue. However, drivers will trigger __netif_schedule()
>>>> when queues are transitioning from stopped to awake using the
>>>> netif_tx_wake_* APIs. When this happens netif_schedule() calls
>>>> qdisc_run() on the same CPU that did the netif_tx_wake_* which
>>>> is usually done in the interrupt completion context. This CPU
>>>> is selected with the irq affinity which is unrelated to the
>>>> enqueue operations.
>>>
>>> Interesting. Why this is unique to pfifo_fast? For me it could
>>> happen to other qdisc's too, when we release the qdisc root
>>> lock in sch_direct_xmit(), another CPU could dequeue from
>>> the same qdisc and transmit the skb in parallel too?
>>>
>>
>> Agreed, my guess is it never happens because the timing is
>> tighter in the lock case. Or if it is happening its infrequent
>> enough that no one noticed the OOO packets.
>
> I think the above could not happend due to the qdisc seqlock - which is
> not acquired by NOLOCK qdiscs.
>
Yep, seems to be the case.
>> For net-next we probably could clean this up. I was just
>> going for something simple in net that didn't penalize all
>> qdiscs as Eric noted. This patch doesn't make it any worse
>> at least. And we have been living with the above race for
>> years.
>
> I've benchmarked this patch is some different scenario, and in my
> testing it introduces a measurable regression in uncontended/lightly
> contended scenarios. The measured peak negative delta is with a pktgen
> thread using "xmit_mode queue_xmit":
>
> before: 27674032 pps
> after: 23809052 pps
Yeah more atomic ops :/
>
> I spend some time searching a way to improve this, without success.
>
> John, did you had any chance to look at this again?
>
If we have a multiple cores pulling from the same skb list and
feeding the same txq this happens. One problem is even if the
normal dev_queue_xmit path is aligned drivers call netif_schedule
from interrupt context and that happens on an arbitrary a cpu. When
the arbitrary cpu runs the netif_schedule logic it will dequeue
from the skb list using the cpu it was scheduled on.
The lockless case is not _really_ lockless after this patch we
have managed to pull apart the enqueue and dequeue serialization
though.
Thanks for bringing this up. I'll think about it for a bit maybe
there is something we can do here. There is a set of conditions
that if met we can run without the lock. Possibly ONETXQUEUE and
aligned cpu_map is sufficient. We could detect this case and drop
the locking. For existing systems and high Gbps NICs I think (feel
free to correct me) assuming a core per cpu is OK. At some point
though we probably need to revisit this assumption.
.John
> Thanks,
>
> Paolo
>
next prev parent reply other threads:[~2018-04-18 16:44 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-03-25 5:25 [net PATCH v2] net: sched, fix OOO packets with pfifo_fast John Fastabend
2018-03-26 16:36 ` David Miller
2018-03-26 17:10 ` John Fastabend
2018-03-26 17:30 ` Cong Wang
2018-03-26 18:16 ` John Fastabend
2018-04-18 7:28 ` Paolo Abeni
2018-04-18 16:44 ` John Fastabend [this message]
2018-04-19 8:00 ` Paolo Abeni
2018-05-08 16:17 ` Paolo Abeni
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=36a89ed1-d6ff-ddad-c736-4e68909d61c4@gmail.com \
--to=john.fastabend@gmail.com \
--cc=davem@davemloft.net \
--cc=eric.dumazet@gmail.com \
--cc=jiri@resnulli.us \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=xiyou.wangcong@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).