From mboxrd@z Thu Jan 1 00:00:00 1970 From: John Fastabend Subject: Re: [net PATCH] net: sched, fix OOO packets with pfifo_fast Date: Sat, 24 Mar 2018 15:10:29 -0700 Message-ID: <32acd1d0-48b0-eaff-c971-9febdd33fcbd@gmail.com> References: <20180324201338.7661.44440.stgit@john-Precision-Tower-5810> <71d325ad-5270-0491-b858-75674d9fe09a@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Cc: netdev@vger.kernel.org To: Eric Dumazet , xiyou.wangcong@gmail.com, jiri@resnulli.us, davem@davemloft.net Return-path: Received: from mail-pl0-f67.google.com ([209.85.160.67]:37580 "EHLO mail-pl0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752229AbeCXWKt (ORCPT ); Sat, 24 Mar 2018 18:10:49 -0400 Received: by mail-pl0-f67.google.com with SMTP id w12-v6so9571041plp.4 for ; Sat, 24 Mar 2018 15:10:49 -0700 (PDT) In-Reply-To: <71d325ad-5270-0491-b858-75674d9fe09a@gmail.com> Content-Language: en-US Sender: netdev-owner@vger.kernel.org List-ID: On 03/24/2018 02:15 PM, Eric Dumazet wrote: > > > On 03/24/2018 01:13 PM, John Fastabend wrote: >> After the qdisc lock was dropped in pfifo_fast we allow multiple >> enqueue threads and dequeue threads to run in parallel. On the >> enqueue side the skb bit ooo_okay is used to ensure all related >> skbs are enqueued in-order. On the dequeue side though there is >> no similar logic. What we observe is with fewer queues than CPUs >> it is possible to re-order packets when two instances of >> __qdisc_run() are running in parallel. Each thread will dequeue >> a skb and then whichever thread calls the ndo op first will >> be sent on the wire. This doesn't typically happen because >> qdisc_run() is usually triggered by the same core that did the >> enqueue. However, drivers will trigger __netif_schedule() >> when queues are transitioning from stopped to awake using the >> netif_tx_wake_* APIs. When this happens netif_schedule() calls >> qdisc_run() on the same CPU that did the netif_tx_wake_* which >> is usually done in the interrupt completion context. This CPU >> is selected with the irq affinity which is unrelated to the >> enqueue operations. >> >> To resolve this we add a RUNNING bit to the qdisc to ensure >> only a single dequeue per qdisc is running. Enqueue and dequeue >> operations can still run in parallel and also on multi queue >> NICs we can still have a dequeue in-flight per qdisc, which >> is typically per CPU. >> >> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array") >> Reported-by: Jakob Unterwurzacher >> Signed-off-by: John Fastabend >> --- >> include/net/sch_generic.h | 1 + >> net/sched/sch_generic.c | 13 ++++++++++--- >> 2 files changed, 11 insertions(+), 3 deletions(-) >> >> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h >> index 2092d33..8da3267 100644 >> --- a/include/net/sch_generic.h >> +++ b/include/net/sch_generic.h >> @@ -30,6 +30,7 @@ struct qdisc_rate_table { >> enum qdisc_state_t { >> __QDISC_STATE_SCHED, >> __QDISC_STATE_DEACTIVATED, >> + __QDISC_STATE_RUNNING, >> }; >> >> struct qdisc_size_table { >> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c >> index 7e3fbe9..29a1b47 100644 >> --- a/net/sched/sch_generic.c >> +++ b/net/sched/sch_generic.c >> @@ -377,12 +377,17 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets) >> struct netdev_queue *txq; >> struct net_device *dev; >> struct sk_buff *skb; >> - bool validate; >> + bool more, validate; >> >> /* Dequeue packet */ >> + if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) >> + return false; >> + [...] > > This adds a pair of atomic operations in fast path, only for pfifo_fast sake. > Yeah, we can wrap these in a `if (TCQ_F_NOLOCK)` to avoid it in cases its not needed. Alternatively, for net we could turn off NOLOCK in pfifo_fast and fix it net-next with something more complete. > qdisc_restart() name is misleading, this is used from __qdisc_run() > > I'll change it in net-next. .John