From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from [75.106.27.153] ([75.106.27.153]:35060 "EHLO
	john-Precision-Tower-5810" rhost-flags-FAIL-FAIL-OK-FAIL)
	by vger.kernel.org with ESMTP id S1753232AbeCXUNo (ORCPT );
	Sat, 24 Mar 2018 16:13:44 -0400
Subject: [net PATCH] net: sched, fix OOO packets with pfifo_fast
From: John Fastabend
To: xiyou.wangcong@gmail.com, jiri@resnulli.us, davem@davemloft.net
Cc: netdev@vger.kernel.org
Date: Sat, 24 Mar 2018 13:13:38 -0700
Message-ID: <20180324201338.7661.44440.stgit@john-Precision-Tower-5810>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: netdev-owner@vger.kernel.org
List-ID:

After the qdisc lock was dropped in pfifo_fast we allow multiple
enqueue threads and dequeue threads to run in parallel. On the
enqueue side the skb bit ooo_okay is used to ensure all related
skbs are enqueued in-order. On the dequeue side though there is
no similar logic. What we observe is that, with fewer queues than
CPUs, packets can be re-ordered when two instances of __qdisc_run()
are running in parallel. Each thread dequeues an skb, and whichever
thread calls the ndo op first gets its skb onto the wire first,
regardless of the order in which the skbs were dequeued.

This doesn't typically happen because qdisc_run() is usually
triggered by the same core that did the enqueue. However, drivers
will trigger __netif_schedule() when queues are transitioning from
stopped to awake using the netif_tx_wake_* APIs. When this happens
__netif_schedule() calls qdisc_run() on the same CPU that did the
netif_tx_wake_*, which is usually done in the interrupt completion
context. This CPU is selected by the irq affinity, which is
unrelated to the enqueue operations.

To resolve this we add a RUNNING bit to the qdisc to ensure only a
single dequeue per qdisc is running. Enqueue and dequeue operations
can still run in parallel, and on multi-queue NICs we can still
have a dequeue in-flight per qdisc, which is typically per CPU.

Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
Reported-by: Jakob Unterwurzacher
Signed-off-by: John Fastabend
---
 include/net/sch_generic.h |    1 +
 net/sched/sch_generic.c   |   13 ++++++++++---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 2092d33..8da3267 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -30,6 +30,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_RUNNING,
 };
 
 struct qdisc_size_table {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7e3fbe9..29a1b47 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -377,12 +377,17 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 	struct netdev_queue *txq;
 	struct net_device *dev;
 	struct sk_buff *skb;
-	bool validate;
+	bool more, validate;
 
 	/* Dequeue packet */
+	if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+		return false;
+
 	skb = dequeue_skb(q, &validate, packets);
-	if (unlikely(!skb))
+	if (unlikely(!skb)) {
+		clear_bit(__QDISC_STATE_RUNNING, &q->state);
 		return false;
+	}
 
 	if (!(q->flags & TCQ_F_NOLOCK))
 		root_lock = qdisc_lock(q);
@@ -390,7 +395,9 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 	dev = qdisc_dev(q);
 	txq = skb_get_tx_queue(dev, skb);
 
-	return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+	more = sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+	clear_bit(__QDISC_STATE_RUNNING, &q->state);
+	return more;
 }
 
 void __qdisc_run(struct Qdisc *q)
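
For anyone who wants to look at the guard in isolation, below is a
minimal user-space sketch of the RUNNING-bit pattern this patch adds to
qdisc_restart(). It is not kernel code and is not part of the patch.
Every sketch_* name, the fixed-size queue, and the use of C11
atomic_fetch_or()/atomic_fetch_and() as stand-ins for
test_and_set_bit()/clear_bit() are assumptions made for illustration
only. The return value of sketch_restart() only approximates what
sch_direct_xmit() reports. The second CPU is simulated by setting the
bit by hand rather than by a real thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins; bit 2 matches the enum position in the patch. */
enum { SKETCH_STATE_RUNNING = 2, SKETCH_QLEN = 8 };

struct sketch_qdisc {
	atomic_ulong state;       /* plays the role of q->state */
	int queue[SKETCH_QLEN];   /* plain array, no wraparound */
	int head, tail;
};

static void sketch_init(struct sketch_qdisc *q)
{
	atomic_init(&q->state, 0UL);
	q->head = q->tail = 0;
}

static bool sketch_enqueue(struct sketch_qdisc *q, int pkt)
{
	if (q->tail == SKETCH_QLEN)
		return false;
	q->queue[q->tail++] = pkt;
	return true;
}

/* Atomically set bit nr and report whether it was already set. */
static bool sketch_test_and_set_bit(int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(addr, mask) & mask) != 0;
}

static void sketch_clear_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_and(addr, ~(1UL << nr));
}

/*
 * Same shape as the patched qdisc_restart(): back off if another
 * dequeuer holds RUNNING, and clear the bit on every exit path.
 * Returns true only if a packet was sent and more remain queued.
 */
static bool sketch_restart(struct sketch_qdisc *q, int *sent)
{
	bool more;

	if (sketch_test_and_set_bit(SKETCH_STATE_RUNNING, &q->state))
		return false;

	if (q->head == q->tail) {
		sketch_clear_bit(SKETCH_STATE_RUNNING, &q->state);
		return false;
	}

	*sent = q->queue[q->head++];   /* stands in for the xmit step */
	more = q->head != q->tail;
	sketch_clear_bit(SKETCH_STATE_RUNNING, &q->state);
	return more;
}

/* Walks through one contended and two uncontended dequeue attempts. */
int sketch_demo(void)
{
	struct sketch_qdisc q;
	int sent = -1;
	bool more;

	sketch_init(&q);
	sketch_enqueue(&q, 10);
	sketch_enqueue(&q, 20);

	/* Pretend another CPU is in the middle of a dequeue. */
	sketch_test_and_set_bit(SKETCH_STATE_RUNNING, &q.state);
	more = sketch_restart(&q, &sent);
	assert(!more && sent == -1);   /* we backed off, nothing was sent */

	/* The other CPU finishes; packets now leave in FIFO order. */
	sketch_clear_bit(SKETCH_STATE_RUNNING, &q.state);
	more = sketch_restart(&q, &sent);
	assert(more && sent == 10);
	more = sketch_restart(&q, &sent);
	assert(!more && sent == 20);
	return 0;
}
```

Calling sketch_demo() from a main() exercises the three cases. The point
of the sketch is that the caller that loses the test-and-set race returns
without touching the queue, so the dequeue and transmit steps are
serialized per qdisc while enqueue stays outside the guard.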