* [net PATCH] net: sched, fix OOO packets with pfifo_fast
From: John Fastabend @ 2018-03-24 20:13 UTC
  To: xiyou.wangcong, jiri, davem; +Cc: netdev

After the qdisc lock was dropped in pfifo_fast, multiple enqueue
threads and dequeue threads can run in parallel. On the enqueue
side the skb bit ooo_okay is used to ensure all related skbs are
enqueued in order. On the dequeue side there is no similar logic.
What we observe is that, with fewer queues than CPUs, packets can
be reordered when two instances of __qdisc_run() run in parallel.
Each thread dequeues an skb, and whichever thread calls the ndo op
first gets its skb on the wire first, regardless of dequeue order.
This doesn't typically happen because qdisc_run() is usually
triggered by the same core that did the enqueue. However, drivers
trigger __netif_schedule() when queues transition from stopped to
awake via the netif_tx_wake_* APIs. When this happens,
__netif_schedule() results in qdisc_run() being called on the same
CPU that did the netif_tx_wake_*, which is usually done in the
interrupt completion context. That CPU is selected by the irq
affinity, which is unrelated to the enqueue operations.

To resolve this, add a RUNNING bit to the qdisc to ensure only a
single dequeue per qdisc is running at a time. Enqueue and dequeue
operations can still run in parallel, and on multiqueue NICs we
can still have a dequeue in flight per qdisc, which is typically
per CPU.

Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
Reported-by: Jakob Unterwurzacher <jakob.unterwurzacher@theobroma-systems.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/net/sch_generic.h |    1 +
 net/sched/sch_generic.c   |   13 ++++++++++---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 2092d33..8da3267 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -30,6 +30,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_RUNNING,
 };
 
 struct qdisc_size_table {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7e3fbe9..29a1b47 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -377,12 +377,17 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 	struct netdev_queue *txq;
 	struct net_device *dev;
 	struct sk_buff *skb;
-	bool validate;
+	bool more, validate;
 
 	/* Dequeue packet */
+	if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+		return false;
+
 	skb = dequeue_skb(q, &validate, packets);
-	if (unlikely(!skb))
+	if (unlikely(!skb)) {
+		clear_bit(__QDISC_STATE_RUNNING, &q->state);
 		return false;
+	}
 
 	if (!(q->flags & TCQ_F_NOLOCK))
 		root_lock = qdisc_lock(q);
@@ -390,7 +395,9 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 	dev = qdisc_dev(q);
 	txq = skb_get_tx_queue(dev, skb);
 
-	return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+	more = sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+	clear_bit(__QDISC_STATE_RUNNING, &q->state);
+	return more;
 }
 
 void __qdisc_run(struct Qdisc *q)

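For readers following along outside the kernel tree, the guard the patch adds around the dequeue can be sketched in user space with C11 atomics. This is only an illustration under stated assumptions: every `toy_*` name is hypothetical, `atomic_fetch_or()`/`atomic_fetch_and()` stand in for the kernel's `test_and_set_bit()`/`clear_bit()`, and the array-backed queue is not thread-safe on its own (it exists only so there is something to dequeue).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_QLEN 8
/* Bit index; 2 matches the position of __QDISC_STATE_RUNNING in the patched enum. */
#define TOY_STATE_RUNNING 2

/* Hypothetical stand-in for struct Qdisc; only the fields the sketch needs. */
struct toy_qdisc {
	atomic_ulong state;	/* plays the role of q->state */
	int pkts[TOY_QLEN];	/* toy FIFO, filled once and then drained */
	int head, tail;
};

/* Rough equivalent of test_and_set_bit(): sets the bit, returns its old value. */
bool toy_test_and_set_bit(int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(addr, mask) & mask) != 0;
}

/* Rough equivalent of clear_bit(). */
void toy_clear_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_and(addr, ~(1UL << nr));
}

void toy_enqueue(struct toy_qdisc *q, int pkt)
{
	assert(q->tail < TOY_QLEN);
	q->pkts[q->tail++] = pkt;
}

/* Mirrors the control flow of the patched qdisc_restart(): bail out if another
 * dequeuer owns the RUNNING bit, and clear the bit on every exit path.
 * Returns true if a packet was "sent" (stored in *sent).
 */
bool toy_qdisc_restart(struct toy_qdisc *q, int *sent)
{
	if (toy_test_and_set_bit(TOY_STATE_RUNNING, &q->state))
		return false;

	if (q->head == q->tail) {
		toy_clear_bit(TOY_STATE_RUNNING, &q->state);
		return false;
	}

	/* Dequeue and "transmit" while still holding the bit, so no other
	 * caller can get between the dequeue and the transmit.
	 */
	*sent = q->pkts[q->head++];
	toy_clear_bit(TOY_STATE_RUNNING, &q->state);
	return true;
}
```

A caller that finds the bit already set simply gets `false` back and leaves the packet for the current owner, which is what keeps the dequeue-then-transmit sequence of one caller from interleaving with another's.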

* Re: [net PATCH] net: sched, fix OOO packets with pfifo_fast
From: Eric Dumazet @ 2018-03-24 21:15 UTC
  To: John Fastabend, xiyou.wangcong, jiri, davem; +Cc: netdev



On 03/24/2018 01:13 PM, John Fastabend wrote:
> [...]


This adds a pair of atomic operations in the fast path, only for pfifo_fast's sake.

The qdisc_restart() name is misleading; this is used from __qdisc_run().


* Re: [net PATCH] net: sched, fix OOO packets with pfifo_fast
From: John Fastabend @ 2018-03-24 22:10 UTC
  To: Eric Dumazet, xiyou.wangcong, jiri, davem; +Cc: netdev

On 03/24/2018 02:15 PM, Eric Dumazet wrote:
> 
> 
> On 03/24/2018 01:13 PM, John Fastabend wrote:
>> [...]

> 
> This adds a pair of atomic operations in the fast path, only for pfifo_fast's sake.
> 

Yeah, we can wrap these in an `if (TCQ_F_NOLOCK)` check to avoid them in
cases where they're not needed. Alternatively, for net we could turn off
NOLOCK in pfifo_fast and fix it in net-next with something more complete.
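A rough user-space sketch of the first option, in which the atomic pair is paid only when the NOLOCK flag is set. None of this is kernel code: the `demo_*` names, `DEMO_F_NOLOCK` (a stand-in for TCQ_F_NOLOCK), and the `atomic_ops` counter are all hypothetical, and the sketch assumes qdiscs without the flag are already serialized by the qdisc root lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_F_NOLOCK		0x1u		/* hypothetical stand-in for TCQ_F_NOLOCK */
#define DEMO_RUNNING_MASK	(1UL << 2)	/* bit 2, as __QDISC_STATE_RUNNING in the patch */

struct demo_qdisc {
	unsigned int flags;
	atomic_ulong state;
	unsigned long atomic_ops;	/* demo-only counter of atomic RMW operations */
};

/* Returns true if the caller may dequeue. Qdiscs without the NOLOCK flag
 * skip the atomic entirely (assumed to be serialized by their lock).
 */
bool demo_run_begin(struct demo_qdisc *q)
{
	if (!(q->flags & DEMO_F_NOLOCK))
		return true;

	q->atomic_ops++;
	return !(atomic_fetch_or(&q->state, DEMO_RUNNING_MASK) & DEMO_RUNNING_MASK);
}

/* Must only be called by the caller whose demo_run_begin() returned true. */
void demo_run_end(struct demo_qdisc *q)
{
	if (!(q->flags & DEMO_F_NOLOCK))
		return;

	assert(atomic_load(&q->state) & DEMO_RUNNING_MASK);
	q->atomic_ops++;
	atomic_fetch_and(&q->state, ~DEMO_RUNNING_MASK);
}
```

With this shape, a qdisc that does not set the flag never touches `state` on the dequeue path, so the cost is confined to the lockless case.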

> The qdisc_restart() name is misleading; this is used from __qdisc_run().
> 
> 

I'll change it in net-next.

.John
