commit 917f7ff2b0f59d721d11f983af1f46c1cd74130a
Author: Vincent Ray
Date:   Mon May 23 15:24:12 2022 +0200

    net: sched: fixed barrier to prevent skbuff sticking in qdisc backlog

    In qdisc_run_begin(), smp_mb__before_atomic() used before test_bit()
    does not provide any ordering guarantee as test_bit() is not an
    atomic operation. This, added to the fact that the spin_trylock()
    call at the beginning of qdisc_run_begin() does not guarantee
    acquire semantics if it does not grab the lock, makes it possible
    for the following statement:

    if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))

    to be executed before an enqueue operation called before
    qdisc_run_begin().

    As a result the following race can happen:

               CPU 1                             CPU 2

          qdisc_run_begin()               qdisc_run_begin() /* true */
            set(MISSED)                            .
          /* returns false */                      .
              .                            /* sees MISSED = 1 */
              .                           /* so qdisc not empty */
              .                              __qdisc_run()
              .                                    .
              .                             pfifo_fast_dequeue()
     ----> /* may be done here */                  .
    |         .                                clear(MISSED)
    |         .                                    .
    |         .                          smp_mb__after_atomic();
    |         .                                    .
    |         .                            /* recheck the queue */
    |         .                            /* nothing => exit   */
    |   enqueue(skb1)
    |         .
    |   qdisc_run_begin()
    |         .
    |     spin_trylock() /* fail */
    |         .
    |   smp_mb__before_atomic() /* not enough */
    |         .
     ----  if (test_bit(MISSED))
              return false; /* exit */

    In the above scenario, CPU 1 and CPU 2 both try to grab the
    qdisc->seqlock at the same time. Only CPU 2 succeeds and enters the
    bypass code path, where it emits its skb then calls __qdisc_run().

    CPU 1 fails, sets MISSED and goes down the traditional enqueue() +
    dequeue() code path. But when executing qdisc_run_begin() for the
    second time, after enqueuing its skbuff, it sees the MISSED bit
    still set (by itself) and consequently chooses to exit early
    without setting it again nor trying to grab the spinlock again.

    Meanwhile CPU 2 has seen MISSED = 1, cleared it, checked the queue
    and found it empty, so it returned.

    At the end of the sequence, we end up with skb1 enqueued in the
    backlog, both CPUs out of __dev_xmit_skb(), the MISSED bit not set,
    and no __netif_schedule() call made. skb1 will now linger in the
    qdisc until somebody later performs a full __qdisc_run(). Combined
    with the bypass capability of the qdisc, and the ability of the TCP
    layer to avoid resending packets which it knows are still in the
    qdisc, this can lead to serious traffic "holes" in a TCP
    connection.

    We fix this by turning smp_mb__before_atomic() into smp_mb(), which
    guarantees the correct ordering of enqueue() vs test_bit() and
    consequently prevents the race.

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9bab396c1f3b..0c6016e10a6f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -191,7 +191,7 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 	 * STATE_MISSED checking is synchronized with clearing
 	 * in pfifo_fast_dequeue().
 	 */
-	smp_mb__before_atomic();
+	smp_mb();
 
 	/* If the MISSED flag is set, it means other thread has
 	 * set the MISSED flag before second spin_trylock(), so
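
The race above is an instance of the classic store-buffering pattern
(a store on each CPU followed by a load of the other CPU's location),
which is exactly the case that needs a full memory barrier on both
sides. For experimentation outside the kernel, here is a minimal
userspace sketch of that pattern using C11 atomics. It is an
illustrative analogue, not kernel code: queue_len, missed, cpu1() and
cpu2() are hypothetical names standing in for the qdisc backlog,
__QDISC_STATE_MISSED, the second qdisc_run_begin() call and
pfifo_fast_dequeue() respectively.

/* Userspace sketch of the store-buffering pattern behind the race.
 * All names are illustrative; this is not kernel code.
 * Build with: cc -O2 -pthread sb.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int queue_len;    /* stands in for the qdisc backlog */
static atomic_bool missed;      /* stands in for __QDISC_STATE_MISSED */

static bool r1, r2;

/* CPU 1: enqueue skb1, then re-check MISSED (second qdisc_run_begin()). */
static void *cpu1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&queue_len, 1, memory_order_relaxed);

	/* smp_mb() analogue: orders the enqueue store before the flag
	 * load. A weaker fence here would let the load complete before
	 * the store becomes visible, reproducing the race above.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* true = "MISSED already set, exit and rely on the other CPU" */
	r1 = atomic_load_explicit(&missed, memory_order_relaxed);
	return NULL;
}

/* CPU 2: clear MISSED, then re-check the queue (pfifo_fast_dequeue()). */
static void *cpu2(void *arg)
{
	(void)arg;
	atomic_store_explicit(&missed, false, memory_order_relaxed);

	/* smp_mb__after_atomic() analogue on the dequeue side. */
	atomic_thread_fence(memory_order_seq_cst);

	/* false = "queue looks empty, nothing left to do, exit" */
	r2 = atomic_load_explicit(&queue_len, memory_order_relaxed) != 0;
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	/* Initial state: CPU 1 has already set MISSED, queue is empty. */
	atomic_store(&missed, true);
	atomic_store(&queue_len, 0);

	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_create(&t2, NULL, cpu2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* With both seq_cst fences, the lost-skb outcome
	 * (r1 == true && r2 == false) cannot occur: at least one
	 * thread must observe the other's store.
	 */
	printf("cpu1 saw MISSED=%d, cpu2 saw nonempty=%d\n", r1, r2);
	return 0;
}

Note that a weakened version of this sketch will still almost never
misbehave in practice on strongly ordered hardware such as x86;
memory-model tools like the litmus tests under the kernel's
tools/memory-model/ are the reliable way to explore the difference.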