From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Eric Dumazet <edumazet@google.com>
Cc: "David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Jiri Pirko <jiri@resnulli.us>,
	Kuniyuki Iwashima <kuniyu@google.com>,
	Willem de Bruijn <willemb@google.com>,
	netdev@vger.kernel.org, eric.dumazet@gmail.com
Subject: Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
Date: Wed, 08 Oct 2025 12:05:53 +0200
Message-ID: <87tt09k8se.fsf@toke.dk>
In-Reply-To: <CANn89iKoMHvYsvmqHB2HmCPp3H8Mmz1FwDD1XmqM6oDJS9+vTA@mail.gmail.com>

Eric Dumazet <edumazet@google.com> writes:

> On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Eric Dumazet <edumazet@google.com> writes:
>>
>> > Remove busylock spinlock and use a lockless list (llist)
>> > to reduce spinlock contention to the minimum.
>> >
>> > Idea is that only one cpu might spin on the qdisc spinlock,
>> > while others simply add their skb in the llist.
>> >
>> > After this patch, we get a 300 % improvement on heavy TX workloads.
>> > - Sending twice the number of packets per second.
>> > - While consuming 50 % less cycles.
>> >
>> > Tested:
>> >
>> > - Dual Intel(R) Xeon(R) 6985P-C  (480 hyper threads).
>> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
>> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
>> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>> >
>> > Before:
>> >
>> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>> >
>> > vmstat 2 5
>> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>> > 243  0      0 2368988672  51036 1100852    0    0   146     1  242   60  0  9 91  0  0
>> > 244  0      0 2368988672  51036 1100852    0    0   536    10 487745 14718  0 52 48  0  0
>> > 244  0      0 2368988672  51036 1100852    0    0   512     0 503067 46033  0 52 48  0  0
>> > 244  0      0 2368988672  51036 1100852    0    0   512     0 494807 12107  0 52 48  0  0
>> > 244  0      0 2368988672  51036 1100852    0    0   702    26 492845 10110  0 52 48  0  0
>> >
>> > Lock contention (1 second sample taken on 8 cores)
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> >  contended   total wait     max wait     avg wait         type   caller
>> >
>> >     442111      6.79 s     162.47 ms     15.35 us     spinlock   dev_hard_start_xmit+0xcd
>> >       5961      9.57 ms      8.12 us      1.60 us     spinlock   __dev_queue_xmit+0x3a0
>> >        244    560.63 us      7.63 us      2.30 us     spinlock   do_softirq+0x5b
>> >         13     25.09 us      3.21 us      1.93 us     spinlock   net_tx_action+0xf8
>> >
>> > If netperf threads are pinned, spinlock stress is very high.
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> >  contended   total wait     max wait     avg wait         type   caller
>> >
>> >     964508      7.10 s     147.25 ms      7.36 us     spinlock   dev_hard_start_xmit+0xcd
>> >        201    268.05 us      4.65 us      1.33 us     spinlock   __dev_queue_xmit+0x3a0
>> >         12     26.05 us      3.84 us      2.17 us     spinlock   do_softirq+0x5b
>> >
>> > @__dev_queue_xmit_ns:
>> > [256, 512)            21 |                                                    |
>> > [512, 1K)            631 |                                                    |
>> > [1K, 2K)           27328 |@                                                   |
>> > [2K, 4K)          265392 |@@@@@@@@@@@@@@@@                                    |
>> > [4K, 8K)          417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
>> > [8K, 16K)         826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> > [16K, 32K)        733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
>> > [32K, 64K)         19055 |@                                                   |
>> > [64K, 128K)        17240 |@                                                   |
>> > [128K, 256K)       25633 |@                                                   |
>> > [256K, 512K)           4 |                                                    |
>> >
>> > After:
>> >
>> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>> >
>> > vmstat 2 5
>> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>> > 78  0      0 2369573632  32896 1350988    0    0    22     0  331  254  0  8 92  0  0
>> > 75  0      0 2369573632  32896 1350988    0    0    22    50 425713 280199  0 23 76  0  0
>> > 104  0      0 2369573632  32896 1350988    0    0   290     0 430238 298247  0 23 76  0  0
>> > 86  0      0 2369573632  32896 1350988    0    0   132     0 428019 291865  0 24 76  0  0
>> > 90  0      0 2369573632  32896 1350988    0    0   502     0 422498 278672  0 23 76  0  0
>> >
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> >  contended   total wait     max wait     avg wait         type   caller
>> >
>> >       2524    116.15 ms    486.61 us     46.02 us     spinlock   __dev_queue_xmit+0x55b
>> >       5821    107.18 ms    371.67 us     18.41 us     spinlock   dev_hard_start_xmit+0xcd
>> >       2377      9.73 ms     35.86 us      4.09 us     spinlock   ___slab_alloc+0x4e0
>> >        923      5.74 ms     20.91 us      6.22 us     spinlock   ___slab_alloc+0x5c9
>> >        121      3.42 ms    193.05 us     28.24 us     spinlock   net_tx_action+0xf8
>> >          6    564.33 us    167.60 us     94.05 us     spinlock   do_softirq+0x5b
>> >
>> > If netperf threads are pinned (~54 Mpps)
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> >      32907    316.98 ms    195.98 us      9.63 us     spinlock   dev_hard_start_xmit+0xcd
>> >       4507     61.83 ms    212.73 us     13.72 us     spinlock   __dev_queue_xmit+0x554
>> >       2781     23.53 ms     40.03 us      8.46 us     spinlock   ___slab_alloc+0x5c9
>> >       3554     18.94 ms     34.69 us      5.33 us     spinlock   ___slab_alloc+0x4e0
>> >        233      9.09 ms    215.70 us     38.99 us     spinlock   do_softirq+0x5b
>> >        153    930.66 us     48.67 us      6.08 us     spinlock   net_tx_action+0xfd
>> >         84    331.10 us     14.22 us      3.94 us     spinlock   ___slab_alloc+0x5c9
>> >        140    323.71 us      9.94 us      2.31 us     spinlock   ___slab_alloc+0x4e0
>> >
>> > @__dev_queue_xmit_ns:
>> > [128, 256)       1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
>> > [256, 512)       2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> > [512, 1K)         483936 |@@@@@@@@@@                                          |
>> > [1K, 2K)          265345 |@@@@@@                                              |
>> > [2K, 4K)          145463 |@@@                                                 |
>> > [4K, 8K)           54571 |@                                                   |
>> > [8K, 16K)          10270 |                                                    |
>> > [16K, 32K)          9385 |                                                    |
>> > [32K, 64K)          7749 |                                                    |
>> > [64K, 128K)        26799 |                                                    |
>> > [128K, 256K)        2665 |                                                    |
>> > [256K, 512K)         665 |                                                    |
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>>
>> This is very cool! One question below, just to make sure I understand
>> this correctly:
>>
>> > ---
>> >  include/net/sch_generic.h |  4 +-
>> >  net/core/dev.c            | 85 ++++++++++++++++++++++-----------------
>> >  net/sched/sch_generic.c   |  5 ---
>> >  3 files changed, 52 insertions(+), 42 deletions(-)
>> >
>> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
>> > --- a/include/net/sch_generic.h
>> > +++ b/include/net/sch_generic.h
>> > @@ -115,7 +115,9 @@ struct Qdisc {
>> >       struct Qdisc            *next_sched;
>> >       struct sk_buff_head     skb_bad_txq;
>> >
>> > -     spinlock_t              busylock ____cacheline_aligned_in_smp;
>> > +     atomic_long_t           defer_count ____cacheline_aligned_in_smp;
>> > +     struct llist_head       defer_list;
>> > +
>> >       spinlock_t              seqlock;
>> >
>> >       struct rcu_head         rcu;
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >                                struct net_device *dev,
>> >                                struct netdev_queue *txq)
>> >  {
>> > +     struct sk_buff *next, *to_free = NULL;
>> >       spinlock_t *root_lock = qdisc_lock(q);
>> > -     struct sk_buff *to_free = NULL;
>> > -     bool contended;
>> > +     struct llist_node *ll_list, *first_n;
>> > +     unsigned long defer_count = 0;
>> >       int rc;
>> >
>> >       qdisc_calculate_pkt_len(skb, q);
>> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >               return rc;
>> >       }
>> >
>> > -     /*
>> > -      * Heuristic to force contended enqueues to serialize on a
>> > -      * separate lock before trying to get qdisc main lock.
>> > -      * This permits qdisc->running owner to get the lock more
>> > -      * often and dequeue packets faster.
>> > -      * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
>> > -      * and then other tasks will only enqueue packets. The packets will be
>> > -      * sent after the qdisc owner is scheduled again. To prevent this
>> > -      * scenario the task always serialize on the lock.
>> > +     /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
>> > +      * In the try_cmpxchg() loop, we want to increment q->defer_count
>> > +      * at most once to limit the number of skbs in defer_list.
>> > +      * We perform the defer_count increment only if the list is not empty,
>> > +      * because some arches have slow atomic_long_inc_return().
>> > +      */
>> > +     first_n = READ_ONCE(q->defer_list.first);
>> > +     do {
>> > +             if (first_n && !defer_count) {
>> > +                     defer_count = atomic_long_inc_return(&q->defer_count);
>> > +                     if (unlikely(defer_count > q->limit)) {
>> > +                             kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
>> > +                             return NET_XMIT_DROP;
>> > +                     }
>> > +             }
>> > +             skb->ll_node.next = first_n;
>> > +     } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
>> > +
>> > +     /* If defer_list was not empty, we know the cpu which queued
>> > +      * the first skb will process the whole list for us.
>> >        */
>> > -     contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
>> > -     if (unlikely(contended))
>> > -             spin_lock(&q->busylock);
>> > +     if (first_n)
>> > +             return NET_XMIT_SUCCESS;
>> >
>> >       spin_lock(root_lock);
>> > +
>> > +     ll_list = llist_del_all(&q->defer_list);
>> > +     /* There is a small race because we clear defer_count not atomically
>> > +      * with the prior llist_del_all(). This means defer_list could grow
>> > +      * over q->limit.
>> > +      */
>> > +     atomic_long_set(&q->defer_count, 0);
>> > +
>> > +     ll_list = llist_reverse_order(ll_list);
>> > +
>> >       if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>> > -             __qdisc_drop(skb, &to_free);
>> > +             llist_for_each_entry_safe(skb, next, ll_list, ll_node)
>> > +                     __qdisc_drop(skb, &to_free);
>> >               rc = NET_XMIT_DROP;
>> > -     } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> > -                qdisc_run_begin(q)) {
>> > +             goto unlock;
>> > +     }
>> > +     rc = NET_XMIT_SUCCESS;
>> > +     if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> > +         !llist_next(ll_list) && qdisc_run_begin(q)) {
>> >               /*
>> >                * This is a work-conserving queue; there are no old skbs
>> >                * waiting to be sent out; and the qdisc is not running -
>> >                * xmit the skb directly.
>> >                */
>> >
>> > +             skb = llist_entry(ll_list, struct sk_buff, ll_node);
>>
>> I was puzzled by this^. But AFAICT, the idea is that in the
>> non-contended path we're in here (no other CPU enqueueing packets), we
>> will still have added the skb to the llist before taking the root_lock,
>> and here we pull it back off the list, right?
>>
>> Even though this is technically not needed (we'll always get the same
>> skb back, I think?), so this is mostly for consistency(?)
>
> Exactly. I guess we could instead add an assert like
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
>                  * xmit the skb directly.
>                  */
>
> -               skb = llist_entry(ll_list, struct sk_buff, ll_node);
> +               DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list,
> struct sk_buff, ll_node));
>                 qdisc_bstats_update(q, skb);
>                 if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
>                         __qdisc_run(q);

OK, so thinking about this some more, isn't there a race between the

	if (first_n)
		return NET_XMIT_SUCCESS;

and taking the lock? I.e., two different CPUs can pass that check, in
which case one of them will end up spinning on the lock, and by the time
it acquires it, there is no longer any guarantee that the skb on the
llist will be the same one that we started with? Or indeed that there
will be any skbs on the list at all?

In which case, shouldn't there be a:

if (unlikely(!ll_list))
	goto unlock;

after the llist_del_all() assignment inside the lock? (and we should
retain the skb = llist_entry() assignment I was confused about).
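
Roughly, the locked section would then look something like this (untested
sketch on top of your patch; llist_del_all() returns NULL when the list
was already empty, hence the plain NULL check):

	spin_lock(root_lock);

	ll_list = llist_del_all(&q->defer_list);
	atomic_long_set(&q->defer_count, 0);

	/* Another CPU may have raced past the first_n check, taken the
	 * lock before us and drained the list already.
	 */
	if (unlikely(!ll_list))
		goto unlock;

	ll_list = llist_reverse_order(ll_list);
	...
	if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
	    !llist_next(ll_list) && qdisc_run_begin(q)) {
		/* Keep this: after such a race, the single skb on the
		 * list is not necessarily the one this CPU enqueued.
		 */
		skb = llist_entry(ll_list, struct sk_buff, ll_node);
		...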

Or am I missing some reason this can't happen?

-Toke



Thread overview: 23+ messages
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-06 19:38   ` Eric Dumazet
2025-10-07 15:26   ` Alexander Lobakin
2025-10-07 19:41     ` Maciej Fijalkowski
2025-10-09  8:37       ` Jason Xing
2025-10-06 19:31 ` [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
2025-10-12 15:22   ` Jamal Hadi Salim
2025-10-12 18:34     ` Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-10-07 11:37   ` Eric Dumazet
2025-10-08  6:37   ` Paolo Abeni
2025-10-08  7:32     ` Eric Dumazet
2025-10-08  7:44       ` Paolo Abeni
2025-10-08  8:48   ` Toke Høiland-Jørgensen
2025-10-08  9:32     ` Eric Dumazet
2025-10-08 10:05       ` Toke Høiland-Jørgensen [this message]
2025-10-08 10:46         ` Eric Dumazet
2025-10-08 12:11           ` Toke Høiland-Jørgensen
2025-10-07  5:23 ` [syzbot ci] Re: net: optimize TX throughput and efficiency syzbot ci
2025-10-07  7:41   ` Eric Dumazet
