From: Toke Høiland-Jørgensen
To: Eric Dumazet, "David S. Miller", Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn, netdev@vger.kernel.org, eric.dumazet@gmail.com, Eric Dumazet
Subject: Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
In-Reply-To: <20251006193103.2684156-6-edumazet@google.com>
References: <20251006193103.2684156-1-edumazet@google.com> <20251006193103.2684156-6-edumazet@google.com>
Date: Wed, 08 Oct 2025 10:48:16 +0200
Message-ID: <87wm55kcdr.fsf@toke.dk>

Eric Dumazet writes:

> Remove busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to the minimum.
>
> Idea is that only one cpu might spin on the qdisc spinlock,
> while others simply add their skb in the llist.
>
> After this patch, we get a 300 % improvement on heavy TX workloads.
> - Sending twice the number of packets per second.
> - While consuming 50 % less cycles.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>
> Before:
>
> 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>   r  b   swpd       free   buff   cache   si   so    bi    bo     in     cs us sy id wa st
> 243  0      0 2368988672  51036 1100852    0    0   146     1    242     60  0  9 91  0  0
> 244  0      0 2368988672  51036 1100852    0    0   536    10 487745  14718  0 52 48  0  0
> 244  0      0 2368988672  51036 1100852    0    0   512     0 503067  46033  0 52 48  0  0
> 244  0      0 2368988672  51036 1100852    0    0   512     0 494807  12107  0 52 48  0  0
> 244  0      0 2368988672  51036 1100852    0    0   702    26 492845  10110  0 52 48  0  0
>
> Lock contention (1 second sample taken on 8 cores)
> perf lock record -C0-7 sleep 1; perf lock contention
>  contended   total wait     max wait     avg wait         type   caller
>
>     442111       6.79 s    162.47 ms     15.35 us     spinlock   dev_hard_start_xmit+0xcd
>       5961      9.57 ms      8.12 us      1.60 us     spinlock   __dev_queue_xmit+0x3a0
>        244    560.63 us      7.63 us      2.30 us     spinlock   do_softirq+0x5b
>         13     25.09 us      3.21 us      1.93 us     spinlock   net_tx_action+0xf8
>
> If netperf threads are pinned, spinlock stress is very high.
> perf lock record -C0-7 sleep 1; perf lock contention
>  contended   total wait     max wait     avg wait         type   caller
>
>     964508       7.10 s    147.25 ms      7.36 us     spinlock   dev_hard_start_xmit+0xcd
>        201    268.05 us      4.65 us      1.33 us     spinlock   __dev_queue_xmit+0x3a0
>         12     26.05 us      3.84 us      2.17 us     spinlock   do_softirq+0x5b
>
> @__dev_queue_xmit_ns:
> [256, 512)            21 |                                                    |
> [512, 1K)            631 |                                                    |
> [1K, 2K)           27328 |@                                                   |
> [2K, 4K)          265392 |@@@@@@@@@@@@@@@@                                    |
> [4K, 8K)          417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
> [8K, 16K)         826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K)        733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
> [32K, 64K)         19055 |@                                                   |
> [64K, 128K)        17240 |@                                                   |
> [128K, 256K)       25633 |@                                                   |
> [256K, 512K)           4 |                                                    |
>
> After:
>
> 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>   r  b   swpd       free   buff   cache   si   so    bi    bo     in     cs us sy id wa st
>  78  0      0 2369573632  32896 1350988    0    0    22     0    331    254  0  8 92  0  0
>  75  0      0 2369573632  32896 1350988    0    0    22    50 425713 280199  0 23 76  0  0
> 104  0      0 2369573632  32896 1350988    0    0   290     0 430238 298247  0 23 76  0  0
>  86  0      0 2369573632  32896 1350988    0    0   132     0 428019 291865  0 24 76  0  0
>  90  0      0 2369573632  32896 1350988    0    0   502     0 422498 278672  0 23 76  0  0
>
> perf lock record -C0-7 sleep 1; perf lock contention
>  contended   total wait     max wait     avg wait         type   caller
>
>       2524    116.15 ms    486.61 us     46.02 us     spinlock   __dev_queue_xmit+0x55b
>       5821    107.18 ms    371.67 us     18.41 us     spinlock   dev_hard_start_xmit+0xcd
>       2377      9.73 ms     35.86 us      4.09 us     spinlock   ___slab_alloc+0x4e0
>        923      5.74 ms     20.91 us      6.22 us     spinlock   ___slab_alloc+0x5c9
>        121      3.42 ms    193.05 us     28.24 us     spinlock   net_tx_action+0xf8
>          6    564.33 us    167.60 us     94.05 us     spinlock   do_softirq+0x5b
>
> If netperf threads are pinned (~54 Mpps)
> perf lock record -C0-7 sleep 1; perf lock contention
>      32907    316.98 ms    195.98 us      9.63 us     spinlock   dev_hard_start_xmit+0xcd
>       4507     61.83 ms    212.73 us     13.72 us     spinlock   __dev_queue_xmit+0x554
>       2781     23.53 ms     40.03 us      8.46 us     spinlock   ___slab_alloc+0x5c9
>       3554     18.94 ms     34.69 us      5.33 us     spinlock   ___slab_alloc+0x4e0
>        233      9.09 ms    215.70 us     38.99 us     spinlock   do_softirq+0x5b
>        153    930.66 us     48.67 us      6.08 us     spinlock   net_tx_action+0xfd
>         84    331.10 us     14.22 us      3.94 us     spinlock   ___slab_alloc+0x5c9
>        140    323.71 us      9.94 us      2.31 us     spinlock   ___slab_alloc+0x4e0
>
> @__dev_queue_xmit_ns:
> [128, 256)       1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
> [256, 512)       2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K)         483936 |@@@@@@@@@@                                          |
> [1K, 2K)          265345 |@@@@@@                                              |
> [2K, 4K)          145463 |@@@                                                 |
> [4K, 8K)           54571 |@                                                   |
> [8K, 16K)          10270 |                                                    |
> [16K, 32K)          9385 |                                                    |
> [32K, 64K)          7749 |                                                    |
> [64K, 128K)        26799 |                                                    |
> [128K, 256K)        2665 |                                                    |
> [256K, 512K)         665 |                                                    |
>
> Signed-off-by: Eric Dumazet

This is very cool! One question below, just to make sure I understand
this correctly:

> ---
>  include/net/sch_generic.h |  4 +-
>  net/core/dev.c            | 85 ++++++++++++++++++++++----------------
>  net/sched/sch_generic.c   |  5 ---
>  3 files changed, 52 insertions(+), 42 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -115,7 +115,9 @@ struct Qdisc {
>  	struct Qdisc		*next_sched;
>  	struct sk_buff_head	skb_bad_txq;
>  
> -	spinlock_t		busylock ____cacheline_aligned_in_smp;
> +	atomic_long_t		defer_count ____cacheline_aligned_in_smp;
> +	struct llist_head	defer_list;
> +
>  	spinlock_t		seqlock;
>  
>  	struct rcu_head		rcu;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  				 struct net_device *dev,
>  				 struct netdev_queue *txq)
>  {
> +	struct sk_buff *next, *to_free = NULL;
>  	spinlock_t *root_lock = qdisc_lock(q);
> -	struct sk_buff *to_free = NULL;
> -	bool contended;
> +	struct llist_node *ll_list, *first_n;
> +	unsigned long defer_count = 0;
>  	int rc;
>  
>  	qdisc_calculate_pkt_len(skb, q);
> @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  		return rc;
>  	}
>  
> -	/*
> -	 * Heuristic to force contended enqueues to serialize on a
> -	 * separate lock before trying to get qdisc main lock.
> -	 * This permits qdisc->running owner to get the lock more
> -	 * often and dequeue packets faster.
> -	 * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> -	 * and then other tasks will only enqueue packets. The packets will be
> -	 * sent after the qdisc owner is scheduled again. To prevent this
> -	 * scenario the task always serialize on the lock.
> +	/* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
> +	 * In the try_cmpxchg() loop, we want to increment q->defer_count
> +	 * at most once to limit the number of skbs in defer_list.
> +	 * We perform the defer_count increment only if the list is not empty,
> +	 * because some arches have slow atomic_long_inc_return().
> +	 */
> +	first_n = READ_ONCE(q->defer_list.first);
> +	do {
> +		if (first_n && !defer_count) {
> +			defer_count = atomic_long_inc_return(&q->defer_count);
> +			if (unlikely(defer_count > q->limit)) {
> +				kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> +				return NET_XMIT_DROP;
> +			}
> +		}
> +		skb->ll_node.next = first_n;
> +	} while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
> +
> +	/* If defer_list was not empty, we know the cpu which queued
> +	 * the first skb will process the whole list for us.
>  	 */
> -	contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> -	if (unlikely(contended))
> -		spin_lock(&q->busylock);
> +	if (first_n)
> +		return NET_XMIT_SUCCESS;
>  
>  	spin_lock(root_lock);
> +
> +	ll_list = llist_del_all(&q->defer_list);
> +	/* There is a small race because we clear defer_count not atomically
> +	 * with the prior llist_del_all(). This means defer_list could grow
> +	 * over q->limit.
> +	 */
> +	atomic_long_set(&q->defer_count, 0);
> +
> +	ll_list = llist_reverse_order(ll_list);
> +
>  	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> -		__qdisc_drop(skb, &to_free);
> +		llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> +			__qdisc_drop(skb, &to_free);
>  		rc = NET_XMIT_DROP;
> -	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> -		   qdisc_run_begin(q)) {
> +		goto unlock;
> +	}
> +	rc = NET_XMIT_SUCCESS;
> +	if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> +	    !llist_next(ll_list) && qdisc_run_begin(q)) {
>  		/*
>  		 * This is a work-conserving queue; there are no old skbs
>  		 * waiting to be sent out; and the qdisc is not running -
>  		 * xmit the skb directly.
>  		 */
>  
> +		skb = llist_entry(ll_list, struct sk_buff, ll_node);

I was puzzled by this^. But AFAICT, the idea is that in the
non-contended path we're in here (no other CPU is enqueueing packets),
we will still have added the skb to the llist before taking the
root_lock, and here we pull it back off the list, right? This is
technically not needed (we'll always get the same skb back, I think?),
so I assume it is mostly done for consistency?

Assuming I got this right:

Reviewed-by: Toke Høiland-Jørgensen
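
(For completeness, here is roughly how I read the new fast path. This
is a heavily simplified sketch of my understanding, not the actual
patch code: the bypass, deactivated and unlock/error paths are elided,
the helper name is made up, and plain llist_add() stands in for the
open-coded try_cmpxchg() loop that also enforces q->limit via
defer_count.)

#include <linux/llist.h>
#include <linux/skbuff.h>
#include <net/pkt_sched.h>
#include <net/sch_generic.h>

/* Hypothetical helper, not in the patch: condensed view of the
 * __dev_xmit_skb() flow after this change.
 */
static int defer_list_xmit_sketch(struct sk_buff *skb, struct Qdisc *q)
{
	struct sk_buff *next, *to_free = NULL;
	struct llist_node *ll_list;
	int rc = NET_XMIT_SUCCESS;

	/* Every CPU first pushes its skb onto the lockless defer_list.
	 * llist_add() returns true only if the list was empty before,
	 * i.e. we are the CPU that has to flush it; the patch open-codes
	 * this so it can also bump q->defer_count for the queue limit.
	 */
	if (!llist_add(&skb->ll_node, &q->defer_list))
		return NET_XMIT_SUCCESS;	/* another CPU owns the flush */

	spin_lock(qdisc_lock(q));

	/* Grab everything queued so far (our own skb included), reset the
	 * limit accounting, and restore FIFO order before enqueueing.
	 */
	ll_list = llist_del_all(&q->defer_list);
	atomic_long_set(&q->defer_count, 0);
	ll_list = llist_reverse_order(ll_list);

	llist_for_each_entry_safe(skb, next, ll_list, ll_node)
		rc = q->enqueue(skb, q, &to_free);

	qdisc_run(q);
	spin_unlock(qdisc_lock(q));

	if (unlikely(to_free))
		kfree_skb_list(to_free);
	return rc;
}

I.e. only the CPU that turns the list non-empty ever spins on the qdisc
root lock; every other CPU does a single cmpxchg and returns, which
presumably is where the improvement in the contention numbers above
comes from.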