From: Toke Høiland-Jørgensen
To: Eric Dumazet
Cc: "David S. Miller", Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev@vger.kernel.org, eric.dumazet@gmail.com
Subject: Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
In-Reply-To:
References: <20251006193103.2684156-1-edumazet@google.com>
	<20251006193103.2684156-6-edumazet@google.com>
	<87wm55kcdr.fsf@toke.dk> <87tt09k8se.fsf@toke.dk>
X-Clacks-Overhead: GNU Terry Pratchett
Date: Wed, 08 Oct 2025 14:11:58 +0200
Message-ID: <87ms61k2y9.fsf@toke.dk>

Eric Dumazet writes:

> On Wed, Oct 8, 2025 at 3:05 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Eric Dumazet writes:
>>
>> > On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen wrote:
>> >>
>> >> Eric Dumazet writes:
>> >>
>> >> > Remove busylock spinlock and use a lockless list (llist)
>> >> > to reduce spinlock contention to the minimum.
>> >> >
>> >> > Idea is that only one cpu might spin on the qdisc spinlock,
>> >> > while others simply add their skb in the llist.
>> >> >
>> >> > After this patch, we get a 300 % improvement on heavy TX workloads.
>> >> > - Sending twice the number of packets per second.
>> >> > - While consuming 50 % less cycles.
>> >> >
>> >> > Tested:
>> >> >
>> >> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
>> >> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
>> >> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
>> >> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>> >> >
>> >> > Before:
>> >> >
>> >> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>> >> >
>> >> > vmstat 2 5
>> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> >> >   r  b swpd       free  buff   cache si so  bi bo     in     cs us sy id wa st
>> >> > 243  0    0 2368988672 51036 1100852  0  0 146  1    242     60  0  9 91  0  0
>> >> > 244  0    0 2368988672 51036 1100852  0  0 536 10 487745  14718  0 52 48  0  0
>> >> > 244  0    0 2368988672 51036 1100852  0  0 512  0 503067  46033  0 52 48  0  0
>> >> > 244  0    0 2368988672 51036 1100852  0  0 512  0 494807  12107  0 52 48  0  0
>> >> > 244  0    0 2368988672 51036 1100852  0  0 702 26 492845  10110  0 52 48  0  0
>> >> >
>> >> > Lock contention (1 second sample taken on 8 cores)
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> >  contended   total wait     max wait     avg wait         type   caller
>> >> >
>> >> >     442111       6.79 s    162.47 ms     15.35 us     spinlock   dev_hard_start_xmit+0xcd
>> >> >       5961      9.57 ms      8.12 us      1.60 us     spinlock   __dev_queue_xmit+0x3a0
>> >> >        244    560.63 us      7.63 us      2.30 us     spinlock   do_softirq+0x5b
>> >> >         13     25.09 us      3.21 us      1.93 us     spinlock   net_tx_action+0xf8
>> >> >
>> >> > If netperf threads are pinned, spinlock stress is very high.
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> >  contended   total wait     max wait     avg wait         type   caller
>> >> >
>> >> >     964508       7.10 s    147.25 ms      7.36 us     spinlock   dev_hard_start_xmit+0xcd
>> >> >        201    268.05 us      4.65 us      1.33 us     spinlock   __dev_queue_xmit+0x3a0
>> >> >         12     26.05 us      3.84 us      2.17 us     spinlock   do_softirq+0x5b
>> >> >
>> >> > @__dev_queue_xmit_ns:
>> >> > [256, 512)         21 |                                                    |
>> >> > [512, 1K)         631 |                                                    |
>> >> > [1K, 2K)        27328 |@                                                   |
>> >> > [2K, 4K)       265392 |@@@@@@@@@@@@@@@@                                    |
>> >> > [4K, 8K)       417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
>> >> > [8K, 16K)      826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> >> > [16K, 32K)     733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
>> >> > [32K, 64K)      19055 |@                                                   |
>> >> > [64K, 128K)     17240 |@                                                   |
>> >> > [128K, 256K)    25633 |@                                                   |
>> >> > [256K, 512K)        4 |                                                    |
>> >> >
>> >> > After:
>> >> >
>> >> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>> >> >
>> >> > vmstat 2 5
>> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> >> >   r  b swpd       free  buff   cache si so  bi bo     in     cs us sy id wa st
>> >> >  78  0    0 2369573632 32896 1350988  0  0  22  0    331    254  0  8 92  0  0
>> >> >  75  0    0 2369573632 32896 1350988  0  0  22 50 425713 280199  0 23 76  0  0
>> >> > 104  0    0 2369573632 32896 1350988  0  0 290  0 430238 298247  0 23 76  0  0
>> >> >  86  0    0 2369573632 32896 1350988  0  0 132  0 428019 291865  0 24 76  0  0
>> >> >  90  0    0 2369573632 32896 1350988  0  0 502  0 422498 278672  0 23 76  0  0
>> >> >
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> >  contended   total wait     max wait     avg wait         type   caller
>> >> >
>> >> >       2524    116.15 ms    486.61 us     46.02 us     spinlock   __dev_queue_xmit+0x55b
>> >> >       5821    107.18 ms    371.67 us     18.41 us     spinlock   dev_hard_start_xmit+0xcd
>> >> >       2377      9.73 ms     35.86 us      4.09 us     spinlock   ___slab_alloc+0x4e0
>> >> >        923      5.74 ms     20.91 us      6.22 us     spinlock   ___slab_alloc+0x5c9
>> >> >        121      3.42 ms    193.05 us     28.24 us     spinlock   net_tx_action+0xf8
>> >> >          6    564.33 us    167.60 us     94.05 us     spinlock   do_softirq+0x5b
>> >> >
>> >> > If netperf threads are pinned (~54 Mpps)
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> >      32907    316.98 ms    195.98 us      9.63 us     spinlock   dev_hard_start_xmit+0xcd
>> >> >       4507     61.83 ms    212.73 us     13.72 us     spinlock   __dev_queue_xmit+0x554
>> >> >       2781     23.53 ms     40.03 us      8.46 us     spinlock   ___slab_alloc+0x5c9
>> >> >       3554     18.94 ms     34.69 us      5.33 us     spinlock   ___slab_alloc+0x4e0
>> >> >        233      9.09 ms    215.70 us     38.99 us     spinlock   do_softirq+0x5b
>> >> >        153    930.66 us     48.67 us      6.08 us     spinlock   net_tx_action+0xfd
>> >> >         84    331.10 us     14.22 us      3.94 us     spinlock   ___slab_alloc+0x5c9
>> >> >        140    323.71 us      9.94 us      2.31 us     spinlock   ___slab_alloc+0x4e0
>> >> >
>> >> > @__dev_queue_xmit_ns:
>> >> > [128, 256)    1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
>> >> > [256, 512)    2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> >> > [512, 1K)      483936 |@@@@@@@@@@                                          |
>> >> > [1K, 2K)       265345 |@@@@@@                                              |
>> >> > [2K, 4K)       145463 |@@@                                                 |
>> >> > [4K, 8K)        54571 |@                                                   |
>> >> > [8K, 16K)       10270 |                                                    |
>> >> > [16K, 32K)       9385 |                                                    |
>> >> > [32K, 64K)       7749 |                                                    |
>> >> > [64K, 128K)     26799 |                                                    |
>> >> > [128K, 256K)     2665 |                                                    |
>> >> > [256K, 512K)      665 |                                                    |
>> >> >
>> >> > Signed-off-by: Eric Dumazet
>> >>
>> >> This is very cool! One question below, just to make sure I understand
>> >> this correctly:
>> >>
>> >> > ---
>> >> >  include/net/sch_generic.h |  4 +-
>> >> >  net/core/dev.c            | 85 ++++++++++++++++++++++-----------------
>> >> >  net/sched/sch_generic.c   |  5 ---
>> >> >  3 files changed, 52 insertions(+), 42 deletions(-)
>> >> >
>> >> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> >> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
>> >> > --- a/include/net/sch_generic.h
>> >> > +++ b/include/net/sch_generic.h
>> >> > @@ -115,7 +115,9 @@ struct Qdisc {
>> >> >          struct Qdisc            *next_sched;
>> >> >          struct sk_buff_head     skb_bad_txq;
>> >> >
>> >> > -        spinlock_t              busylock ____cacheline_aligned_in_smp;
>> >> > +        atomic_long_t           defer_count ____cacheline_aligned_in_smp;
>> >> > +        struct llist_head       defer_list;
>> >> > +
>> >> >          spinlock_t              seqlock;
>> >> >
>> >> >          struct rcu_head         rcu;
>> >> > diff --git a/net/core/dev.c b/net/core/dev.c
>> >> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
>> >> > --- a/net/core/dev.c
>> >> > +++ b/net/core/dev.c
>> >> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >> >                                   struct net_device *dev,
>> >> >                                   struct netdev_queue *txq)
>> >> >  {
>> >> > +        struct sk_buff *next, *to_free = NULL;
>> >> >          spinlock_t *root_lock = qdisc_lock(q);
>> >> > -        struct sk_buff *to_free = NULL;
>> >> > -        bool contended;
>> >> > +        struct llist_node *ll_list, *first_n;
>> >> > +        unsigned long defer_count = 0;
>> >> >          int rc;
>> >> >
>> >> >          qdisc_calculate_pkt_len(skb, q);
>> >> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >> >                  return rc;
>> >> >          }
>> >> >
>> >> > -        /*
>> >> > -         * Heuristic to force contended enqueues to serialize on a
>> >> > -         * separate lock before trying to get qdisc main lock.
>> >> > -         * This permits qdisc->running owner to get the lock more
>> >> > -         * often and dequeue packets faster.
>> >> > -         * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
>> >> > -         * and then other tasks will only enqueue packets. The packets will be
>> >> > -         * sent after the qdisc owner is scheduled again. To prevent this
>> >> > -         * scenario the task always serialize on the lock.
>> >> > +        /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
>> >> > +         * In the try_cmpxchg() loop, we want to increment q->defer_count
>> >> > +         * at most once to limit the number of skbs in defer_list.
>> >> > +         * We perform the defer_count increment only if the list is not empty,
>> >> > +         * because some arches have slow atomic_long_inc_return().
>> >> > +         */
>> >> > +        first_n = READ_ONCE(q->defer_list.first);
>> >> > +        do {
>> >> > +                if (first_n && !defer_count) {
>> >> > +                        defer_count = atomic_long_inc_return(&q->defer_count);
>> >> > +                        if (unlikely(defer_count > q->limit)) {
>> >> > +                                kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
>> >> > +                                return NET_XMIT_DROP;
>> >> > +                        }
>> >> > +                }
>> >> > +                skb->ll_node.next = first_n;
>> >> > +        } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
>> >> > +
>> >> > +        /* If defer_list was not empty, we know the cpu which queued
>> >> > +         * the first skb will process the whole list for us.
>> >> >           */
>> >> > -        contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
>> >> > -        if (unlikely(contended))
>> >> > -                spin_lock(&q->busylock);
>> >> > +        if (first_n)
>> >> > +                return NET_XMIT_SUCCESS;
>> >> >
>> >> >          spin_lock(root_lock);
>> >> > +
>> >> > +        ll_list = llist_del_all(&q->defer_list);
>> >> > +        /* There is a small race because we clear defer_count not atomically
>> >> > +         * with the prior llist_del_all(). This means defer_list could grow
>> >> > +         * over q->limit.
>> >> > +         */
>> >> > +        atomic_long_set(&q->defer_count, 0);
>> >> > +
>> >> > +        ll_list = llist_reverse_order(ll_list);
>> >> > +
>> >> >          if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>> >> > -                __qdisc_drop(skb, &to_free);
>> >> > +                llist_for_each_entry_safe(skb, next, ll_list, ll_node)
>> >> > +                        __qdisc_drop(skb, &to_free);
>> >> >                  rc = NET_XMIT_DROP;
>> >> > -        } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> >> > -                   qdisc_run_begin(q)) {
>> >> > +                goto unlock;
>> >> > +        }
>> >> > +        rc = NET_XMIT_SUCCESS;
>> >> > +        if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> >> > +            !llist_next(ll_list) && qdisc_run_begin(q)) {
>> >> >                  /*
>> >> >                   * This is a work-conserving queue; there are no old skbs
>> >> >                   * waiting to be sent out; and the qdisc is not running -
>> >> >                   * xmit the skb directly.
>> >> >                   */
>> >> >
>> >> > +                skb = llist_entry(ll_list, struct sk_buff, ll_node);
>> >>
>> >> I was puzzled by this^. But AFAICT, the idea is that in the
>> >> non-contended path we're in here (no other CPU enqueueing packets), we
>> >> will still have added the skb to the llist before taking the root_lock,
>> >> and here we pull it back off the list, right?
>> >>
>> >> Even though this is technically not needed (we'll always get the same
>> >> skb back, I think?), so this is mostly for consistency(?)
>> >
>> > Exactly. I guess we could instead add an assert like
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >                   * xmit the skb directly.
>> >                   */
>> >
>> > -                skb = llist_entry(ll_list, struct sk_buff, ll_node);
>> > +                DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list, struct sk_buff, ll_node));
>> >                  qdisc_bstats_update(q, skb);
>> >                  if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
>> >                          __qdisc_run(q);
>>
>> OK, so thinking about this some more, isn't there a race between the
>>
>>         if (first_n)
>>                 return NET_XMIT_SUCCESS;
>>
>> and taking the lock? I.e., two different CPUs can pass that check in
>> which case one of them will end up spinning on the lock, and by the time
>> it acquires it, there is no longer any guarantee that the skb on the
>> llist will be the same one that we started with? Or indeed that there
>> will be any skbs on the list at all?
>
> Only one cpu can observe the list was empty.
>
> This (spinlock)less list still has atomic guarantees, thanks to cmpxchg().

Ah, right, missed the bit where try_cmpxchg would update first_n if it
doesn't succeed...

> So after this only cpu passes the spinlock() barrier, the list has at
> least one skb in it.
>
> If there is a single skb in the list after the xchg() got it
> atomically, then it must be its own skb.

Gotcha! In that case I agree the DEBUG_NET_WARN_ON_ONCE() is clearer
than reassigning the skb pointer.

-Toke
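
[Editorial note appended for readers of the archive: the "only one cpu can
observe the list was empty" argument can be demonstrated outside the kernel.
Below is a minimal userspace sketch of the same pattern, using C11 atomics
and a pthread mutex in place of the kernel's llist helpers and the qdisc
root_lock. All names (queue_push, queue_drain, etc.) are illustrative and
are not taken from the patch; build with "cc -pthread".]

    /* Producers push onto a lock-free singly linked list with a CAS loop;
     * only the producer that observed an empty list takes the lock and
     * drains everything, mirroring the defer_list handling discussed above.
     */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <pthread.h>

    struct node {
            struct node *next;
            int val;
    };

    struct queue {
            _Atomic(struct node *) head;   /* plays the role of q->defer_list  */
            pthread_mutex_t lock;          /* plays the role of the root_lock  */
    };

    /* Push one node; return true if the list was empty before the push.
     * At most one concurrent pusher can see "empty", because every
     * successful CAS installs a non-NULL head that later pushers observe.
     */
    static bool queue_push(struct queue *q, struct node *n)
    {
            struct node *first = atomic_load_explicit(&q->head,
                                                      memory_order_relaxed);
            do {
                    n->next = first;
            } while (!atomic_compare_exchange_weak_explicit(&q->head, &first, n,
                                                            memory_order_release,
                                                            memory_order_relaxed));
            return first == NULL;
    }

    /* Grab the whole list atomically (llist_del_all() equivalent), restore
     * FIFO order (llist_reverse_order() equivalent) and process it.
     */
    static void queue_drain(struct queue *q)
    {
            struct node *list, *prev = NULL, *next;

            pthread_mutex_lock(&q->lock);
            list = atomic_exchange_explicit(&q->head, (struct node *)NULL,
                                            memory_order_acquire);
            while (list) {                 /* reverse to oldest-first order */
                    next = list->next;
                    list->next = prev;
                    prev = list;
                    list = next;
            }
            for (list = prev; list; list = next) {
                    next = list->next;
                    printf("processing %d\n", list->val);
            }
            pthread_mutex_unlock(&q->lock);
    }

    int main(void)
    {
            struct queue q = { .lock = PTHREAD_MUTEX_INITIALIZER };
            struct node a = { .val = 1 }, b = { .val = 2 };

            /* The first push sees an empty list, so this caller drains. */
            bool drainer = queue_push(&q, &a);
            /* A later push (possibly from another CPU) just enqueues. */
            queue_push(&q, &b);

            if (drainer)
                    queue_drain(&q);
            return 0;
    }

The point of the sketch is only that queue_push() can return true for at
most one concurrent caller, which is why the non-empty case in the patch
can simply return NET_XMIT_SUCCESS and leave the draining to whoever saw
the list empty.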