public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Jason Wang <jasowang@redhat.com>
To: Simon Schippers <simon.schippers@tu-dortmund.de>
Cc: willemdebruijn.kernel@gmail.com, andrew+netdev@lunn.ch,
	 davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com,  mst@redhat.com, eperezma@redhat.com,
	leiyang@redhat.com,  stephen@networkplumber.org, jon@nutanix.com,
	tim.gebauer@tu-dortmund.de,  netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	 virtualization@lists.linux.dev
Subject: Re: [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
Date: Thu, 26 Mar 2026 10:41:29 +0800	[thread overview]
Message-ID: <CACGkMEuLqxnz=GtdKp3-u0egkGT_eZUgBBwvvtHcgAYpLJv-UA@mail.gmail.com> (raw)
In-Reply-To: <cda710f8-033a-4772-b5ce-a2cfddf42bea@tu-dortmund.de>

On Wed, Mar 25, 2026 at 10:48 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 3/24/26 11:14, Simon Schippers wrote:
> > On 3/24/26 02:47, Jason Wang wrote:
> >> On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
> >> <simon.schippers@tu-dortmund.de> wrote:
> >>>
> >>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>> reaches capacity, the netdev queue is stopped instead of dropping
> >>> subsequent packets.
> >>>
> >>> If producing an entry fails anyways due to a race, tun_net_xmit returns
> >>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
> >>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>
> >>> The existing __tun_wake_queue() function wakes the netdev queue. Races
> >>> between this wakeup and the queue-stop logic could leave the queue
> >>> stopped indefinitely. To prevent this, a memory barrier is enforced
> >>> (as discussed in a similar implementation in [1]), followed by a recheck
> >>> that wakes the queue if space is already available.
> >>>
> >>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>
> >> I wonder if we need a dedicated TUN flag to enable this. With this new
> >> flag, we can even prevent TUN from using noqueue (not sure if it's
> >> possible or not).
> >>
> >
> > Except of the slight regressions because of this patchset I do not see
> > a reason for such a flag.
> >
> > I have never seen that the driver prevents noqueue. For example you can
> > set noqueue to your ethernet interface and under load you soon get
> >
> > net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> >                                            dev->name);
> >
> > followed by a -ENETDOWN. And this is not prevented even though it is
> > clearly not something a user wants.
> >
> >>>
> >>> Benchmarks:
> >>> The benchmarks show a slight regression in raw transmission performance,
> >>> though no packets are lost anymore.
> >>>
> >>> The previously introduced threshold to only wake after the queue stopped
> >>> and half of the ring was consumed showed to be a descent choice:
> >>> Waking the queue whenever a consume made space in the ring strongly
> >>> degrades performance for tap, while waking only when the ring is empty
> >>> is too late and also hurts throughput for tap & tap+vhost-net.
> >>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
> >>> 1/2 was chosen for the sake of simplicity for both tun/tap and
> >>> tun/tap+vhost-net.
> >>>
> >>> Test setup:
> >>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> >>> Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
> >>> mitigations disabled.
> >>>
> >>> Note for tap+vhost-net:
> >>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
> >>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
> >>>
> >>> +--------------------------+--------------+----------------+----------+
> >>> | 1 thread                 | Stock        | Patched with   | diff     |
> >>> | sending                  |              | fq_codel qdisc |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>> | TAP        | Transmitted | 1.151 Mpps   | 1.139 Mpps     | -1.1%    |
> >>> |            +-------------+--------------+----------------+----------+
> >>> |            | Lost/s      | 3.606 Mpps   | 0 pps          |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>> | TAP        | Transmitted | 3.948 Mpps   | 3.738 Mpps     | -5.3%    |
> >>> |            +-------------+--------------+----------------+----------+
> >>> | +vhost-net | Lost/s      | 496.5 Kpps   | 0 pps          |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>>
> >>> +--------------------------+--------------+----------------+----------+
> >>> | 2 threads                | Stock        | Patched with   | diff     |
> >>> | sending                  |              | fq_codel qdisc |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>> | TAP        | Transmitted | 1.133 Mpps   | 1.109 Mpps     | -2.1%    |
> >>> |            +-------------+--------------+----------------+----------+
> >>> |            | Lost/s      | 8.269 Mpps   | 0 pps          |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>> | TAP        | Transmitted | 3.820 Mpps   | 3.513 Mpps     | -8.0%    |
> >>> |            +-------------+--------------+----------------+----------+
> >>> | +vhost-net | Lost/s      | 4.961 Mpps   | 0 pps          |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>>
> >>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>>
> >>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>> ---
> >>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
> >>>  1 file changed, 28 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>> index b86582cc6cb6..9b7daec69acd 100644
> >>> --- a/drivers/net/tun.c
> >>> +++ b/drivers/net/tun.c
> >>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>         struct netdev_queue *queue;
> >>>         struct tun_file *tfile;
> >>>         int len = skb->len;
> >>> +       bool qdisc_present;
> >>> +       int ret;
> >>>
> >>>         rcu_read_lock();
> >>>         tfile = rcu_dereference(tun->tfiles[txq]);
> >>> @@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>
> >>>         nf_reset_ct(skb);
> >>>
> >>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>> +       queue = netdev_get_tx_queue(dev, txq);
> >>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>> +
> >>> +       spin_lock(&tfile->tx_ring.producer_lock);
> >>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>
> >> So, it's possible that the administrator is switching between noqueue
> >> and another qdisc. So ptr_ring_produce() can fail here, do we need to
> >> check that or not?
> >>
> >
> > Do you mean that? My thoughts:
> >
> > Switching from noqueue to some qdisc can cause a
> >
> > net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> >                                            dev->name);
> >
> > followed by a return of -ENETDOWN in __dev_queue_xmit().
> > This is because tun_net_xmit detects some qdisc with
> >
> > qdisc_present = !qdisc_txq_has_no_queue(queue);
> >
> > and returns NETDEV_TX_BUSY even though __dev_queue_xmit() did still
> > detect noqueue.
> >
> > I am not sure how to solve this/if this has to be solved.
> > I do not see a proper way to avoid parallel execution of ndo_start_xmit
> > and a qdisc change (dev_graft_qdisc only takes qdisc_skb_head lock).
> >
> > And from my understanding the veth implementation faces the same issue.
>
> How about rechecking if a qdisc is connected?
> This would avoid -ENETDOWN.
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f48dc299e4b2..2731a1a70732 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4845,10 +4845,17 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>                                 if (is_list)
>                                         rc = NETDEV_TX_OK;
>                         }
> +                       bool qdisc_present = !qdisc_txq_has_no_queue(txq);
>                         HARD_TX_UNLOCK(dev, txq);
>                         if (!skb) /* xmit completed */
>                                 goto out;
>
> +                       /* Maybe a qdisc was connected in the meantime */
> +                       if (qdisc_present) {
> +                               kfree_skb(skb);
> +                               goto out;
> +                       }
> +
>                         net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>                                              dev->name);
>                         /* NETDEV_TX_BUSY or queue was stopped */
>

Probably not, and we likely won't hit this warning because qdisc could
not be changed during ndo_start_xmit().

I meant something like this:

1) set noqueue to tuntap
2) produce packets so tuntap is full
3) set e.g fq_codel to tuntap
4) then we can hit the failure of __ptr_ring_produce()

Rethink of the code, it looks just fine.

>
> >
> >
> > Switching from some qdisc to noqueue is no problem I think.
> >
> >>> +               netif_tx_stop_queue(queue);
> >>> +               /* Avoid races with queue wake-ups in __tun_wake_queue by
> >>> +                * waking if space is available in a re-check.
> >>> +                * The barrier makes sure that the stop is visible before
> >>> +                * we re-check.
> >>> +                */
> >>> +               smp_mb__after_atomic();
> >>
> >> Let's document which barrier is paired with this.
> >>
> >
> > I am basically copying the (old) logic of veth [1] proposed by
> > Jakub Kicinski. I must admit I am not 100% sure what it pairs with.
> >
> > [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/

So it looks like it implicitly tries to pair with tun_ring_consume():

1) spinlock(consumer_lock)
2) store NULL to ptr_ring // STORE
3) spinunlock(consumer_lock) // RELEASE
4) spinlock(consumer_lock) // ACQURE
5) check empty
6) spinunlock(consumer_lock)
7) netif_wakeup_queue() // test_and_set() which is an RMW

RELEASE + ACQUIRE implies a full barrier

I see several problems

1) Due to batch consumption, we may get spurious wakeups under heavy
load (we can try disabling batch consuming to see if it helps).
2) So the barriers don't help but would slow down the consuming
3) Two spinlocks were used instead of one, this is another reason we
will see a performance regression
4) Tricky code that needs to be understood or at least requires a comment tweak.

Note that due to ~IFF_TX_SKB_SHARING, pktgen can't clone skbs, so we
may not notice the real degradation.

> >
> >>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>> +                       netif_tx_wake_queue(queue);
> >>> +       }
> >>> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >>> +
> >>> +       if (ret) {
> >>> +               /* If a qdisc is attached to our virtual device,
> >>> +                * returning NETDEV_TX_BUSY is allowed.
> >>> +                */
> >>> +               if (qdisc_present) {
> >>> +                       rcu_read_unlock();
> >>> +                       return NETDEV_TX_BUSY;
> >>> +               }
> >>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>                 goto drop;
> >>>         }
> >>>
> >>>         /* dev->lltx requires to do our own update of trans_start */
> >>> -       queue = netdev_get_tx_queue(dev, txq);
> >>>         txq_trans_cond_update(queue);
> >>>
> >>>         /* Notify and wake up reader process */
> >>> --
> >>> 2.43.0
> >>>
> >>
> >> Thanks
> >>
>

Thanks


  reply	other threads:[~2026-03-26  2:41 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-12 13:06 [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
2026-03-12 13:06 ` [PATCH net-next v8 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
2026-03-24  1:47   ` Jason Wang
2026-03-12 13:06 ` [PATCH net-next v8 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume Simon Schippers
2026-03-12 13:54   ` Michael S. Tsirkin
2026-03-24  1:47   ` Jason Wang
2026-03-12 13:06 ` [PATCH net-next v8 3/4] ptr_ring: move free-space check into separate helper Simon Schippers
2026-03-12 13:17   ` Eric Dumazet
2026-03-12 13:48     ` Michael S. Tsirkin
2026-03-12 14:21       ` Eric Dumazet
2026-03-25 11:07         ` Michael S. Tsirkin
2026-03-12 13:06 ` [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present Simon Schippers
2026-03-24  1:47   ` Jason Wang
2026-03-24 10:14     ` Simon Schippers
2026-03-25 14:47       ` Simon Schippers
2026-03-26  2:41         ` Jason Wang [this message]
2026-03-12 13:55 ` [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Michael S. Tsirkin
2026-03-13  9:49   ` Simon Schippers
2026-03-13 10:35     ` Michael S. Tsirkin
2026-03-23 21:49 ` Simon Schippers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CACGkMEuLqxnz=GtdKp3-u0egkGT_eZUgBBwvvtHcgAYpLJv-UA@mail.gmail.com' \
    --to=jasowang@redhat.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=eperezma@redhat.com \
    --cc=jon@nutanix.com \
    --cc=kuba@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=leiyang@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=simon.schippers@tu-dortmund.de \
    --cc=stephen@networkplumber.org \
    --cc=tim.gebauer@tu-dortmund.de \
    --cc=virtualization@lists.linux.dev \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox