From: "Michael S. Tsirkin" <mst@redhat.com>
To: Simon Schippers <simon.schippers@tu-dortmund.de>
Cc: willemdebruijn.kernel@gmail.com, jasowang@redhat.com,
andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, eperezma@redhat.com,
leiyang@redhat.com, stephen@networkplumber.org, jon@nutanix.com,
tim.gebauer@tu-dortmund.de, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
virtualization@lists.linux.dev
Subject: Re: [PATCH net-next v10 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
Date: Wed, 6 May 2026 18:56:04 -0400 [thread overview]
Message-ID: <20260506185515-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <20260506181909-mutt-send-email-mst@kernel.org>
On Wed, May 06, 2026 at 06:28:06PM -0400, Michael S. Tsirkin wrote:
> On Wed, May 06, 2026 at 04:10:33PM +0200, Simon Schippers wrote:
> > This commit prevents tail-drop when a qdisc is present and the ptr_ring
> > becomes full. Once an entry is successfully produced and the ptr_ring
> > reaches capacity, the netdev queue is stopped instead of dropping
> > subsequent packets. If no qdisc is present, the previous tail-drop
> > behavior is preserved.
> >
> > If producing an entry fails anyways due to a race, tun_net_xmit() drops
> > the packet. Such races are expected because LLTX is enabled and the
> > transmit path operates without the usual locking.
> >
> > The __tun_wake_queue() function of the consumer races with the producer
> > for waking/stopping the netdev queue, which could result in a stalled
> > queue. Therefore, an smp_mb__after_atomic() is introduced that pairs
> > with the smp_mb() of the consumer. It follows the principle of store
> > buffering described in tools/memory-model/Documentation/recipes.txt:
> >
> > - The producer in tun_net_xmit() first sets __QUEUE_STATE_DRV_XOFF,
> > followed by an smp_mb__after_atomic() (= smp_mb()), and then reads the
> > ring with __ptr_ring_produce_peek().
> >
> > - The consumer in __tun_wake_queue() first writes zero to the ring in
> > __ptr_ring_consume(), followed by an smp_mb(), and then reads the queue
> > status with netif_tx_queue_stopped().
> >
> > => Following the aforementioned principle, it is impossible for the
> > producer to see a full ring (and therefore not wake the queue on the
> > re-check) while the consumer simultaneously fails to see a stopped
> > queue (and therefore also does not wake it).
> >
> > Benchmarks:
> > The benchmarks show a slight regression in raw transmission performance
> > when using two sending threads. Packet loss also occurs only in the
> > two-thread sending case; no packet loss was observed with a single
> > sending thread.
> >
> > Test setup:
> > AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> > Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
> > mitigations disabled.
> >
> > Note for tap+vhost-net:
> > XDP drop program active in VM -> ~2.5x faster; slower for tap due to
> > more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
> >
> > +--------------------------+--------------+----------------+----------+
> > | 1 thread | Stock | Patched with | diff |
> > | sending | | fq_codel qdisc | |
> > +------------+-------------+--------------+----------------+----------+
> > | TAP | Received | 1.132 Mpps | 1.133 Mpps | +0.1% |
> > | +-------------+--------------+----------------+----------+
> > | | Lost/s | 3.765 Mpps | 0 pps | |
> > +------------+-------------+--------------+----------------+----------+
> > | TAP | Received | 3.857 Mpps | 3.905 Mpps | +1.2% |
> > | +-------------+--------------+----------------+----------+
> > | +vhost-net | Lost/s | 0.802 Mpps | 0 pps | |
> > +------------+-------------+--------------+----------------+----------+
> >
> > +--------------------------+--------------+----------------+----------+
> > | 2 threads | Stock | Patched with | diff |
> > | sending | | fq_codel qdisc | |
> > +------------+-------------+--------------+----------------+----------+
> > | TAP | Received | 1.115 Mpps | 1.092 Mpps | -2.1% |
> > | +-------------+--------------+----------------+----------+
> > | | Lost/s | 8.490 Mpps | 359 pps | |
> > +------------+-------------+--------------+----------------+----------+
> > | TAP | Received | 3.664 Mpps | 3.549 Mpps | -3.1% |
> > | +-------------+--------------+----------------+----------+
> > | +vhost-net | Lost/s | 5.330 Mpps | 832 pps | |
> > +------------+-------------+--------------+----------------+----------+
> >
> > Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > ---
> > drivers/net/tun.c | 25 +++++++++++++++++++++++--
> > 1 file changed, 23 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index fc358c4c355b..d9ffbf88cfd8 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -1018,6 +1018,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> > struct netdev_queue *queue;
> > struct tun_file *tfile;
> > int len = skb->len;
> > + int ret;
> >
> > rcu_read_lock();
> > tfile = rcu_dereference(tun->tfiles[txq]);
> > @@ -1072,13 +1073,33 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >
> > nf_reset_ct(skb);
> >
> > - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> > + queue = netdev_get_tx_queue(dev, txq);
> > +
> > + spin_lock(&tfile->tx_ring.producer_lock);
> > + ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> > + if (!qdisc_txq_has_no_queue(queue) &&
> > + (__ptr_ring_produce_peek(&tfile->tx_ring) || ret)) {
> > + netif_tx_stop_queue(queue);
> > + /* Paired with smp_mb() in __tun_wake_queue() */
> > + smp_mb__after_atomic();
> > + if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> > + netif_tx_wake_queue(queue);
> > + }
> > + spin_unlock(&tfile->tx_ring.producer_lock);
> > +
>
> There's a weird corner case here when tx_queue_len is 0
> but a qdisc has been configured - it looks like that
> currently it just drops all packets, with this change,
> the qdisc will get stuck permanently.
>
> I suspect just checking tx_ring.size should fix it.
> Or if you feel adventurous, change return code for __ptr_ring_produce
> to distinguish between "no ring" and "no space".
__ptr_ring_produce_peek really.
>
> > + if (ret) {
> > + /* This should be a rare case if a qdisc is present, but
> > + * can happen due to lltx.
> > + * Since skb_tx_timestamp(), skb_orphan(),
> > + * run_ebpf_filter() and pskb_trim() could have tinkered
> > + * with the SKB, returning NETDEV_TX_BUSY is unsafe and
> > + * we must drop instead.
> > + */
> > drop_reason = SKB_DROP_REASON_FULL_RING;
> > goto drop;
> > }
> >
> > /* dev->lltx requires to do our own update of trans_start */
> > - queue = netdev_get_tx_queue(dev, txq);
> > txq_trans_cond_update(queue);
> >
> > /* Notify and wake up reader process */
> > --
> > 2.43.0
next prev parent reply other threads:[~2026-05-06 22:56 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-06 14:10 [PATCH net-next v10 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
2026-05-06 14:10 ` [PATCH net-next v10 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
2026-05-06 22:18 ` Michael S. Tsirkin
2026-05-07 6:21 ` Simon Schippers
2026-05-06 14:10 ` [PATCH net-next v10 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume Simon Schippers
2026-05-06 14:10 ` [PATCH net-next v10 3/4] ptr_ring: move free-space check into separate helper Simon Schippers
2026-05-06 14:10 ` [PATCH net-next v10 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present Simon Schippers
2026-05-06 22:28 ` Michael S. Tsirkin
2026-05-06 22:56 ` Michael S. Tsirkin [this message]
2026-05-07 6:32 ` Simon Schippers
2026-05-07 15:19 ` Simon Schippers
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260506185515-mutt-send-email-mst@kernel.org \
--to=mst@redhat.com \
--cc=andrew+netdev@lunn.ch \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=eperezma@redhat.com \
--cc=jasowang@redhat.com \
--cc=jon@nutanix.com \
--cc=kuba@kernel.org \
--cc=kvm@vger.kernel.org \
--cc=leiyang@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=simon.schippers@tu-dortmund.de \
--cc=stephen@networkplumber.org \
--cc=tim.gebauer@tu-dortmund.de \
--cc=virtualization@lists.linux.dev \
--cc=willemdebruijn.kernel@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.