From: Simon Schippers <simon.schippers@tu-dortmund.de>
To: willemdebruijn.kernel@gmail.com, jasowang@redhat.com,
andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, mst@redhat.com,
eperezma@redhat.com, leiyang@redhat.com,
stephen@networkplumber.org, jon@nutanix.com,
tim.gebauer@tu-dortmund.de, simon.schippers@tu-dortmund.de,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, virtualization@lists.linux.dev
Subject: [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
Date: Tue, 28 Apr 2026 14:38:59 +0200
Message-ID: <20260428123859.19578-5-simon.schippers@tu-dortmund.de>
In-Reply-To: <20260428123859.19578-1-simon.schippers@tu-dortmund.de>
This commit prevents tail-drop when a qdisc is present and the ptr_ring
becomes full. Once an entry is successfully produced and the ptr_ring
reaches capacity, the netdev queue is stopped instead of dropping
subsequent packets.
If producing an entry fails anyway due to a race, tun_net_xmit returns
NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
LLTX is enabled and the transmit path operates without the usual locking.
If no qdisc is present, the previous tail-drop behavior is preserved.
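
The decision flow described above can be sketched as a small userspace
model. This is only an illustration under stated assumptions: every name
in it (model_ring, model_xmit, XMIT_BUSY, ...) is hypothetical, the ring is
a simplified stand-in for ptr_ring, and a plain boolean replaces the netdev
queue state. It performs no locking and does not model the race handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define RING_SIZE 4

enum xmit_result { XMIT_OK, XMIT_BUSY, XMIT_DROPPED };

struct model_ring {
	void *slot[RING_SIZE];	/* NULL marks a free slot, as in ptr_ring */
	size_t producer;	/* index of the next slot to fill */
	bool queue_stopped;	/* stands in for the netdev queue state */
};

/* The ring is full when the next producer slot is still occupied. */
static bool model_ring_full(const struct model_ring *r)
{
	return r->slot[r->producer] != NULL;
}

/* Returns 0 on success, -1 when the ring is full. */
static int model_produce(struct model_ring *r, void *pkt)
{
	assert(pkt != NULL);	/* NULL is reserved for "slot is free" */
	if (model_ring_full(r))
		return -1;
	r->slot[r->producer] = pkt;
	r->producer = (r->producer + 1) % RING_SIZE;
	return 0;
}

static enum xmit_result model_xmit(struct model_ring *r, void *pkt,
				   bool qdisc_present)
{
	int ret = model_produce(r, pkt);

	/* Stop the queue as soon as the ring is full, so that the qdisc
	 * holds back further packets instead of handing them over.
	 */
	if (qdisc_present && model_ring_full(r))
		r->queue_stopped = true;

	if (ret)	/* produce failed: requeue with a qdisc, else drop */
		return qdisc_present ? XMIT_BUSY : XMIT_DROPPED;
	return XMIT_OK;
}
```

With RING_SIZE set to 4 and a qdisc present, the fourth successful
model_xmit() call fills the ring and stops the queue; a fifth call returns
XMIT_BUSY rather than dropping. Without a qdisc the fifth call returns
XMIT_DROPPED and the queue is never stopped.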
The existing __tun_wake_queue() function of the consumer races with the
producer for waking/stopping the netdev queue: the consumer may drain
the ring just as the producer stops the queue, leading to a permanent
stall. To avoid this, the producer re-checks the ring after stopping
and wakes the queue itself if space was just made. An
smp_mb__after_atomic() is required so the re-peek of the ring sees any
drain that the consumer performed.
smp_mb__after_atomic() pairs with the test_and_clear_bit() inside of
netif_wake_subqueue():
Consumer CPU                  Producer CPU
========================      =========================
__ptr_ring_consume()
netif_wake_subqueue()         netif_tx_stop_queue()
          /\                  smp_mb__after_atomic()
          ||                  __ptr_ring_produce_peek()
contains RMW operation
test_and_clear_bit()
          /\
          ||
"Fully ordered RMW:
 smp_mb() before + after"
 - atomic_t.txt
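
The pairing above follows the classic store-buffering pattern: each side
writes one location, issues a full barrier, and then reads the location
the other side wrote. A minimal C11 sketch of that pattern follows. It is
an assumption-laden model, not the kernel code: model_state and both
functions are hypothetical, a single flag stands in for "the ptr_ring is
full", and seq_cst fences stand in for smp_mb__after_atomic() and for the
barriers implied by the fully ordered test_and_clear_bit().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model; these names are not kernel APIs. */
struct model_state {
	atomic_bool ring_full;		/* stands in for a full ptr_ring */
	atomic_bool queue_stopped;	/* stands in for the queue-stopped bit */
};

/* Producer side, run after an entry has filled the last free slot. */
static void model_producer_stop(struct model_state *s)
{
	/* netif_tx_stop_queue() stand-in */
	atomic_store_explicit(&s->queue_stopped, true, memory_order_relaxed);

	/* smp_mb__after_atomic() stand-in: order the store above before
	 * the re-check below.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* Re-peek: if the consumer already made space, wake ourselves. */
	if (!atomic_load_explicit(&s->ring_full, memory_order_relaxed))
		atomic_store_explicit(&s->queue_stopped, false,
				      memory_order_relaxed);
}

/* Consumer side. Returns true if it observed a stopped queue and
 * cleared the bit, i.e. the caller would now wake the queue.
 */
static bool model_consumer_drain(struct model_state *s)
{
	/* __ptr_ring_consume() stand-in: space becomes available */
	atomic_store_explicit(&s->ring_full, false, memory_order_relaxed);

	/* Full barrier that the kernel's fully ordered RMW implies
	 * before the operation.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* test_and_clear_bit() stand-in */
	return atomic_exchange_explicit(&s->queue_stopped, false,
					memory_order_relaxed);
}
```

With a full fence on both sides, at least one of the two observations must
succeed: either the consumer sees the stopped bit and wakes the queue, or
the producer sees the freed space and wakes it itself. Dropping the
producer-side fence would allow both checks to miss, which is the stall
described above.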
Benchmarks:
The benchmarks show a slight regression in raw transmission performance,
though no packets are lost anymore.
The previously introduced threshold, which wakes the queue only after it
was stopped and half of the ring has been consumed, proved to be a decent
choice:
Waking the queue whenever a consume made space in the ring strongly
degrades performance for tap, while waking only when the ring is empty
is too late and also hurts throughput for tap & tap+vhost-net.
Other ratios (3/4, 7/8) showed similar results (not shown here), so
1/2 was chosen for the sake of simplicity for both tun/tap and
tun/tap+vhost-net.
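
The wake threshold can be illustrated with a small model. How the series
actually tracks consumed entries is not shown in this patch, so the
counter and all names below (model_consumer, model_consume_one) are
hypothetical and only sketch the "stopped and half of the ring consumed"
rule.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; these names are not kernel APIs. */
struct model_consumer {
	size_t ring_size;		/* number of slots in the ring */
	size_t consumed_since_stop;	/* entries consumed while stopped */
	bool queue_stopped;		/* stands in for the netdev queue state */
};

/* Call once per consumed entry. Returns true if this call woke the queue. */
static bool model_consume_one(struct model_consumer *c)
{
	/* A running queue needs no wakeup. */
	if (!c->queue_stopped)
		return false;

	/* Wait until half of the ring has been consumed before waking. */
	if (++c->consumed_since_stop < c->ring_size / 2)
		return false;

	c->queue_stopped = false;
	c->consumed_since_stop = 0;
	return true;
}
```

For a stopped queue and a ring of 8 entries, the first three calls return
false and the fourth wakes the queue. Replacing `ring_size / 2` with 1
corresponds to waking on every consume, and replacing it with `ring_size`
corresponds to waking only on an empty ring, the two extremes that the
benchmarks found to hurt throughput.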
Test setup:
AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
mitigations disabled.
Note for tap+vhost-net:
XDP drop program active in VM -> ~2.5x faster, slower for tap due to
more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
+--------------------------+--------------+----------------+----------+
| 1 thread                 | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
+--------------------------+--------------+----------------+----------+
| 2 threads                | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index efe809597622..c2a1618cc9db 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct netdev_queue *queue;
 	struct tun_file *tfile;
 	int len = skb->len;
+	bool qdisc_present;
+	int ret;
 
 	rcu_read_lock();
 	tfile = rcu_dereference(tun->tfiles[txq]);
@@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset_ct(skb);
 
-	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+	queue = netdev_get_tx_queue(dev, txq);
+	qdisc_present = !qdisc_txq_has_no_queue(queue);
+
+	spin_lock(&tfile->tx_ring.producer_lock);
+	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
+	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
+		netif_tx_stop_queue(queue);
+		/* Re-peek and wake if the consumer drained the ring
+		 * concurrently in a race. smp_mb__after_atomic() pairs
+		 * with the test_and_clear_bit() of netif_wake_subqueue()
+		 * in __tun_wake_queue().
+		 */
+		smp_mb__after_atomic();
+		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
+			netif_tx_wake_queue(queue);
+	}
+	spin_unlock(&tfile->tx_ring.producer_lock);
+
+	if (ret) {
+		/* If a qdisc is attached to our virtual device,
+		 * returning NETDEV_TX_BUSY is allowed.
+		 */
+		if (qdisc_present) {
+			rcu_read_unlock();
+			return NETDEV_TX_BUSY;
+		}
 		drop_reason = SKB_DROP_REASON_FULL_RING;
 		goto drop;
 	}
 
 	/* dev->lltx requires to do our own update of trans_start */
-	queue = netdev_get_tx_queue(dev, txq);
 	txq_trans_cond_update(queue);
 
 	/* Notify and wake up reader process */
--
2.43.0