All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Simon Schippers <simon.schippers@tu-dortmund.de>
Cc: Jason Wang <jasowang@redhat.com>,
	willemdebruijn.kernel@gmail.com, andrew+netdev@lunn.ch,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, eperezma@redhat.com, leiyang@redhat.com,
	stephen@networkplumber.org, jon@nutanix.com,
	tim.gebauer@tu-dortmund.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	virtualization@lists.linux.dev
Subject: Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Date: Mon, 12 Jan 2026 06:18:14 -0500	[thread overview]
Message-ID: <20260112061721-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <9aaf2420-089d-4fd9-824d-24982a86a70d@tu-dortmund.de>

On Mon, Jan 12, 2026 at 12:08:14PM +0100, Simon Schippers wrote:
> On 1/12/26 03:22, Jason Wang wrote:
> > On Fri, Jan 9, 2026 at 6:15 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/9/26 07:09, Jason Wang wrote:
> >>> On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 1/8/26 05:37, Jason Wang wrote:
> >>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>>>>> reaches capacity, the netdev queue is stopped instead of dropping
> >>>>>> subsequent packets.
> >>>>>>
> >>>>>> If producing an entry fails anyways, the tun_net_xmit returns
> >>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> >>>>>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
> >>>>>>
> >>>>>> The existing __{tun,tap}_ring_consume functions free space in the
> >>>>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
> >>>>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
> >>>>>> this, a memory barrier is enforced (as discussed in a similar
> >>>>>> implementation in [1]), followed by a recheck that wakes the queue if
> >>>>>> space is already available.
> >>>>>>
> >>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>>>>>
> >>>>>> +-------------------------+-----------+---------------+----------------+
> >>>>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> >>>>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> >>>>>> | 10M packets             |           |               |                |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> >>>>>> |           +-------------+-----------+---------------+----------------+
> >>>>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> >>>>>> |  +        +-------------+-----------+---------------+----------------+
> >>>>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>>
> >>>>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>>>>>
> >>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>> ---
> >>>>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
> >>>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>> index 71b6981d07d7..74d7fd09e9ba 100644
> >>>>>> --- a/drivers/net/tun.c
> >>>>>> +++ b/drivers/net/tun.c
> >>>>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>         struct netdev_queue *queue;
> >>>>>>         struct tun_file *tfile;
> >>>>>>         int len = skb->len;
> >>>>>> +       bool qdisc_present;
> >>>>>> +       int ret;
> >>>>>>
> >>>>>>         rcu_read_lock();
> >>>>>>         tfile = rcu_dereference(tun->tfiles[txq]);
> >>>>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>
> >>>>>>         nf_reset_ct(skb);
> >>>>>>
> >>>>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>>>> +       queue = netdev_get_tx_queue(dev, txq);
> >>>>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>>>> +
> >>>>>> +       spin_lock(&tfile->tx_ring.producer_lock);
> >>>>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>>>> +               netif_tx_stop_queue(queue);
> >>>>>> +               /* Avoid races with queue wake-up in
> >>>>>> +                * __{tun,tap}_ring_consume by waking if space is
> >>>>>> +                * available in a re-check.
> >>>>>> +                * The barrier makes sure that the stop is visible before
> >>>>>> +                * we re-check.
> >>>>>> +                */
> >>>>>> +               smp_mb__after_atomic();
> >>>>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>>>> +                       netif_tx_wake_queue(queue);
> >>>>>
> >>>>> I'm not sure I will get here, but I think those should be moved to the
> >>>>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
> >>>>> need to bother with those queue stop/wake logic?
> >>>>
> >>>> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
> >>>> we stop the queue proactively.
> >>>
> >>> This seems to conflict with the following NETDEV_TX_BUSY. Or is
> >>> NETDEV_TX_BUSY prepared for the xdp_xmit?
> >>
> >> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> > 
> > No, I mean I don't understand why we still need to peek since we've
> > already used NETDEV_TX_BUSY.
> 
> Yes, if __ptr_ring_produce() returns -ENOSPC, there is no need to check
> __ptr_ring_produce_peek(). I agree with you on this point and will update
> the code accordingly. In all other cases, checking
> __ptr_ring_produce_peek() is still required in order to proactively stop
> the queue.
> 
> > 
> >> And I do not understand the connection with xdp_xmit.
> > 
> > Since there's we don't modify xdp_xmit path, so even if we peek next
> > ndo_start_xmit can still hit ring full.
> 
> Ah okay. Would you apply the same stop-and-recheck logic in
> tun_xdp_xmit when __ptr_ring_produce() fails to produce it, or is that
> not permitted there?
> 
> Apart from that, as noted in the commit message, since we are using LLTX,
> hitting a full ring is still possible anyway. I could see that especially
> at multiqueue tests with pktgen by looking at the qdisc requeues.
> 
> Thanks

If it's an exceptional rare condition (i.e. a race), it's not a big deal. That's why
NETDEV_TX_BUSY is there in the 1st place. If it's the main modus
operadi - not good.


> > 
> > Thanks
> > 
> >>
> >>>
> >>>>
> >>>> I believe what you are aiming for is to always stop the queue if(ret),
> >>>> which I can agree with. In that case, I would simply change the condition
> >>>> to:
> >>>>
> >>>> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
> >>>>
> >>>>>
> >>>>>> +       }
> >>>>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >>>>>> +
> >>>>>> +       if (ret) {
> >>>>>> +               /* If a qdisc is attached to our virtual device,
> >>>>>> +                * returning NETDEV_TX_BUSY is allowed.
> >>>>>> +                */
> >>>>>> +               if (qdisc_present) {
> >>>>>> +                       rcu_read_unlock();
> >>>>>> +                       return NETDEV_TX_BUSY;
> >>>>>> +               }
> >>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>>>                 goto drop;
> >>>>>>         }
> >>>>>>
> >>>>>>         /* dev->lltx requires to do our own update of trans_start */
> >>>>>> -       queue = netdev_get_tx_queue(dev, txq);
> >>>>>>         txq_trans_cond_update(queue);
> >>>>>>
> >>>>>>         /* Notify and wake up reader process */
> >>>>>> --
> >>>>>> 2.43.0
> >>>>>>
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> > 


  reply	other threads:[~2026-01-12 11:18 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 1/9] ptr_ring: move free-space check into separate helper Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume Simon Schippers
2026-01-08  3:23   ` Jason Wang
2026-01-08  7:20     ` Simon Schippers
2026-01-09  6:01       ` Jason Wang
2026-01-09  6:47         ` Michael S. Tsirkin
2026-01-09  7:22   ` Michael S. Tsirkin
2026-01-09  7:35     ` Simon Schippers
2026-01-09  8:31       ` Michael S. Tsirkin
2026-01-09  9:06         ` Simon Schippers
2026-01-12 16:29           ` Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
2026-01-08  3:38   ` Jason Wang
2026-01-08  7:40     ` Simon Schippers
2026-01-09  6:02       ` Jason Wang
2026-01-09  9:31         ` Simon Schippers
2026-01-21  9:32         ` Simon Schippers
2026-01-22  5:35           ` Jason Wang
2026-01-23  3:05             ` Jason Wang
2026-01-23  9:54               ` Simon Schippers
2026-01-27 16:47                 ` Simon Schippers
2026-01-28  7:03                   ` Jason Wang
2026-01-28  7:53                     ` Simon Schippers
2026-01-29  1:14                       ` Jason Wang
2026-01-29  9:24                         ` Simon Schippers
2026-01-30  1:51                           ` Jason Wang
2026-02-01 20:19                             ` Simon Schippers
2026-02-03  3:48                               ` Jason Wang
2026-02-04 15:43                                 ` Simon Schippers
2026-02-05  3:59                                   ` Jason Wang
2026-02-05 22:28                                     ` Simon Schippers
2026-02-06  3:21                                       ` Jason Wang
2026-02-08 18:18                                         ` Simon Schippers
2026-02-12  0:12                                           ` Simon Schippers
2026-02-12  7:06                                             ` Michael S. Tsirkin
2026-02-12  8:03                                               ` Simon Schippers
2026-02-12  8:14                                           ` Jason Wang
2026-02-14 17:13                                             ` Simon Schippers
2026-02-14 18:18                                               ` Michael S. Tsirkin
2026-02-14 19:51                                                 ` Simon Schippers
2026-02-14 23:49                                                   ` Michael S. Tsirkin
2026-02-15 10:38                                                   ` Michael S. Tsirkin
2026-02-16 13:27                                                     ` Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 4/9] tun/tap: add batched ptr_ring consume functions " Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 5/9] tun/tap: add unconsume function for returning entries to ptr_ring Simon Schippers
2026-01-08  3:40   ` Jason Wang
2026-01-07 21:04 ` [PATCH net-next v7 6/9] tun/tap: add helper functions to check file type Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers Simon Schippers
2026-01-08  4:38   ` Jason Wang
2026-01-08  7:47     ` Simon Schippers
2026-01-09  6:04       ` Jason Wang
2026-01-09  9:57         ` Simon Schippers
2026-01-12  2:54           ` Jason Wang
2026-01-12  4:42             ` Michael S. Tsirkin
2026-01-07 21:04 ` [PATCH net-next v7 8/9] tun/tap: drop get ring exports Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present Simon Schippers
2026-01-08  4:37   ` Jason Wang
2026-01-08  8:01     ` Simon Schippers
2026-01-09  6:09       ` Jason Wang
2026-01-09 10:14         ` Simon Schippers
2026-01-12  2:22           ` Jason Wang
2026-01-12 11:08             ` Simon Schippers
2026-01-12 11:18               ` Michael S. Tsirkin [this message]
2026-01-13  6:26               ` Jason Wang
2026-01-12  4:33           ` Michael S. Tsirkin
2026-01-12 11:17             ` Simon Schippers
2026-01-12 11:19               ` Michael S. Tsirkin
2026-01-12 11:28                 ` Simon Schippers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260112061721-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=eperezma@redhat.com \
    --cc=jasowang@redhat.com \
    --cc=jon@nutanix.com \
    --cc=kuba@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=leiyang@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=simon.schippers@tu-dortmund.de \
    --cc=stephen@networkplumber.org \
    --cc=tim.gebauer@tu-dortmund.de \
    --cc=virtualization@lists.linux.dev \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.