Netdev List
 help / color / mirror / Atom feed
From: "Jonas Köppeler" <j.koeppeler@tu-berlin.de>
To: <hawk@kernel.org>, <netdev@vger.kernel.org>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	John Fastabend <john.fastabend@gmail.com>,
	Stanislav Fomichev <sdf@fomichev.me>,
	<linux-kernel@vger.kernel.org>, <bpf@vger.kernel.org>
Subject: Re: [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
Date: Thu, 28 May 2026 09:45:39 +0200	[thread overview]
Message-ID: <7b75ef64-fd7d-4323-a323-e520947ffbfa@tu-berlin.de> (raw)
In-Reply-To: <20260527135418.1166665-3-hawk@kernel.org>

On 5/27/26 3:54 PM, hawk@kernel.org wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
>
> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") gave qdiscs control over veth by returning
> NETDEV_TX_BUSY when the ptr_ring is full (DRV_XOFF).  That commit noted
> a known limitation: the 256-entry ptr_ring sits in front of the qdisc as
> a dark buffer, adding base latency because the qdisc has no visibility
> into how many bytes are already queued there.
>
> Add BQL support so the qdisc gets feedback and can begin shaping traffic
> before the ring fills.  In testing with fq_codel, BQL reduces ping RTT
> under UDP load from ~6.61ms to ~0.36ms (18x).
>
> Charge a fixed VETH_BQL_UNIT (1) per packet rather than skb->len, so
> the DQL limit tracks packets-in-flight.  Unlike a physical NIC, veth
> has no link speed -- the ptr_ring drains at CPU speed and is
> packet-indexed, not byte-indexed, so bytes are not the natural unit.
> With byte-based charging, small packets sneak many more entries into
> the ring before STACK_XOFF fires, deepening the dark buffer under
> mixed-size workloads.  Testing with a concurrent min-size packet flood
> shows 3.7x ping RTT degradation with skb->len charging versus no
> change with fixed-unit charging.
>
> Charge BQL inside veth_xdp_rx() under the ptr_ring producer_lock, after
> confirming the ring is not full.  The charge must precede the produce
> because the NAPI consumer can run on another CPU and complete the SKB
> the instant it becomes visible in the ring.  Doing both under the same
> lock avoids a pre-charge/undo pattern -- BQL is only charged when
> produce is guaranteed to succeed.
>
> BQL is enabled only when a real qdisc is attached (guarded by
> !qdisc_txq_has_no_queue), as HARD_TX_LOCK provides serialization
> for TXQ modification like dql_queued(). For lltx devices, like veth,
> this HARD_TX_LOCK serialization isn't provided.  The ptr_ring
> producer_lock provides additional serialization that would allow
> BQL to work correctly even with noqueue, though that combination
> is not currently enabled, as the netstack will drop and warn.
>
> Track per-SKB BQL state via a VETH_BQL_FLAG pointer tag in the ptr_ring
> entry.  This is necessary because the qdisc can be replaced live while
> SKBs are in-flight -- each SKB must carry the charge decision made at
> enqueue time rather than re-checking the peer's qdisc at completion.
>
> Complete per-SKB in veth_xdp_rcv() rather than in bulk, so STACK_XOFF
> clears promptly when producer and consumer run on different CPUs.
>
> BQL introduces a second independent queue-stop mechanism (STACK_XOFF)
> alongside the existing DRV_XOFF (ring full).  Both must be clear for
> the queue to transmit.  Reset BQL state in veth_napi_del_range() after
> synchronize_net() to avoid racing with in-flight veth_poll() calls.
> Clamp the reset loop to the peer's real_num_tx_queues, since the peer
> may have fewer TX queues than the local device has RX queues (e.g. when
> veth is enslaved to a bond with XDP attached).
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

Tested-by: Jonas Köppeler<j.koeppeler@tu-berlin.de>

> ---
>   drivers/net/veth.c | 86 ++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 75 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 0cfb19b760dd..21ff78533943 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -34,9 +34,13 @@
>   #define DRV_VERSION	"1.0"
>   
>   #define VETH_XDP_FLAG		BIT(0)
> +#define VETH_BQL_FLAG		BIT(1)
>   #define VETH_RING_SIZE		256
>   #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
>   
> +/* Fixed BQL charge: DQL limit tracks packets-in-flight, not bytes */
> +#define VETH_BQL_UNIT		1
> +
>   #define VETH_XDP_TX_BULK_SIZE	16
>   #define VETH_XDP_BATCH		16
>   
> @@ -280,6 +284,21 @@ static bool veth_is_xdp_frame(void *ptr)
>   	return (unsigned long)ptr & VETH_XDP_FLAG;
>   }
>   
> +static bool veth_ptr_is_bql(void *ptr)
> +{
> +	return (unsigned long)ptr & VETH_BQL_FLAG;
> +}
> +
> +static struct sk_buff *veth_ptr_to_skb(void *ptr)
> +{
> +	return (void *)((unsigned long)ptr & ~VETH_BQL_FLAG);
> +}
> +
> +static void *veth_skb_to_ptr(struct sk_buff *skb, bool bql)
> +{
> +	return bql ? (void *)((unsigned long)skb | VETH_BQL_FLAG) : skb;
> +}
> +
>   static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
>   {
>   	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
> @@ -295,7 +314,7 @@ static void veth_ptr_free(void *ptr)
>   	if (veth_is_xdp_frame(ptr))
>   		xdp_return_frame(veth_ptr_to_xdp(ptr));
>   	else
> -		kfree_skb(ptr);
> +		kfree_skb(veth_ptr_to_skb(ptr));
>   }
>   
>   static void __veth_xdp_flush(struct veth_rq *rq)
> @@ -309,19 +328,33 @@ static void __veth_xdp_flush(struct veth_rq *rq)
>   	}
>   }
>   
> -static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> +static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb, bool do_bql,
> +		       struct netdev_queue *txq)
>   {
> -	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
> +	struct ptr_ring *ring = &rq->xdp_ring;
> +
> +	spin_lock(&ring->producer_lock);
> +	if (unlikely(__ptr_ring_check_produce(ring))) {
> +		spin_unlock(&ring->producer_lock);
>   		return NETDEV_TX_BUSY; /* signal qdisc layer */
> +	}
> +
> +	/* BQL charge before produce; consumer cannot see entry yet */
> +	if (do_bql)
> +		netdev_tx_sent_queue(txq, VETH_BQL_UNIT);
> +
> +	__ptr_ring_produce(ring, veth_skb_to_ptr(skb, do_bql));
> +	spin_unlock(&ring->producer_lock);
>   
>   	return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>   }
>   
>   static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
> -			    struct veth_rq *rq, bool xdp)
> +			    struct veth_rq *rq, bool xdp, bool do_bql,
> +			    struct netdev_queue *txq)
>   {
>   	return __dev_forward_skb(dev, skb) ?: xdp ?
> -		veth_xdp_rx(rq, skb) :
> +		veth_xdp_rx(rq, skb, do_bql, txq) :
>   		__netif_rx(skb);
>   }
>   
> @@ -348,10 +381,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>   {
>   	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>   	struct veth_rq *rq = NULL;
> -	struct netdev_queue *txq;
> +	struct netdev_queue *txq = NULL;
>   	struct net_device *rcv;
>   	int length = skb->len;
>   	bool use_napi = false;
> +	bool do_bql = false;
>   	int ret, rxq;
>   
>   	rcu_read_lock();
> @@ -375,8 +409,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>   	}
>   
>   	skb_tx_timestamp(skb);
> -
> -	ret = veth_forward_skb(rcv, skb, rq, use_napi);
> +	if (rxq < dev->real_num_tx_queues) {
> +		txq = netdev_get_tx_queue(dev, rxq);
> +		/* BQL charge happens inside veth_xdp_rx() under producer_lock */
> +		do_bql = use_napi && !qdisc_txq_has_no_queue(txq);
> +	}
> +	ret = veth_forward_skb(rcv, skb, rq, use_napi, do_bql, txq);
>   	switch (ret) {
>   	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
>   		if (!use_napi)
> @@ -412,6 +450,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>   		net_crit_ratelimited("%s(%s): Invalid return code(%d)",
>   				     __func__, dev->name, ret);
>   	}
> +
>   	rcu_read_unlock();
>   
>   	return ret;
> @@ -900,7 +939,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
>   
>   static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>   			struct veth_xdp_tx_bq *bq,
> -			struct veth_stats *stats)
> +			struct veth_stats *stats,
> +			struct netdev_queue *peer_txq)
>   {
>   	int i, done = 0, n_xdpf = 0;
>   	void *xdpf[VETH_XDP_BATCH];
> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>   			}
>   		} else {
>   			/* ndo_start_xmit */
> -			struct sk_buff *skb = ptr;
> +			bool bql_charged = veth_ptr_is_bql(ptr);
> +			struct sk_buff *skb = veth_ptr_to_skb(ptr);
>   
>   			stats->xdp_bytes += skb->len;
> +			if (peer_txq && bql_charged)
> +				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
> +
>   			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
>   			if (skb) {
>   				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
> @@ -976,7 +1020,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
>   		   netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>   
>   	xdp_set_return_frame_no_direct();
> -	done = veth_xdp_rcv(rq, budget, &bq, &stats);
> +	done = veth_xdp_rcv(rq, budget, &bq, &stats, peer_txq);
>   
>   	if (stats.xdp_redirect > 0)
>   		xdp_do_flush();
> @@ -1074,6 +1118,7 @@ static int __veth_napi_enable(struct net_device *dev)
>   static void veth_napi_del_range(struct net_device *dev, int start, int end)
>   {
>   	struct veth_priv *priv = netdev_priv(dev);
> +	struct net_device *peer;
>   	int i;
>   
>   	for (i = start; i < end; i++) {
> @@ -1092,6 +1137,24 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
>   		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
>   	}
>   
> +	/* Reset BQL and wake stopped peer txqs.  A concurrent veth_xmit()
> +	 * may have set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
> +	 * synchronize_net(), and NAPI can no longer clear it.
> +	 * Only wake when the device is still up.
> +	 */
> +	peer = rtnl_dereference(priv->peer);
> +	if (peer) {
> +		int peer_end = min_t(int, end, peer->real_num_tx_queues);
> +
> +		for (i = start; i < peer_end; i++) {
> +			struct netdev_queue *txq = netdev_get_tx_queue(peer, i);
> +
> +			netdev_tx_reset_queue(txq);
> +			if (netif_running(dev))
> +				netif_tx_wake_queue(txq);
> +		}
> +	}
> +
>   	for (i = start; i < end; i++) {
>   		page_pool_destroy(priv->rq[i].page_pool);
>   		priv->rq[i].page_pool = NULL;
> @@ -1741,6 +1804,7 @@ static void veth_setup(struct net_device *dev)
>   	dev->priv_flags |= IFF_PHONY_HEADROOM;
>   	dev->priv_flags |= IFF_DISABLE_NETPOLL;
>   	dev->lltx = true;
> +	dev->bql = true;
>   
>   	dev->netdev_ops = &veth_netdev_ops;
>   	dev->xdp_metadata_ops = &veth_xdp_metadata_ops;

  reply	other threads:[~2026-05-28  7:45 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
2026-05-27 13:54 ` [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
2026-05-27 13:54 ` [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
2026-05-28  7:45   ` Jonas Köppeler [this message]
2026-06-04  8:19   ` Paolo Abeni
2026-06-10 12:21     ` Jesper Dangaard Brouer
2026-05-27 13:54 ` [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net hawk
2026-06-04  8:24   ` Paolo Abeni
2026-06-10 12:37     ` Jesper Dangaard Brouer
2026-05-27 13:54 ` [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
2026-06-04  8:30   ` Paolo Abeni
2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
2026-05-28  7:46   ` Jonas Köppeler
2026-06-01 12:00     ` Simon Schippers
2026-06-01 14:03       ` Jonas Köppeler
2026-06-01 16:16         ` Simon Schippers
2026-06-02  7:24           ` Jonas Köppeler
2026-06-02 15:37             ` Simon Schippers
2026-06-03  8:28               ` Jonas Köppeler
2026-05-29 14:51   ` Simon Schippers
2026-06-04  8:21   ` Paolo Abeni
2026-06-08 10:38   ` Simon Schippers
2026-06-08 13:04     ` Jesper Dangaard Brouer
2026-06-08 13:13       ` Jonas Köppeler
2026-06-08 14:21         ` Simon Schippers
2026-06-09 13:59           ` Jesper Dangaard Brouer
2026-06-09 15:08             ` Simon Schippers
2026-06-10  7:04               ` Simon Schippers
2026-06-10 10:15                 ` Jesper Dangaard Brouer
2026-06-10 12:00                   ` Simon Schippers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7b75ef64-fd7d-4323-a323-e520947ffbfa@tu-berlin.de \
    --to=j.koeppeler@tu-berlin.de \
    --cc=andrew+netdev@lunn.ch \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox