Date: Fri, 8 May 2026 11:20:05 +0200
Subject: Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
From: Simon Schippers
To: Jesper Dangaard Brouer, Paolo Abeni, netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com, Andrew Lunn, "David S.
Miller" , Eric Dumazet , Jakub Kicinski , Alexei Starovoitov , Daniel Borkmann , John Fastabend , Stanislav Fomichev , linux-kernel@vger.kernel.org, bpf@vger.kernel.org References: <20260505132159.241305-1-hawk@kernel.org> <20260505132159.241305-4-hawk@kernel.org> <8f2f7f2e-6aa2-4e5b-b52d-0025b2525579@redhat.com> <6a597dbd-70bf-4b14-b495-2f7248fd3220@kernel.org> <68223314-1a44-4aee-8207-57437ef9f3ab@schippers-hamm.de> <3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org> <21d639fc-e244-486e-8368-8891b3c43215@schippers-hamm.de> Content-Language: en-US In-Reply-To: <21d639fc-e244-486e-8368-8891b3c43215@schippers-hamm.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Provags-ID: V03:K1:WAXVRmfQ06nU85T0VZPtSVGg7wsnYtNUbsSf7t56ghLd8asTuKq SHuNw52IRNqz3QxBsJ1+wtMtPSug/zIsUmutNan0DvKqWhoRn0dwq6D7x3NK1mNz8CVBvoT iesOpBf/cEXVEU2MSByE9rKQ0xm2dBxDKG6mRl18aPEz92FHnX0cvrP9+jz5cxbA7hFn5Wh eW0UoTTiwT9qFWWN3hjeA== X-Spam-Flag: NO UI-OutboundReport: notjunk:1;M01:P0:/xrhJ6c1wLo=;wy0A1q7QtPHjyGzaRqW8oZlrR3P IuNKLzoXdV0icMgQ/ftGnwlA8LnyBDNgSEzyXJ2F/o/nd+cWFTtG7C9dVoBtpAFFYsCeJ86et nVjcbRlfXlrnHD4qtlLCiVTooXWd/6LDGe3pCpc7Jtr7fECatrwYo2JiCX/ty+8rS/1IJU/eu uYKbBk9/bGEmcuUK8W8TQaw/zula27G1+pg3K9ZtB33Q1xR9k/Q40+DoCgWAX5UJvCvU1sVR1 ODQ1H1Pd+mrrP+ms7Kkck8qXzKgQpmGoTHSF4JpwUP2IMe++dIweMOq2IYZyXHdqInIg4ZRWv Dv0GN+9jzVeyrcWjF066HOFh0Lw3U3ITStp1nVFNNwajPAYVieftZzMGxzfKGSZZAIictUIyR U6YK2D/14ktEveZFYOrgBBR8N6EUzD2OK9WHPiOD+WerSi6OQ15NYIlJDchT6AyS3eTm6Y2Tx boPbqmSe3GshE2N9bso5mrrvNxwWbpilqd9pDs8Ex19fi7O5wg7+VzZAu0iCZEdLL5eINyUMH jLOdXcZbnbr7CcFWHpjf+W6n/l9fMNkXo/0xiIvoaupctmyaDpEjXcXRh39P9AjU4ix5uhkFy xHaFP18xy7DQ6nqXFSLehstPixOM74tAjjfC+lgtL7r8gS4agRIiD0ymvh0xN1/83Kr57l1+9 ZBRroNRpIxvrDErMMY2ZVbpSZXUAf4eAPpChqOvlXfBnjJcVN4r7m1/zU7HlDLfEC9KcBfRHa /bgJaDMfhuQnGWZXorWKT8f3MxHa5BEyeIyR/8em4ev/KDWUTqzoAwcLQXWhCNHmAkle+oL5R qAJSRD23Fe6hRmg+zuNTzg2ZpjCEZIR20Y/TNYwfM87Ol0Fq0AuGvhUxc5cNP3OfUt28iS7H/ ZLk+mcCONuo/Ixr7WqSiyRD87ACl0+h8TTdoguC8fbiV6fXxHZu/mEFfsunn8Ryykoof/c4Mc kKJsc+EzaLpEvzXMwNGYmoF1T8um6Sg6XhV8DcIl5XCJ1Rm7qJAEj+ivrJi0lE6pzG9SIjpfX WpdXFVpJnepQ7KHAXzKZK0zhkkfY/GcZpLefl8IxCU3hLPX+WezLv/Bx6sbKQLex4fQe8HVaf PaHaggDEP2COC+dUMhAtbHGlw0yeMMnlAkJbmECqWEhRP5BStI8QvMKz+EIcsxFLpBZtoXI0S E7nCmpxRn3QEHuPnZlcvO4th7gUvWUu12Lv9fPViGEYnu11g1EUiBlunpazeerxnfkKQTZu1M GzHJaqq/HfAH9yPQJo5mGl8QhZDY9meUzVva/vWd6H3rvvxGR0av8I5c7BVoe/5ffWb7TBTXd 2+pBpxx+z/Gc2N4BYJnaYcq17XTiNkJR/ECYC7HEo7VjStFlqd7S6k6Zd/ChIZ8Ntb0sLtU+k W8ZQXjJABO52VO3zSOm3Mw3xRxc6nHvLInjG9kj/bgSC2iJqmkcJV4PYRmxg9QAQr4c4BG2M5 pwHJSZKRI8FFbPP3UXPO5GJUJIMSstMCC7Z8Z+zvezY1ZExETaP50rk0sa5RNGi6uQx6Z/oFC WB3O0FfUW+LwjukHsn3WB9v9Ycoo8uK4eeO0tocEe68Tad2R0x92ks9dZF6PCGlNg/WspPuse z7/sQWCW655GLKD4RhmS1UFo9c6jpjUO6A+bA7nbSn3o2KH9jClgiyziIYvFlwh43t6yqn+E1 ACxy2TK5/NNZoKYmR7tOzYGU2hudCxuZFUjOUd2AmipS8IyPz51kJC8Xtrqt64Hj64PL8g73r wOxdXq4nb+kX5Pq7/fADV/kN2W7AWPhoRNiKZbwU1zXiXk6xprrhfq34kV0XOqvPVAOuXr8LN r44ikSAmk+3ND30Ntj2TaEjDSttwCLHd92eZPXVJ91fgkS0ZS/43oiRPrRl7O6NDKI7ydIMSj qSk5Hpp2FF6x11sqPz640x4kTiHjldyXD5yzO6sG1mIMnUg4+/UvE5Po/RJiqiVFD8oTfCp+M jCiDvSzGJvVOa3dnvObnlrA9CpYulxGxrSM2BCp1d82ss/lVMW5pHzZU++cni6P/5/toEpd9z jZVb1cyBMPPd8gNWc3uTP3x6f4XHpx1kotLvPj0cdaWeUTaXkTCcBVH7xT54j+OSX3ibalmqV fovKqiTsxbs= On 5/8/26 10:01, Simon Schippers wrote: > On 5/7/26 22:45, Jesper Dangaard Brouer wrote: >> >> >> On 07/05/2026 22.12, Simon Schippers wrote: >>> On 5/7/26 21:09, Jesper Dangaard Brouer wrote: >>>> >>>> >>>> On 07/05/2026 16.46, Simon Schippers wrote: >>>>> >>>>> >>>>> On 5/7/26 16:34, Paolo Abeni wrote: >>>>>> On 5/7/26 8:54 AM, Simon Schippers wrote: 
>>>>>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>>>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>>>>>  			}
>>>>>>>>  		} else {
>>>>>>>>  			/* ndo_start_xmit */
>>>>>>>> -			struct sk_buff *skb = ptr;
>>>>>>>> +			bool bql_charged = veth_ptr_is_bql(ptr);
>>>>>>>> +			struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>>>>>  			stats->xdp_bytes += skb->len;
>>>>>>>> +			if (peer_txq && bql_charged)
>>>>>>>> +				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>>>>>
>>>>>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>>>>>> this doesn't work.
>>>>>>>
>>>>
>>>> I've experimented with doing the "completion" at NAPI-end in
>>>> veth_poll(), but that resulted in the BQL limit being 128 packets,
>>>> which leads to bad latency results (not acceptable).
>>>> (See the detailed report later.)
>>>>
>>>>
>>>>>>> I still think that adding an option to modify the hard-coded
>>>>>>> VETH_RING_SIZE is the way to go.
>>>>>>>
>>>>
>>>> I'm not against being able to modify VETH_RING_SIZE, but I don't think
>>>> it is the solution here.
>>>>
>>>> The simple solution is to configure the BQL limit_min:
>>>> `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
>>>>
>>>> My experiments (below) find that limit_min=8 gives good performance.
>>>> We can simply set the default to 8, as this still allows userspace to
>>>> change it later if lower latency is preferred.
>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>>>>>
>>>>>> In the above discussion a 20% regression is reported, which IMHO can't
>>>>>> be ignored. Still, the tput figures in the data are extremely low;
>>>>>> something is possibly off?!? I would expect a few Mpps with pktgen on
>>>>>> top of veth, while the reported data is ~20-30 Kpps.
>>>>>>
>>>>>> /P
>>>>>>
>>>>>
>>>>> The ~20-30 Kpps occur when thousands of iptables rules are applied and
>>>>> a UDP userspace application is sending.
>>>>>
>>>>> And there is a 20% pktgen regression (no iptables rules applied).
>>>>>
>>>>
>>>> The pktgen test is a little dubious/weird, and Jonas had to modify
>>>> pktgen to test this. John Fastabend added a config option to pktgen
>>>> that allows us to benchmark the egress qdisc path; it might be better
>>>> to use that. See samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
>>>> for a usage demo.
>>>>
>>>> If redoing the tests, can you adjust limit_min to see the effect?
>>>> /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
>>>>
>>>> A 20% throughput regression is of course too much, but I will remind
>>>> us that adding a qdisc will "cost" some overhead; that is a
>>>> configuration choice. Our purpose here is to reduce bufferbloat and
>>>> latency, not to optimize for throughput.
>>>>
>>>>
>>>>> I am pretty sure the reason is that the BQL limit is stuck at 2
>>>>> packets (because the completed queue is always called with 1 packet,
>>>>> and not from an interrupt/timer with multiple packets...).
>>>>>
>>>>
>>>> I've run a lot of experiments, which I had AI write a report on; see
>>>> the attachment. The TL;DR is that the best performance vs. latency
>>>> tradeoff is defaulting the BQL/DQL limit_min to 8 packets.
>>>>
>>>> I fear this patchset will stall forever if we keep searching for a
>>>> perfect solution without any overhead. The qdisc layer will be a
>>>> baseline overhead. The limit of 2 packets is actually the optimal
>>>> darkbuffer queue size, but I acknowledge that it causes too many qdisc
>>>> requeue events (leading to overhead). I suggest adding another patch
>>>> in V6 that defaults limit_min to 8 (a separate patch, to make it
>>>> easier to revert/adjust later).
>>>>
>>>> I've talked with Jonas, and we want to experiment with different
>>>> solutions to make BQL/DQL work better with virtual devices.
>>>>
>>>> This patchset helps our (production) use-case reduce mice-flow latency
>>>> from approx. 22 ms to 1.3 ms under load. Since the consumer namespace
>>>> is the bottleneck, the requeue overhead is negligible in comparison.
>>>>
>>>> -Jesper
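
(For reference: defaulting limit_min to 8 packets, as suggested above,
should boil down to an init-time loop along the lines of the untested
sketch below. The function name and where veth would call it are made
up here; netdev_queue_set_dql_min_limit() is the existing helper and
VETH_BQL_UNIT is this patchset's per-packet byte unit.)

	/* Untested sketch: default every tx queue's BQL limit_min to
	 * 8 packets worth of bytes. Userspace can still override this
	 * via /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min.
	 */
	static void veth_set_default_bql_limit_min(struct net_device *dev)
	{
		unsigned int i;

		for (i = 0; i < dev->real_num_tx_queues; i++)
			netdev_queue_set_dql_min_limit(netdev_get_tx_queue(dev, i),
						       8 * VETH_BQL_UNIT);
	}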
>>>
>>> First of all, thanks for your work, and I really see the advantages of
>>> avoiding bufferbloat :)
>>>
>>> But the key of the BQL algorithm, which is the *dynamic* adaptation of
>>> the limit, is not working. Always calling netdev_completed_queue() with
>>> 1 packet results in a static limit of 2 packets (as seen in Jonas'
>>> measurements), which you force up to 8 packets.
>>>
>>> So in the end this patchset has the same effect as just setting
>>> VETH_RING_SIZE to 8 (and giving an option to change this value).
>>>
>>
>> I've coded up a time-based BQL implementation, see attachment.
>> WDYT?
>>
>> --Jesper
>>
>

Rethinking it, this could be fine, but it really needs testing, because
the weird thing is that BQL's inflight != the number of packets in the
ring and BQL's limit != the "current ring size". Instead, the BQL limit
describes the maximum number of packets allowed between calls of
netdev_sent_queue().

I messed up in my approach below. Forget it :P

> A step in the right direction, but I dislike that you call
> netdev_sent_queue() with at least 1 packet (never 0 packets).
> I am not sure if it works, and I am not sure about the parameter.
>
> I would propose doing it like other BQL implementations do
> (for example usbnet, for which I adapted BQL [1] :) ):
>
> Call netdev_sent_queue() with n_bql in a periodic work. n_bql would
> still be counted in veth_xdp_rcv() like you currently do (synchronized
> with the work via ring.consumer_lock?).
>
> The only weird thing that remains is that BQL's inflight != the number
> of packets in the ring and BQL's limit != the "current ring size".
> Instead, the BQL limit describes the maximum number of packets allowed
> between calls of netdev_sent_queue(), which occur periodically in a
> somewhat fixed time interval.
> I guess that could be fine, but it surely needs testing.
>
> [1] Link: https://lore.kernel.org/netdev/20251106175615.26948-1-simon.schippers@tu-dortmund.de/
>
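
P.S.: To make the periodic-work proposal above a bit more concrete,
below is a rough, untested sketch of the flush side. The n_bql counter
and the bql_work delayed work are hypothetical additions to struct
veth_rq, the 1 ms interval is a placeholder, and which device/queue the
accounting should target is exactly the open question in this thread;
only the mechanics are shown:

	static void veth_bql_work(struct work_struct *work)
	{
		struct veth_rq *rq = container_of(to_delayed_work(work),
						  struct veth_rq, bql_work);
		unsigned int n;

		/* Drain the counter that veth_xdp_rcv() increments per
		 * consumed packet, reusing the ptr_ring consumer lock
		 * for synchronization, as discussed above.
		 */
		spin_lock_bh(&rq->xdp_ring.consumer_lock);
		n = rq->n_bql;
		rq->n_bql = 0;
		spin_unlock_bh(&rq->xdp_ring.consumer_lock);

		if (n)
			netdev_sent_queue(rq->dev, n * VETH_BQL_UNIT);

		schedule_delayed_work(&rq->bql_work, msecs_to_jiffies(1));
	}

The work would be set up with INIT_DELAYED_WORK(&rq->bql_work,
veth_bql_work) when NAPI is enabled and stopped with
cancel_delayed_work_sync() on teardown.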