From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <41023c34-87a3-4e4f-b3ab-3ed53d171910@schippers-hamm.de>
Date: Mon, 11 May 2026 11:55:46 +0200
X-Mailing-List: netdev@vger.kernel.org
Subject: Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits
 (BQL) for latency reduction
To: Jesper Dangaard Brouer, Jakub Kicinski
Cc: Paolo Abeni, netdev@vger.kernel.org, kernel-team@cloudflare.com,
 Andrew Lunn,
 "David S. Miller", Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
 John Fastabend, Stanislav Fomichev, linux-kernel@vger.kernel.org,
 bpf@vger.kernel.org
References: <20260505132159.241305-1-hawk@kernel.org>
 <20260505132159.241305-4-hawk@kernel.org>
 <8f2f7f2e-6aa2-4e5b-b52d-0025b2525579@redhat.com>
 <6a597dbd-70bf-4b14-b495-2f7248fd3220@kernel.org>
 <20260508190626.4285fac0@kernel.org>
 <20260510085602.57c7a081@kernel.org>
From: Simon Schippers <simon@schippers-hamm.de>
Content-Type: text/plain; charset=UTF-8

On 5/11/26 10:11, Jesper Dangaard Brouer wrote:
>
> On 10/05/2026 17.56, Jakub Kicinski wrote:
>> On Sat, 9 May 2026 11:09:51 +0200 Jesper Dangaard Brouer wrote:
>>> On 09/05/2026 04.06, Jakub Kicinski wrote:
>>>> On Thu, 7 May 2026 21:09:09 +0200 Jesper Dangaard Brouer wrote:
>>>>> Not against being able to modify VETH_RING_SIZE, but I don't think
>>>>> it is the solution here.
>>>>
>>>> Was it evaluated, tho?
>>>>
>>>> It's obviously super easy these days to have AI spew no end of
>>>> complex code. So it'd be great to have some solid, ideally
>>>> production-like data to back this all up.
>>>>
>>>> VETH_RING_SIZE seems trivial, ethtool set ringparam
>>>
>>> No, unfortunately we cannot just decrease the VETH_RING_SIZE.
>>
>> To be clear - I said make it configurable with ethtool -G,
>> not change the default.
>>
>
> Sure, I understand the desire to make VETH_RING_SIZE configurable.
> But in doing so we make the Linux network stack harder to tune and set
> up correctly. E.g. adding a qdisc to veth would also require changing
> the ring size, but if the system also uses XDP, then tuning below 64
> (likely 128) will lead to hard-to-find packet drops.

I mean, 64 could still be at least a 4x improvement.

>
> I prefer adding something (like BQL) that auto-tunes how much of the
> ring queue we are using. Good queues function as shock absorbers when
> concurrent processes in the OS have scheduling noise.
>
> I acknowledge that Simon Schippers found that the BQL implementation
> was actually not auto-tuning. We need to work on this; my prototype
> implementation [1] [2] works surprisingly well.
>
> - [1] https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/2-09-veth-time-based-bql-coalescing.patch
> - [2] https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/
>
>>> The reason is that XDP-redirect into veth doesn't have any
>>> back-pressure and would simply drop packets if the queue size
>>> becomes less than the NAPI budget (64). (Yes, we use both the normal
>>> path and XDP-redirect in production.)
>>
>> Doesn't this mean you have a queue which is not under BQL control?
>>
>
> It is a matter of perspective. BQL needs between 17-55 elements of the
> 256-entry queue. At the same time we handle the case where the ring
> runs full, e.g. due to a sudden burst of XDP-redirected packets, which
> pushes packets into the qdisc layer.

You are checking inflight/limit in the /sys directory to get the 17-55
numbers, right? I think those elements are not really in the queue.

As written before: the weird thing in this implementation is that BQL's
inflight != the number of packets in the ring, and BQL's limit != the
"current ring size". Instead, the BQL limit describes the maximum
number of packets allowed between calls of netdev_sent_queue(). And in
our case we do not complete (in our case: forward) the packets when
calling netdev_sent_queue(), but instead immediately, and therefore
they are not in the queue anymore when netdev_sent_queue() is called.

That also means the number strongly depends on the
VETH_BQL_COAL_TX_USECS parameter. For a fixed PPS the limit should be
approximately:

  Limit = VETH_BQL_COAL_TX_USECS * PPS

Assuming the default 10us coalescing and a fixed 1 MPPS:

  Limit = 0.00001 s * 1,000,000 pps = 10 packets

Can you follow my theory? (See the sketch at the bottom of this mail.)

Judging from that, I personally think VETH_BQL_COAL_TX_USECS needs to
be bigger, more like 100us or 1ms. With 10us the BQL limit gets
adjusted very often, I think.

Thanks.

>>> My benchmarking shows that an optimal BQL limit is dynamically
>>> adjusted between 17-55 depending on veth consumer namespace
>>> overhead/speed, when balancing throughput and latency.
>>
>> Testing with a prod-approximating traffic pattern and load would be
>> great.
>
> That is what I'm doing. I'm testing with a prod-approximating traffic
> pattern and changing the number of iptables rules to simulate the
> overhead I measured from production.
> I think I explained this in the
> cover letter. We are going to use this in a production environment (to
> be clear).
>
> Simon found an issue testing the overload scenario.
>
> --Jesper
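
For reference, here is a minimal sketch of the two accounting patterns I
am comparing above. This is not the actual prototype from [1]; the
nic_*/veth_* function names and the veth_coal_window_expired() helper
are made up for illustration, and only netdev_sent_queue() /
netdev_completed_queue() are the real BQL API:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Classic NIC driver: a packet stays "inflight" from ndo_start_xmit()
 * until TX completion, so BQL's inflight == bytes still sitting in the
 * hardware ring, and the limit tracks real ring occupancy. */
static netdev_tx_t nic_start_xmit(struct sk_buff *skb,
				  struct net_device *dev)
{
	netdev_sent_queue(dev, skb->len);	/* inflight += skb->len */
	/* ... post skb to the hardware TX ring ... */
	return NETDEV_TX_OK;
}

static void nic_tx_clean(struct net_device *dev,
			 unsigned int pkts, unsigned int bytes)
{
	/* Runs much later, from TX NAPI, after the HW consumed the
	 * descriptors: inflight -= bytes, limit gets re-tuned. */
	netdev_completed_queue(dev, pkts, bytes);
}

/* veth with time-based coalescing (one plausible shape of the idea in
 * [1]): the peer consumes the packet right away, so completing
 * per-packet would keep inflight at ~0 and BQL would have nothing to
 * tune. Batching the completions per coalescing window instead makes
 * inflight approximate the bytes sent during one window, i.e.
 * rate * VETH_BQL_COAL_TX_USECS, not ring occupancy. */
static void veth_xmit_and_coalesce(struct sk_buff *skb,
				   struct net_device *dev,
				   unsigned int *pend_pkts,
				   unsigned int *pend_bytes)
{
	netdev_sent_queue(dev, skb->len);
	*pend_pkts += 1;
	*pend_bytes += skb->len;
	/* ... hand skb to the peer immediately ... */

	if (veth_coal_window_expired(dev)) {	/* hypothetical helper */
		netdev_completed_queue(dev, *pend_pkts, *pend_bytes);
		*pend_pkts = *pend_bytes = 0;
	}
}

With the default 10us window at 1 MPPS, roughly 10 packets complete per
window, matching the limit of ~10 packets from the formula above.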