Date: Tue, 6 May 2025 11:20:12 -0700
From: Jakub Kicinski
To: David Howells
Cc: Andrew Lunn, Eric Dumazet, "David S. Miller", David Hildenbrand,
 John Hubbard, Christoph Hellwig, willy@infradead.org,
 netdev@vger.kernel.org, linux-mm@kvack.org, Willem de Bruijn
Subject: Re: Reorganising how the networking layer handles memory
Message-ID: <20250506112012.5779d652@kernel.org>
In-Reply-To: <1216273.1746539449@warthog.procyon.org.uk>
References: <20250505131446.7448e9bf@kernel.org>
 <165f5d5b-34f2-40de-b0ec-8c1ca36babe8@lunn.ch>
 <0aa1b4a2-47b2-40a4-ae14-ce2dd457a1f7@lunn.ch>
 <1015189.1746187621@warthog.procyon.org.uk>
 <1021352.1746193306@warthog.procyon.org.uk>
 <1069540.1746202908@warthog.procyon.org.uk>
 <1216273.1746539449@warthog.procyon.org.uk>

On Tue, 06 May 2025 14:50:49 +0100 David Howells wrote:
> Jakub Kicinski wrote:
> > >  (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > >      it doesn't use page pinning.  It needs to use the GUP routines.
> >
> > We end up calling iov_iter_get_pages2(). Is not setting
> > FOLL_PIN a conscious choice, or has nobody cared until now?
>
> iov_iter_get_pages*() predates GUP, I think.  There's now an
> iov_iter_extract_pages() that does the pinning stuff, but you have to do a
> different cleanup, which is why I created a new API call.
>
> iov_iter_extract_pages() also does no pinning at all on pages extracted from a
> non-user iterator (e.g. ITER_BVEC).

FWIW it occurred to me after hitting send that we may not care.
We're talking about Tx, so the user pages are read only for the kernel.
I don't think we have the "child gets the read data" problem?

> > >  (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> > >      used with certain memory types (e.g. slab).  It takes a ref on whatever
> > >      it is given - which is wrong if it should pin this instead.
> >
> > s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
> > a ref to the stack, right? But yes, the networking stack will try to
> > release it.
>
> I mean 'takes' as in skb_append_pagefrags() calls get_page() - something that
> needs to be changed.
>
> Christoph Hellwig would like to make it such that the extractor gets
> {phyaddr,len} rather than {page,off,len} - so all you, the network layer, see
> is that you've got a span of memory to use as your buffer.  How that span of
> memory is managed is the responsibility of whoever called sendmsg() - and they
> need a callback to be able to handle that.

Sure, there may be things to iron out as data in networking is not
opaque. We need to handle the firewalling and inspection cases.
Likely all this will work well for ZC but not sure if we can "convert"
the stack to phyaddr+len.

> > TAL at struct ubuf_info
>
> I've looked at it, yes, however, I'm wondering if we can make it more generic
> and usable by regular file DIO and splice also.

Okay, just keep in mind that we are working on 800Gbps NIC support these
days, and MTU does not grow. So whatever we do - it must be fast fast.

> Further, we need a way to track pages we've pinned.  One way to do that is to
> simply rely on the sk_buff fragment array and keep track of which particular
> bits need putting/unpinning/freeing/kfreeing/etc - but really that should be
> handled by the caller unless it costs too much performance (which it might).
>
> One advantage of delegating it to the caller, though, and having the caller
> keep track of which bits it still needs to hold on to by transmission
> completion position is that we don't need to manage refs/pins across sk_buff
> duplication - let alone what we should do with stuff that's kmalloc'd.
>
> > >  (3) We also pass an optional 'refill' function to sendmsg.  As data is
> > >      sent, the code that extracts the data will call this to pin more user
> > >      bufs (we don't necessarily want to pin everything up front).  The
> > >      refill function is permitted to sleep to allow the amount of pinned
> > >      memory to subside.
> >
> > Why not feed the data as you get the notifications for completion?
>
> Because there are multiple factors that govern the size of the chunks in which
> the refilling is done:
>
>  (1) We want to get user pages in batches to reduce the cost of the
>      synchronisation MM has to do.  Further, the individual spans in the
>      batches will be of variable size (folios can be of different sizes, for
>      example).  The idea of the 'refill' is that we go and refill as each
>      batch is transcribed into skbuffs.
>
>  (2) We don't want to run extraction too far ahead as that will delay the
>      onset of transmission.
>
>  (3) We don't want to pin too much at any one time as that builds memory
>      pressure and in the worst case will cause OOM conditions.
>
> So we need to balance things - particularly (1) and (2) - and accept that we
> may get multiple refills in order to fill the socket transmission buffer.

Hard to comment without a concrete workload at hand.
Ideally the interface would be good enough for the application
to dependably drive the transmission in an efficient way.
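
To make the balance between batch size, extraction run-ahead and the
pinning cap a bit more concrete, here is a minimal userspace sketch in C.
It is only a model under stated assumptions: struct tx_model,
model_refill(), model_complete() and model_run() are hypothetical names
and not kernel APIs, "pinning" is plain byte counting, and a refill that
would exceed the cap simply waits for one completion instead of sleeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct tx_model {
	size_t total;      /* bytes the caller wants to send */
	size_t extracted;  /* bytes pinned so far (cumulative) */
	size_t completed;  /* bytes whose transmission has completed */
	size_t batch;      /* bytes pinned per refill call */
	size_t pin_limit;  /* cap on bytes pinned at any one time */
	size_t max_pinned; /* high-water mark of pinned bytes */
	unsigned refills;  /* refill calls that actually pinned something */
};

static size_t min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Pin at most one batch without exceeding pin_limit outstanding bytes.
 * Returns the number of bytes pinned; 0 means "wait for completions".
 */
static size_t model_refill(struct tx_model *m)
{
	size_t pinned = m->extracted - m->completed;
	size_t room = m->pin_limit - pinned;
	size_t left = m->total - m->extracted;
	size_t n = min_sz(min_sz(m->batch, room), left);

	if (n == 0)
		return 0;

	m->extracted += n;
	m->refills++;
	pinned += n;
	if (pinned > m->max_pinned)
		m->max_pinned = pinned;
	return n;
}

/* Transmission completion: unpin up to n of the outstanding bytes. */
static void model_complete(struct tx_model *m, size_t n)
{
	m->completed += min_sz(n, m->extracted - m->completed);
}

/*
 * Drive the transfer: refill while there is room under the cap, otherwise
 * stand in for "sleep until pinned memory subsides" with one completion.
 */
static void model_run(struct tx_model *m, size_t completion_chunk)
{
	assert(m->batch > 0 && m->pin_limit > 0 && completion_chunk > 0);

	while (m->completed < m->total) {
		if (model_refill(m) == 0)
			model_complete(m, completion_chunk);
	}
}
```

In this model, sending 1000 bytes with batch = 64 and pin_limit = 256
while completions drain 128 bytes at a time takes 16 refill calls and
never has more than 256 bytes pinned. A larger batch cuts the refill
count (point 1) but lets extraction run further ahead of transmission
(point 2), and pin_limit bounds the memory pressure (point 3).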