Date: Mon, 5 May 2025 13:14:46 -0700
From: Jakub Kicinski
To: David Howells
Cc: Andrew Lunn, Eric Dumazet, "David S. Miller", David Hildenbrand,
 John Hubbard, Christoph Hellwig, willy@infradead.org,
 netdev@vger.kernel.org, linux-mm@kvack.org, Willem de Bruijn
Subject: Re: Reorganising how the networking layer handles memory
Message-ID: <20250505131446.7448e9bf@kernel.org>
In-Reply-To: <1069540.1746202908@warthog.procyon.org.uk>
References: <165f5d5b-34f2-40de-b0ec-8c1ca36babe8@lunn.ch>
 <0aa1b4a2-47b2-40a4-ae14-ce2dd457a1f7@lunn.ch>
 <1015189.1746187621@warthog.procyon.org.uk>
 <1021352.1746193306@warthog.procyon.org.uk>
 <1069540.1746202908@warthog.procyon.org.uk>

On Fri, 02 May 2025 17:21:48 +0100 David Howells wrote:
> Okay, perhaps I should start at the beginning :-).

Thanks :) Looks like Eric is CCed, also adding Willem.
The direction of using ubuf_info makes sense to me.
Random comments below on the little I know.

> There are a number of things that are going to mandate an overhaul of how the
> networking layer handles memory:
>
>  (1) The sk_buff code assumes it can take refs on pages it is given, but the
>      page ref counter is going to go away in the relatively near term.
>
>      Indeed, you're already not allowed to take a ref on, say, slab memory,
>      because the page ref doesn't control the lifetime of the object.
>
>      Even pages are going to kind of go away.  Willy haz planz...

I think the part NVMe folks run into is the BPF integration layer
called skmsg. It's supposed to be a BPF-based "data router" at
the socket layer, before any protocol processing, so it tries
to do its own page ref accounting.

>  (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because it
>      doesn't use page pinning.  It needs to use the GUP routines.

We end up calling iov_iter_get_pages2(). Is not setting FOLL_PIN
a conscious choice, or has nobody cared until now?

>  (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
>      used with certain memory types (e.g. slab).  It takes a ref on whatever
>      it is given - which is wrong if it should pin this instead.

s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
a ref to the stack, right?
But yes, the networking stack will try to release it.

>  (4) iov_iter extraction will probably change to dispensing {physaddr,len}
>      tuples rather than {page,off,len} tuples.  The socket layer won't then
>      see pages at all.
>
>  (5) Memory segments splice()'d into a socket may have who-knows-what weird
>      lifetime requirements.
>
> So after discussions at LSF/MM, what I'm proposing is this:
>
>  (1) If we want to use zerocopy we (the kernel) have to pass a cleanup
>      function to sendmsg() along with the data.  If you don't care about
>      zerocopy, it will copy the data.

Take a look at struct ubuf_info.

>  (2) For each message sent with sendmsg, the cleanup function is called
>      progressively as parts of the data it included are completed.  I would do
>      it progressively so that big messages can be handled.
>
>  (3) We also pass an optional 'refill' function to sendmsg.  As data is sent,
>      the code that extracts the data will call this to pin more user bufs (we
>      don't necessarily want to pin everything up front).  The refill function
>      is permitted to sleep to allow the amount of pinned memory to subside.

Why not feed the data as you get the notifications for completion?

>  (4) We move a lot of the zerocopy wrangling code out of the basement of the
>      networking code and put it at the system call level, above the call to
>      ->sendmsg() and the basement code then calls the appropriate functions to
>      extract, refill and clean up.  It may be usable in other contexts too -
>      DIO to regular files, for example.
>
>  (5) The SO_EE_ORIGIN_ZEROCOPY completion notifications are then generated by
>      the cleanup function.

Already the case? :)

>  (6) The sk_buff struct does not retain *any* refs/pins on memory fragments it
>      refers to.  This is done for it by the zerocopy layer.
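
For illustration only, the "cleanup function called progressively" idea
from proposal item (2) could look roughly like the userspace sketch below.
Every name in it (struct zc_buf, zc_complete(), unpin_range()) is
hypothetical and made up for this example; it is not struct ubuf_info and
not any existing kernel interface, just a model of a sender-owned buffer
whose owner is told about each completed range so it can release memory
piecemeal.

```c
#include <stddef.h>

/* Hypothetical descriptor for a buffer lent to the send path.
 * This is NOT the kernel's struct ubuf_info. */
struct zc_buf {
	size_t len;       /* total bytes handed to the send path */
	size_t completed; /* bytes the transport has finished with */
	/* Owner-supplied callback, invoked once per completed range. */
	void (*cleanup)(struct zc_buf *zb, size_t off, size_t n);
	size_t unpinned;  /* bytes the owner has released so far */
};

/* Example owner callback: pretend to unpin the completed range. */
static void unpin_range(struct zc_buf *zb, size_t off, size_t n)
{
	(void)off; /* a real owner would unpin [off, off + n) here */
	zb->unpinned += n;
}

/* Transport side: report that n more bytes are done.
 * Returns 1 once the whole message has completed, 0 otherwise. */
static int zc_complete(struct zc_buf *zb, size_t n)
{
	size_t outstanding = zb->len - zb->completed;

	if (n > outstanding)
		n = outstanding; /* clamp to what is still in flight */
	if (n) {
		zb->cleanup(zb, zb->completed, n);
		zb->completed += n;
	}
	return zb->completed == zb->len;
}
```

With a 10000 byte message completed in 4096 byte steps, the callback fires
three times and the last call only covers the remaining 1808 bytes, so the
stack itself never needs to hold a ref or pin on the memory.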