Date: Fri, 30 May 2025 08:50:10 -0700
From: Stanislav Fomichev
To: David Howells
Cc: Mina Almasry, willy@infradead.org, hch@infradead.org, Jakub Kicinski, Eric Dumazet, netdev@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: Device mem changes vs pinning/zerocopy changes
In-Reply-To: <770012.1748618092@warthog.procyon.org.uk>

On 05/30, David Howells wrote:
> Hi Mina,
>
> I've seen your transmission-side TCP devicemem stuff has just gone in and it
> conflicts somewhat with what I'm trying to do.  I think you're working on the
> problem bottom up and I'm working on it top down, so if you're willing to
> collaborate on it...?
>
> So, to summarise what we need to change (you may already know all of this):
>
>  (*) The refcount in struct page is going to go away.  The sk_buff fragment
>      wrangling code, however, occasionally decides to override the zerocopy
>      mode and grab refs on the pages pointed to by those fragments.  sk_buffs
>      *really* want those page refs - and it does simplify memory handling.
>      But.
>
>      Anyway, we need to stop taking refs where possible.  A fragment may in
>      future point to a sequence of pages and we would only be getting a ref on
>      one of them.
>
>  (*) Further, the page struct is intended to be slimmed down to a single typed
>      pointer if possible, so all the metadata in the net_iov struct will have
>      to be separately allocated.
>
>  (*) Currently, when performing MSG_ZEROCOPY, we just take refs on the user
>      pages specified by the iterator but we need to stop doing that.  We need
>      to call GUP to take a "pin" instead (and must not take any refs).  The
>      pages we get access to may be folio-type, anon-type, some sort of device
>      type.
>
>  (*) It would be good to do a batch lookup of user buffers to cut down on the
>      number of page table trawls we do - but, on the other hand, that might
>      generate more page faults upfront.
>
>  (*) Splice and vmsplice.  If only I could uninvent them...  Anyway, they give
>      us buffers from a pipe - but the buffers come with destructors and should
>      not have refs taken on the pages we might think they have, but use the
>      destructor instead.
>
>  (*) The intention is to change struct bio_vec to be just physical address and
>      length, with no page pointer.  You'd then use, say, kmap_local_phys() or
>      kmap_local_bvec() to access the contents from the cpu.  We could then
>      revert the fragment pointers to being bio_vecs.
>
>  (*) Kernel services, such as network filesystems, can't pass kmalloc()'d data
>      to sendmsg(MSG_SPLICE_PAGES) because slabs don't have refcounts and, in
>      any case, the object lifetime is not managed by refcount.  However, if we
>      had a destructor, this restriction could go away.
>
>
> So what I'd like to do is:

[..]

>  (1) Separate fragment lifetime management from sk_buff.  No more wangling of
>      refcounts in the skbuff code.  If you clone an skb, you stick an extra
>      ref on the lifetime management struct, not the page.

For device memory TCP we already have this: net_devmem_dmabuf_binding
is the owner of the frags. And when we reference an skb frag, we reference
only this owner, not the individual chunks: __skb_frag_ref -> get_netmem ->
net_devmem_get_net_iov (ref on the binding).

Will it be possible to generalize this to cover the MSG_ZEROCOPY and splice
cases? From what I can tell, this is somewhat equivalent to your net_txbuf.
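
To make the "reference the owner, not the chunk" idea concrete, here is a
rough userspace sketch. It is not the kernel code: frag_owner, frag_chunk,
owner_get/owner_put and frag_ref/frag_unref are hypothetical names that only
mimic the role of the binding and its net_iovs, and the plain counter stands
in for what would have to be an atomic refcount in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the owner of a set of frags (the role the
 * dmabuf binding plays above). One count covers the whole buffer. */
struct frag_owner {
	unsigned int refs;	/* plain counter; kernel code would need an atomic one */
	int released;		/* set once the last reference is dropped */
};

/* Hypothetical stand-in for one chunk. It has no refcount of its own,
 * only a back-pointer to the owner. */
struct frag_chunk {
	struct frag_owner *owner;
	size_t offset;
	size_t len;
};

static inline void owner_get(struct frag_owner *o)
{
	o->refs++;
}

static inline void owner_put(struct frag_owner *o)
{
	assert(o->refs > 0);
	if (--o->refs == 0)
		o->released = 1;	/* a real owner would unpin/unmap here */
}

/* Taking or dropping a reference on a frag only ever touches the owner. */
static inline void frag_ref(struct frag_chunk *c)
{
	owner_get(c->owner);
}

static inline void frag_unref(struct frag_chunk *c)
{
	owner_put(c->owner);
}
```

In this model, cloning an skb with N frags bumps the owner N times and leaves
the chunks untouched, so nothing depends on a per-page refcount. Whether
the same shape can carry a destructor for the splice case is the open
question.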