Date: Tue, 16 Jun 2026 11:28:13 +0100
From: Pedro Falcato
To: Luigi Rizzo
Cc: rizzo.unipi@gmail.com, m.szyprowski@samsung.com, robin.murphy@arm.com,
 willemb@google.com, kuniyu@google.com, davem@davemloft.net,
 edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
 gregkh@linuxfoundation.org, rafael@kernel.org, akpm@linux-foundation.org,
 david@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org,
 iommu@lists.linux.dev, driver-core@lists.linux.dev,
 linux-kernel@vger.kernel.org, Jesper Dangaard Brouer, Ilias Apalodimas
Subject: Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
References: <20260615234220.3946885-1-lrizzo@google.com>
X-Mailing-List: netdev@vger.kernel.org

On Tue, Jun 16, 2026 at 11:48:36AM +0200, Luigi Rizzo wrote:
> On Tue, Jun 16, 2026 at 11:20 AM Pedro Falcato wrote:
> >
> > (+cc page pool maintainers)
> > On Mon, Jun 15, 2026 at 11:42:20PM +0000, Luigi Rizzo wrote:
> > > The use of swiotlb causes an extra data copy on I/O. For tx sockets,
> > > especially with greedy senders, this has a high chance of happening in
> > > the softirq handler for tx network interrupts, creating a significant
> > > performance bottleneck.
> > >
> > > Allow tx sockets to allocate socket buffers directly from the bounce
> > > buffers. This avoids the second copy and removes the above bottleneck.
> > > The fraction of swiotlb buffers allowed for this feature is set with
> > > /sys/module/swiotlb/parameters/zerocopy_tx_percent
> > > (0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
> > >
> > > Implementation:
> > > - define a new page type to unambiguously identify bounce buffers used
> > >   as backing storage for socket buffers
> > > - modify skb_page_frag_refill to perform the modified allocation
> > > - modify the destructors __free_frozen_pages(), free_unref_folio() to
> > >   handle those pages and return them to the pool.
> > >
> > > The savings are especially visible with fewer queues. In synthetic
> > > benchmarks, senders with 1-2 queues would cap around 50Gbps with
> > > conventional swiotlb, and reach over 170Gbps with the feature enabled.
> >
> > I could be wrong, but I genuinely think that the way to go about this is
> > using page_pool for regular TX as well. page_pool pages are all dma-mapped
> > (so whatever swiotlb optimization you want can be done there), and the net
> > stack already has awareness of these special pages and special skbs, so it
> > won't Just Return Them back to the page allocator.
>
> I am not sure I follow your comment above, can you expand/clarify?
>
> The problem I am dealing with is that the copy from the socket buffer
> to the bounce buffer is done in the device xmit function. Under high
> load it is almost always done by the tx softirq.
> This means that even if we move the copy outside the HARD_TX_LOCK(),
> it would still be almost completely serialized.
> Hence the proposed method to make skb_page_frag_refill() allocate
> directly a bounce buffer (under specific conditions) so there is a single copy
> done directly to the dma-able buffer, and it is done in the user threads/CPUs
> and is not serialized in the softirq thread.
>
> I am not sure how page_pool on tx could help here.

Page pool would provide both the means of passing around an iommu-mapped
page, and a concrete "this is where we allocate these pages" spot.
Then introducing a "zero-copy" swiotlb allocation would be a simple matter of
introducing this on page pool's side.
In pseudo-code, something like:

static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
						 gfp_t gfp)
{
	struct page *page;

	gfp |= __GFP_COMP;
	if (pool->dma_map && /* is_swiotlb */) {
		page = swiotlb_alloc_pages(pool->p.nid, gfp, pool->p.order, ...);
		if (!page)
			return NULL;
		/* page is implicitly swiotlb mapped (well, _actually_ it's
		 * not that simple, because of the dma_mapped tracking that
		 * was introduced, but PoC anyway..).
		 */
	} else {
		page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
		if (unlikely(!page))
			return NULL;

		if (pool->dma_map &&
		    unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) {
			put_page(page);
			return NULL;
		}
	}

	return page;
}

(plus other spots, obviously). No copying should be required, and the netmem
desc will keep the dma_addr around. The network stack will notice pp_recycle
on all of these skbs and simply refuse to throw the pages away to the page
allocator.

In any case, it might be that this is not feasible for XYZ reasons, but I've
thought about this (making net use and reuse page pool pre-iommu-mapped pages
exclusively) for a while and I definitely see a lot of similarities with your
problem (that more or less reduces down to "I want to get an iommu-mapped page
from the get-go").

-- 
Pedro
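
The pseudo-code above only covers the allocation side. As a rough
illustration of the release side (pages marked for recycling go back to the
pool with their DMA address intact instead of going to the page allocator),
here is a minimal userspace sketch. Everything in it is hypothetical: the
toy_page/toy_pool types and the toy_* helpers are made up for illustration and
are not kernel APIs, and the real page_pool recycling path is considerably
more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_POOL_PAGES 4
#define TOY_PAGE_SIZE  4096UL

/* Hypothetical stand-in for a page that is DMA-mapped once, at pool setup. */
struct toy_page {
	unsigned long dma_addr;	/* stays with the page across reuse */
	bool pp_recycle;	/* true: page belongs to a pool */
	bool in_use;
};

struct toy_pool {
	struct toy_page pages[TOY_POOL_PAGES];
	size_t free_count;
};

/* "Map" every page up front; dma_base is an arbitrary illustrative value. */
static void toy_pool_init(struct toy_pool *pool, unsigned long dma_base)
{
	for (size_t i = 0; i < TOY_POOL_PAGES; i++) {
		pool->pages[i].dma_addr = dma_base + i * TOY_PAGE_SIZE;
		pool->pages[i].pp_recycle = true;
		pool->pages[i].in_use = false;
	}
	pool->free_count = TOY_POOL_PAGES;
}

/* Hand out an already-mapped page; no mapping or copying happens here. */
static struct toy_page *toy_pool_alloc(struct toy_pool *pool)
{
	for (size_t i = 0; i < TOY_POOL_PAGES; i++) {
		if (!pool->pages[i].in_use) {
			pool->pages[i].in_use = true;
			pool->free_count--;
			return &pool->pages[i];
		}
	}
	return NULL;	/* pool exhausted */
}

/*
 * Release path: a pool-owned page is recycled and keeps its dma_addr.
 * Returns false for any other page, meaning the caller would have to
 * free it through the regular allocator.
 */
static bool toy_page_release(struct toy_pool *pool, struct toy_page *page)
{
	if (!page->pp_recycle)
		return false;
	page->in_use = false;
	pool->free_count++;
	return true;
}

/* One alloc/release round trip; returns the free count afterwards. */
size_t toy_demo(void)
{
	struct toy_pool pool;
	struct toy_page *page;

	toy_pool_init(&pool, 0x10000UL);
	page = toy_pool_alloc(&pool);
	assert(page != NULL);
	assert(toy_page_release(&pool, page));
	return pool.free_count;
}
```

The point of the sketch is only that the mapping cost is paid once in
toy_pool_init(), and that the release path decides between "recycle" and
"free" from a flag carried by the page itself.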