From: "Björn Töpel" <bjorn@kernel.org>
To: Stanislav Fomichev <stfomichev@gmail.com>,
Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org,
magnus.karlsson@intel.com, kuba@kernel.org, pabeni@redhat.com,
horms@kernel.org, larysa.zaremba@intel.com,
aleksander.lobakin@intel.com
Subject: Re: [PATCH net 1/6] xsk: respect tailroom for ZC setups
Date: Tue, 17 Mar 2026 10:19:25 +0100
Message-ID: <878qbqrexu.fsf@all.your.base.are.belong.to.us>
In-Reply-To: <abiJ2C_q62bn7EC3@mini-arch>

Stanislav Fomichev <stfomichev@gmail.com> writes:
> On 03/16, Maciej Fijalkowski wrote:
>> Multi-buffer XDP stores information about frags in skb_shared_info that
>> sits at the tailroom of a packet. The storage space is reserved via
>> xdp_data_hard_end():
>>
>> ((xdp)->data_hard_start + (xdp)->frame_sz - \
>> SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>>
>> and then we refer to it via macro below:
>>
>> static inline struct skb_shared_info *
>> xdp_get_shared_info_from_buff(const struct xdp_buff *xdp)
>> {
>> 	return (struct skb_shared_info *)xdp_data_hard_end(xdp);
>> }
>>
>> Currently we do not respect this tailroom space in multi-buffer AF_XDP
>> ZC scenario. To address this, introduce xsk_pool_get_tailroom() and use
>> it within xsk_pool_get_rx_frame_size() which is used in ZC drivers to
>> configure length of HW Rx buffer.
>>
>> xsk_pool_get_tailroom() is only reserving necessary space when pool is
>> zc and underlying netdev supports zc multi-buffer. Since this function
>> relies on pool->umem->zc setting, set it before ndo_bpf during zc
>> configuration, so that driver that actually calls
>> xsk_pool_get_rx_frame_size() inside ndo_bpf will get correct tailroom
>> value.
>>
>> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
>> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>> ---
>> include/net/xdp_sock_drv.h | 21 ++++++++++++++++++++-
>> net/xdp/xsk_buff_pool.c | 3 ++-
>> 2 files changed, 22 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
>> index 6b9ebae2dc95..13b2aae00737 100644
>> --- a/include/net/xdp_sock_drv.h
>> +++ b/include/net/xdp_sock_drv.h
>> @@ -41,6 +41,19 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
>> return XDP_PACKET_HEADROOM + pool->headroom;
>> }
>>
>> +static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
>> +{
>> +	struct xdp_umem *umem = pool->umem;
>> +
>> +	/* Reserve tailroom only for zero-copy pools that opted into
>> +	 * multi-buffer. The reserved area is used for skb_shared_info,
>> +	 * matching the XDP core's xdp_data_hard_end() layout.
>> +	 */
>> +	if (umem->zc && (umem->flags & XDP_UMEM_SG_FLAG))
>> +		return SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +	return 0;
>> +}
>> +
>> static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
>> {
>> return pool->chunk_size;
>> @@ -48,7 +61,8 @@ static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
>>
>>  static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool)
>>  {
>> -	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool);
>> +	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool) -
>> +	       xsk_pool_get_tailroom(pool);
>>  }
>>
>> static inline u32 xsk_pool_get_rx_frag_step(struct xsk_buff_pool *pool)
>> @@ -332,6 +346,11 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
>> return 0;
>> }
>
> [..]
>
>> +static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
>> +{
>> + return 0;
>> +}
>
> Not sure it's needed? xsk_pool_get_tailroom is only used by
> CONFIG_XDP_SOCKETS' version of xsk_pool_get_rx_frame_size.
>
>> static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
>> {
>> return 0;
>> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
>> index 37b7a68b89b3..2cfc19e363e3 100644
>> --- a/net/xdp/xsk_buff_pool.c
>> +++ b/net/xdp/xsk_buff_pool.c
>> @@ -213,6 +213,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>>  	bpf.command = XDP_SETUP_XSK_POOL;
>>  	bpf.xsk.pool = pool;
>>  	bpf.xsk.queue_id = queue_id;
>> +	pool->umem->zc = true;
>>
>>  	netdev_ops_assert_locked(netdev);
>>  	err = netdev->netdev_ops->ndo_bpf(netdev, &bpf);
>> @@ -224,13 +225,13 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>>  		err = -EINVAL;
>>  		goto err_unreg_xsk;
>>  	}
>> -	pool->umem->zc = true;
>>  	pool->xdp_zc_max_segs = netdev->xdp_zc_max_segs;
>>  	return 0;
>>
>>  err_unreg_xsk:
>>  	xp_disable_drv_zc(pool);
>>  err_unreg_pool:
>> +	pool->umem->zc = false;
>>  	if (!force_zc)
>>  		err = 0; /* fallback to copy mode */
>>  	if (err) {
>
> I'm not super familiar with the shared umem patch, but is it safe to
> unconditionally undo pool->umem->zc = false here? xp_assign_dev_shared
> looks at this umem->zc flag.. Presumably other places do as well on
> teardown?

Good catch!

I can elaborate a bit: the zero-copy property of a umem is shared between
all users (sockets) of that umem. IOW, all sockets sharing a umem
inherit whatever the first socket negotiated.

So, we could get into something like:

1. Socket A binds queue 0, ndo_bpf OK (umem->zc = true)
2. Socket B binds queue 1 via xp_assign_dev_shared()
   - reads umem->zc == true, so flags = XDP_ZEROCOPY
   - xp_assign_dev() sets umem->zc = true
   - ndo_bpf() NOK for queue 1 -> error path: umem->zc = false (oops)
3. Socket A is still active on queue 0 in ZC mode, but umem->zc is now
   false

...and we'll have a bunch of checks on umem->zc that now see incorrect
state.
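
To make the sequence concrete, here is a minimal userspace model of it.
This is only a sketch: the model_* structs and functions are made up for
illustration and are not the kernel's xdp_umem/xsk_buff_pool or the real
xp_assign_dev(); the ndo_bpf() outcome is simply passed in as a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace model only; these are NOT the kernel structures. */
struct model_umem {
	bool zc;		/* shared by every pool using this umem */
};

struct model_pool {
	struct model_umem *umem;
	bool bound_zc;		/* mode this pool's queue actually runs in */
};

/* Stand-in for xp_assign_dev() with the patch applied: the shared flag
 * is set before the driver call and cleared on the error path.
 */
static int model_assign_dev(struct model_pool *pool, bool ndo_bpf_ok)
{
	assert(pool && pool->umem);

	pool->umem->zc = true;
	if (ndo_bpf_ok) {
		pool->bound_zc = true;
		return 0;
	}

	pool->umem->zc = false;	/* unconditional reset: the problem */
	pool->bound_zc = false;
	return -1;
}

/* Returns true if socket A is ZC-bound while umem->zc reads false. */
static bool model_shared_umem_scenario(void)
{
	struct model_umem umem = { .zc = false };
	struct model_pool a = { .umem = &umem };
	struct model_pool b = { .umem = &umem };

	model_assign_dev(&a, true);	/* step 1: queue 0, ndo_bpf OK */
	model_assign_dev(&b, false);	/* step 2: queue 1, ndo_bpf NOK */

	printf("A bound_zc=%d, umem->zc=%d\n", a.bound_zc, umem.zc);
	return a.bound_zc && !umem.zc;	/* step 3: inconsistent state */
}
```

Calling model_shared_umem_scenario() returns true in this model, i.e.
pool A still believes it runs ZC while the shared flag says otherwise.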

From this it follows that the zc flag shouldn't be toggled on a shared
resource without checking whether other consumers exist. I think a
per-pool zc flag, or something along those lines, is needed here. :-(
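
As a rough illustration of the per-pool idea, again as a userspace model
and not a proposed patch: the sketch_* names and the pool-level zc field
below are hypothetical. The point is only that a failed bind then undoes
state that no other socket can observe. How xp_assign_dev_shared() would
pick its mode in such a scheme is left open here.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; a hypothetical layout, not kernel code. */
struct sketch_pool {
	bool zc;	/* hypothetical per-pool zero-copy flag */
};

/* The flag is set before the (modelled) driver call so the driver could
 * size its Rx buffers from it, and cleared again on failure.
 */
static int sketch_assign_dev(struct sketch_pool *pool, bool ndo_bpf_ok)
{
	assert(pool);

	pool->zc = true;
	if (ndo_bpf_ok)
		return 0;

	pool->zc = false;	/* only this pool's state is undone */
	return -1;
}

/* Returns true if pool A stays ZC after pool B's bind fails. */
static bool sketch_per_pool_scenario(void)
{
	struct sketch_pool a = { .zc = false };
	struct sketch_pool b = { .zc = false };

	sketch_assign_dev(&a, true);	/* socket A, queue 0: OK */
	sketch_assign_dev(&b, false);	/* socket B, queue 1: fails */

	return a.zc && !b.zc;
}
```

In this model sketch_per_pool_scenario() returns true: B's failure leaves
A's flag untouched.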

Björn