From: Jakub Sitnicki
To: Martin KaFai Lau
Cc: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
 Simon Horman, Daniel Borkmann, John Fastabend, Stanislav Fomichev,
 Alexei Starovoitov, Andrii Nakryiko, Eduard Zingerman, Song Liu,
 Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Arthur Fabre,
 bpf@vger.kernel.org, netdev@vger.kernel.org, kernel-team@cloudflare.com
Subject: Re: [PATCH bpf-next v2 11/15] selftests/bpf: Expect unclone to preserve skb metadata
In-Reply-To: <2753c96b-48f9-480e-923c-60d2c20ebb03@linux.dev> (Martin KaFai Lau's message of "Wed, 22 Oct 2025 16:12:39 -0700")
References: <20251019-skb-meta-rx-path-v2-0-f9a58f3eb6d6@cloudflare.com>
 <20251019-skb-meta-rx-path-v2-11-f9a58f3eb6d6@cloudflare.com>
 <2753c96b-48f9-480e-923c-60d2c20ebb03@linux.dev>
Date: Thu, 23 Oct 2025 13:55:51 +0200
Message-ID: <87ms5hvnlk.fsf@cloudflare.com>

On Wed, Oct 22, 2025 at 04:12 PM -07, Martin KaFai Lau wrote:
> On 10/19/25 5:45 AM, Jakub Sitnicki wrote:
>> @@ -447,12 +448,14 @@ int clone_dynptr_empty_on_meta_slice_write(struct __sk_buff *ctx)
>>
>>  /*
>>   * Check that skb_meta dynptr is read-only before prog writes to packet payload
>> - * using dynptr_write helper. Applies only to cloned skbs.
>> + * using dynptr_write helper, and becomes read-write afterwards. Applies only to
>> + * cloned skbs.
>>   */
>>  SEC("tc")
>> -int clone_dynptr_rdonly_before_data_dynptr_write(struct __sk_buff *ctx)
>> +int clone_dynptr_rdonly_before_data_dynptr_write_then_rw(struct __sk_buff *ctx)
>>  {
>>  	struct bpf_dynptr data, meta;
>> +	__u8 meta_have[META_SIZE];
>>  	const struct ethhdr *eth;
>>
>>  	bpf_dynptr_from_skb(ctx, 0, &data);
>> @@ -465,15 +468,23 @@ int clone_dynptr_rdonly_before_data_dynptr_write(struct __sk_buff *ctx)
>>  	/* Expect read-only metadata before unclone */
>>  	bpf_dynptr_from_skb_meta(ctx, 0, &meta);
>> -	if (!bpf_dynptr_is_rdonly(&meta) || bpf_dynptr_size(&meta) != META_SIZE)
>> +	if (!bpf_dynptr_is_rdonly(&meta))
>
> Can the bpf_dynptr_set_rdonly() be lifted from the bpf_dynptr_from_skb_meta()?
>
> iiuc, the remaining thing left should be handling a cloned skb in
> __bpf_dynptr_write()? The __bpf_skb_store_bytes() is using
> bpf_try_make_writable, so maybe something similar can be done for the
> BPF_DYNPTR_TYPE_SKB_META?

I'm with you. This is not user-friendly at all currently.
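
Roughly what I picture for the BPF_DYNPTR_TYPE_SKB_META case in
__bpf_dynptr_write(), as an untested sketch. I'm assuming the meta
dynptr carries the skb pointer in dynptr->data like the SKB flavor
does, and that skb_unclone() is an acceptable way to force the copy:

	case BPF_DYNPTR_TYPE_SKB_META: {
		struct sk_buff *skb = dst->data;
		u8 *meta;

		if (flags)
			return -EINVAL;
		/* Unclone before writing, so that the write cannot be
		 * observed through the clones. With patch #1 in place,
		 * pskb_expand_head() preserves the metadata across the
		 * unclone.
		 */
		if (skb_unclone(skb, GFP_ATOMIC))
			return -EFAULT;
		meta = skb_metadata_end(skb) - skb_metadata_len(skb);
		memcpy(meta + dst->offset + offset, src, len);
		return 0;
	}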
This patch set has already gotten quite long, so how about I split out
the pskb_expand_head patch (#1) and the related selftest change (patch
#11) from this series, expand it to lift the bpf_dynptr_set_rdonly()
limitation for the skb_meta dynptr, and do that first in a dedicated
series?

>
>> +		goto out;
>> +
>> +	bpf_dynptr_read(meta_have, META_SIZE, &meta, 0, 0);
>> +	if (!check_metadata(meta_have))
>>  		goto out;
>>
>>  	/* Helper write to payload will unclone the packet */
>>  	bpf_dynptr_write(&data, offsetof(struct ethhdr, h_proto), "x", 1, 0);
>>
>> -	/* Expect no metadata after unclone */
>> +	/* Expect r/w metadata after unclone */
>>  	bpf_dynptr_from_skb_meta(ctx, 0, &meta);
>> -	if (bpf_dynptr_is_rdonly(&meta) || bpf_dynptr_size(&meta) != 0)
>> +	if (bpf_dynptr_is_rdonly(&meta))
>
> then it does not have to rely on the bpf_dynptr_write(&data, ...) above to make
> the metadata writable.
>
> I have a high level question about the set. I assume the skb_data_move() in
> patch 2 will be useful in the future to preserve the metadata across the
> stack. Preserving the metadata across different tc progs (which this set does)
> is nice to have but it is not the end goal. Can you shed some light on the plan
> for building on top of this set?

Right. Starting at the highest level, I want to work toward preserving
the metadata on the RX path first (ongoing), the forward path next, and
the TX path last.

On the RX path, the end game is for the sk_filter prog to be able to
access metadata through a dynptr. For that we need to know where the
metadata resides. I see two ways we can tackle that:

A) We keep relying on the metadata being in front of skb_mac_header().

   Fun fact: if you don't call any TC BPF helpers that touch
   skb->mac_header and don't have any tunnel or tagging devices on the
   RX path, this works out of the box today.

   But we need to make sure that any call site that changes the MAC
   header offset also moves the metadata. I expect this approach will
   be a pain on the TX path.

... or ...

B) We track the metadata offset separately from the MAC header offset.

   This requires additional state: we need to store the metadata
   offset somewhere. However, in exchange for a couple of bytes we
   gain some benefits:

   1. We don't need to move the metadata after skb_pull.

   2. We only need to move the metadata for skb_push if there's not
      enough space left, that is, the gap between skb->data and where
      the metadata ends is too small.

      (This means that anyone who is not using skb->data_meta on the
      RX path but the skb_meta dynptr instead can avoid any memmove's
      of the metadata itself.)

   3. We can place the metadata at skb->head, which plays nicely with
      the TX path, where we need the headroom for pushing headers.

I've been trying out how (B) plays out when safe-proofing the tunnel &
tagging devices, your VLANs and GREs, to preserve the metadata.

To that end I've added a new u16 field in skb_shinfo to track
meta_end. There's a 4-byte hole there currently, and we load the whole
cacheline from skb_shinfo to access meta_len anyway.
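
To illustrate the idea (a sketch, not the exact code from the branch):
the new u16 meta_end in skb_shinfo holds the offset of the end of the
metadata area relative to skb->head, and skb_metadata_end() gets
redefined in terms of it instead of skb_mac_header():

/* Sketch: meta_end fills the existing 4-byte hole in
 * struct skb_shared_info.
 */
static inline void *skb_metadata_end(const struct sk_buff *skb)
{
	return skb->head + skb_shinfo(skb)->meta_end;
}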

Once I had that, I could modify skb_data_move() to relocate the
metadata only if necessary, which looks like so:

static inline void skb_data_move(struct sk_buff *skb, const int len,
				 const unsigned int n)
{
	const u8 meta_len = skb_metadata_len(skb);
	u8 *meta, *meta_end;

	if (!len || (!n && !meta_len))
		return;

	if (!meta_len)
		goto no_metadata;

	/* Not enough headroom left for metadata. Drop it. */
	if (WARN_ON_ONCE(meta_len > skb_headroom(skb))) {
		skb_metadata_clear(skb);
		goto no_metadata;
	}

	meta_end = skb_metadata_end(skb);
	meta = meta_end - meta_len;

	/* Metadata in front of data before push/pull. Keep it that way. */
	if (meta_end == skb->data - len) {
		memmove(meta + len, meta, meta_len + n);
		skb_shinfo(skb)->meta_end += len;
		return;
	}

	if (len < 0) {
		/* Data pushed. Move metadata to the top. */
		memmove(skb->head, meta, meta_len);
		skb_shinfo(skb)->meta_end = meta_len;
	}

no_metadata:
	memmove(skb->data, skb->data - len, n);
}

The goal for the RX path is to hit just the last memmove() everywhere,
since we will usually be pulling from skb->data, if you're not using
the skb->data_meta pseudo-pointer in your TC(X) BPF programs.

There are some verifier changes needed to keep skb->data_meta working.
We need to move the metadata back in front of the MAC header before a
TC(X) prog that uses skb->data_meta runs, or things break.

Early code for that is also available for a preview. I've pushed it to:

https://github.com/jsitnicki/linux/commits/skb-meta/safeproof-netdevs/

Thanks,
-jkbs
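
P.S. For completeness, the access pattern that avoids the metadata
memmove's on the RX path is going through the skb_meta dynptr instead
of skb->data_meta, along the lines of the selftests above (sketch;
META_SIZE and check_metadata() as defined there):

SEC("tc")
int read_meta_via_dynptr(struct __sk_buff *ctx)
{
	__u8 meta_have[META_SIZE];
	struct bpf_dynptr meta;

	/* No slice into the metadata area, so no dependency on where
	 * the metadata currently resides.
	 */
	bpf_dynptr_from_skb_meta(ctx, 0, &meta);
	if (bpf_dynptr_read(meta_have, META_SIZE, &meta, 0, 0))
		return TC_ACT_SHOT;
	if (!check_metadata(meta_have))
		return TC_ACT_SHOT;
	return TC_ACT_OK;
}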