* Re: [syzbot] KASAN: use-after-free Read in tcp_retransmit_timer (5) [not found] ` <c389e47f-8f82-fd62-8c1d-d9481d2f71ff@I-love.SAKURA.ne.jp> @ 2022-04-22 14:40 ` Tetsuo Handa 2022-04-24 3:57 ` Tetsuo Handa 0 siblings, 1 reply; 30+ messages in thread From: Tetsuo Handa @ 2022-04-22 14:40 UTC (permalink / raw) To: Santosh Shilimkar, OFED mailing list Cc: syzbot, andrii, andriin, ast, daniel, davem, dsahern, edumazet, john.fastabend, kafai, kpsingh, kuba, kuznet, netdev, songliubraving, syzkaller-bugs, tpa, yhs, yoshfuji, bpf Hello, RDS developers. I was thinking that BPF program is relevant with the TCP/IPv6 socket triggering use-after-free access. But disassembling syzkaller-generated BPF program concluded that what "char program[2053]" is doing is not important ( https://lkml.kernel.org/r/d21e278f-a3ff-8603-f6ba-b51a8cddafa8@I-love.SAKURA.ne.jp ). Then, I realized that TCP/IPv6 port 16385 (which the reproducer is accessing) is used by kernel RDS server, which can explain "It seems that a socket with sk->sk_net_refcnt=0 is created by unshare(CLONE_NEWNET)" at https://lkml.kernel.org/r/fa445f0e-32b7-5e0d-9326-94bc5adba4c1@I-love.SAKURA.ne.jp because the kernel RDS server starts during boot procedure. ------------------------------------------------------------ root@fuzz:~# unshare -n netstat -tanpe Active Internet connections (servers and established) Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name tcp6 0 0 :::16385 :::* LISTEN 0 19627 - ------------------------------------------------------------ With the debug printk() patch shown below, ------------------------------------------------------------ diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 0ec2f5906a27..20b3c42b4140 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -429,7 +429,8 @@ static void net_free(struct net *net) { if (refcount_dec_and_test(&net->passive)) { kfree(rcu_access_pointer(net->gen)); - kmem_cache_free(net_cachep, net); + memset(net, POISON_FREE, sizeof(struct net)); + //kmem_cache_free(net_cachep, net); } } diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c index 09cadd556d1e..5792fe3df8ac 100644 --- a/net/rds/tcp_listen.c +++ b/net/rds/tcp_listen.c @@ -146,10 +146,9 @@ int rds_tcp_accept_one(struct socket *sock) my_addr = &saddr; peer_addr = &daddr; #endif - rdsdebug("accepted family %d tcp %pI6c:%u -> %pI6c:%u\n", - sock->sk->sk_family, - my_addr, ntohs(inet->inet_sport), - peer_addr, ntohs(inet->inet_dport)); + pr_info("accepted family %d tcp %pI6c:%u -> %pI6c:%u refcnt=%d sock_net=%px init_net=%px\n", + sock->sk->sk_family, my_addr, ntohs(inet->inet_sport), peer_addr, + ntohs(inet->inet_dport), sock->sk->sk_net_refcnt, sock_net(sock->sk), &init_net); #if IS_ENABLED(CONFIG_IPV6) /* sk_bound_dev_if is not set if the peer address is not link local ------------------------------------------------------------ I get accepted family 10 tcp ::ffff:127.0.0.1:16385 -> ::ffff:127.0.0.1:33086 refcnt=0 sock_net=ffffffff860d89c0 init_net=ffffffff860d89c0 if I do # echo > /dev/tcp/127.0.0.1/16385 from init_net namespace, and I get accepted family 10 tcp ::ffff:127.0.0.1:16385 -> ::ffff:127.0.0.1:33088 refcnt=0 sock_net=ffff88810a208000 init_net=ffffffff860d89c0 if I do # echo > /dev/tcp/127.0.0.1/16385 from non-init_net namespace. Note that sock->sk->sk_net_refcnt is 0 in both cases. Like commit 2303f994b3e18709 ("mptcp: Associate MPTCP context with TCP socket") says /* kernel sockets do not by default acquire net ref, but TCP timer * needs it. */ , I came to feel that e.g. rds_tcp_accept_one() is accessing sock_net(sock->sk) on accepted sockets with sock->sk->sk_net_refcnt=0 (because the listening socket was created by kernel) is causing this problem. Why not rds kernel server does sock->sk->sk_net_refcnt = 1; get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL); sock_inuse_add(net, 1); on accepted sockets like mptcp_subflow_create_socket() does? For your testing, below is the latest reproducer. You can try this reproducer with keep-memory-poisoned patch shown above. ------------------------------------------------------------ // https://syzkaller.appspot.com/bug?id=8f0e04b2beffcd42f044d46879cc224f6eb71a99 // autogenerated by syzkaller (https://github.com/google/syzkaller) #define _GNU_SOURCE #include <arpa/inet.h> #include <endian.h> #include <errno.h> #include <fcntl.h> #include <net/if.h> #include <pthread.h> #include <stdbool.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/ioctl.h> #include <sys/socket.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> #include <linux/bpf.h> #include <linux/if_ether.h> #include <linux/netlink.h> #include <linux/rtnetlink.h> #ifndef MSG_PROBE #define MSG_PROBE 0x10 #endif struct nlmsg { char* pos; int nesting; struct nlattr* nested[8]; char buf[4096]; }; static void netlink_init(struct nlmsg* nlmsg, int typ, int flags, const void* data, int size) { memset(nlmsg, 0, sizeof(*nlmsg)); struct nlmsghdr* hdr = (struct nlmsghdr*)nlmsg->buf; hdr->nlmsg_type = typ; hdr->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | flags; memcpy(hdr + 1, data, size); nlmsg->pos = (char*)(hdr + 1) + NLMSG_ALIGN(size); } static void netlink_attr(struct nlmsg* nlmsg, int typ, const void* data, int size) { struct nlattr* attr = (struct nlattr*)nlmsg->pos; attr->nla_len = sizeof(*attr) + size; attr->nla_type = typ; if (size > 0) memcpy(attr + 1, data, size); nlmsg->pos += NLMSG_ALIGN(attr->nla_len); } static int netlink_send_ext(struct nlmsg* nlmsg, int sock, uint16_t reply_type, int* reply_len, bool dofail) { if (nlmsg->pos > nlmsg->buf + sizeof(nlmsg->buf) || nlmsg->nesting) exit(1); struct nlmsghdr* hdr = (struct nlmsghdr*)nlmsg->buf; hdr->nlmsg_len = nlmsg->pos - nlmsg->buf; struct sockaddr_nl addr; memset(&addr, 0, sizeof(addr)); addr.nl_family = AF_NETLINK; ssize_t n = sendto(sock, nlmsg->buf, hdr->nlmsg_len, 0, (struct sockaddr*)&addr, sizeof(addr)); if (n != (ssize_t)hdr->nlmsg_len) { if (dofail) exit(1); return -1; } n = recv(sock, nlmsg->buf, sizeof(nlmsg->buf), 0); if (reply_len) *reply_len = 0; if (n < 0) { if (dofail) exit(1); return -1; } if (n < (ssize_t)sizeof(struct nlmsghdr)) { errno = EINVAL; if (dofail) exit(1); return -1; } if (hdr->nlmsg_type == NLMSG_DONE) return 0; if (reply_len && hdr->nlmsg_type == reply_type) { *reply_len = n; return 0; } if (n < (ssize_t)(sizeof(struct nlmsghdr) + sizeof(struct nlmsgerr))) { errno = EINVAL; if (dofail) exit(1); return -1; } if (hdr->nlmsg_type != NLMSG_ERROR) { errno = EINVAL; if (dofail) exit(1); return -1; } errno = -((struct nlmsgerr*)(hdr + 1))->error; return -errno; } static int netlink_send(struct nlmsg* nlmsg, int sock) { return netlink_send_ext(nlmsg, sock, 0, NULL, true); } static void netlink_device_change(int sock, const char* name, const void* mac, int macsize) { struct nlmsg nlmsg; struct ifinfomsg hdr; memset(&hdr, 0, sizeof(hdr)); hdr.ifi_flags = hdr.ifi_change = IFF_UP; hdr.ifi_index = if_nametoindex(name); netlink_init(&nlmsg, RTM_NEWLINK, 0, &hdr, sizeof(hdr)); netlink_attr(&nlmsg, IFLA_ADDRESS, mac, macsize); netlink_send(&nlmsg, sock); } static void netlink_add_addr(int sock, const char* dev, const void* addr, int addrsize) { struct nlmsg nlmsg; struct ifaddrmsg hdr; memset(&hdr, 0, sizeof(hdr)); hdr.ifa_family = addrsize == 4 ? AF_INET : AF_INET6; hdr.ifa_prefixlen = addrsize == 4 ? 24 : 120; hdr.ifa_scope = RT_SCOPE_UNIVERSE; hdr.ifa_index = if_nametoindex(dev); netlink_init(&nlmsg, RTM_NEWADDR, NLM_F_CREATE | NLM_F_REPLACE, &hdr, sizeof(hdr)); netlink_attr(&nlmsg, IFA_LOCAL, addr, addrsize); netlink_attr(&nlmsg, IFA_ADDRESS, addr, addrsize); netlink_send(&nlmsg, sock); } static void netlink_add_addr4(int sock, const char* dev, const char* addr) { struct in_addr in_addr; inet_pton(AF_INET, addr, &in_addr); netlink_add_addr(sock, dev, &in_addr, sizeof(in_addr)); } static void netlink_add_addr6(int sock, const char* dev, const char* addr) { struct in6_addr in6_addr; inet_pton(AF_INET6, addr, &in6_addr); netlink_add_addr(sock, dev, &in6_addr, sizeof(in6_addr)); } static void initialize_netdevices(void) { int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE); uint64_t macaddr = 0x00aaaaaaaaaa; if (fd == EOF) exit(1); netlink_add_addr4(fd, "lo", "127.0.0.1"); netlink_add_addr6(fd, "lo", "::1"); netlink_device_change(fd, "lo", &macaddr, ETH_ALEN); close(fd); } #ifndef __NR_bpf #define __NR_bpf 321 #endif static void execute_one(void) { const union bpf_attr attr = { .prog_type = BPF_PROG_TYPE_SOCKET_FILTER, .insn_cnt = 2, .insns = (unsigned long long) "\xb7\x00\x00\x00\x00\x00\x00\x00\x95\x00\x00\x00\x00\x00\x00\x00", .license = (unsigned long long) "GPL", }; struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(0x4001), /* where kernel RDS TCPv6 socket is listening */ .sin_addr.s_addr = inet_addr("127.0.0.1") }; const struct msghdr msg = { .msg_name = &addr, .msg_namelen = sizeof(addr), }; const int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, 72); const int sock_fd = socket(PF_INET, SOCK_STREAM, 0); alarm(3); while (1) { sendmsg(sock_fd, &msg, MSG_OOB | MSG_PROBE | MSG_CONFIRM | MSG_FASTOPEN); setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_BPF, &bpf_fd, sizeof(bpf_fd)); } } int main(int argc, char *argv[]) { if (unshare(CLONE_NEWNET)) return 1; initialize_netdevices(); execute_one(); return 0; } ------------------------------------------------------------ ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [syzbot] KASAN: use-after-free Read in tcp_retransmit_timer (5) 2022-04-22 14:40 ` [syzbot] KASAN: use-after-free Read in tcp_retransmit_timer (5) Tetsuo Handa @ 2022-04-24 3:57 ` Tetsuo Handa 2022-05-01 15:29 ` [PATCH] net: rds: acquire refcount on TCP sockets Tetsuo Handa 0 siblings, 1 reply; 30+ messages in thread From: Tetsuo Handa @ 2022-04-24 3:57 UTC (permalink / raw) To: Santosh Shilimkar, OFED mailing list Cc: syzbot, andrii, andriin, ast, daniel, davem, dsahern, edumazet, john.fastabend, kafai, kpsingh, kuba, kuznet, netdev, songliubraving, syzkaller-bugs, tpa, yhs, yoshfuji, bpf OK. I succeeded to reproduce this problem without BPF program. Just dropping TCP packets is sufficient. That is, this bug should be fixed in RDS code. ------------------------------------------------------------ root@fuzz:~# unshare -n sh -c ' ip link set lo up iptables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP ip6tables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP telnet 127.0.0.1 16385 dmesg -c netstat -tanpe' < /dev/null Trying 127.0.0.1... Connected to 127.0.0.1. Escape character is '^]'. Connection closed by foreign host. [ 54.922280] accepted family 10 tcp ::ffff:127.0.0.1:16385 -> ::ffff:127.0.0.1:58780 refcnt=0 sock_net=ffff888035c98000 init_net=ffffffff860d89c0 Active Internet connections (servers and established) Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name tcp 0 1 127.0.0.1:58780 127.0.0.1:16385 FIN_WAIT1 0 0 - tcp6 0 0 :::16385 :::* LISTEN 0 18301 - tcp6 1 1 127.0.0.1:16385 127.0.0.1:58780 LAST_ACK 0 0 - ------------------------------------------------------------ ------------------------------------------------------------ fuzz login: [ 54.849128][ T2718] ip (2718) used greatest stack depth: 11192 bytes left [ 54.922280][ T764] accepted family 10 tcp ::ffff:127.0.0.1:16385 -> ::ffff:127.0.0.1:58780 refcnt=0 sock_net=ffff888035c98000 init_net=ffffffff860d89c0 [ 224.330990][ C0] general protection fault, probably for non-canonical address 0x6b6af3ebe92b6bc3: 0000 [#1] PREEMPT SMP [ 224.344491][ C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.18.0-rc3-00016-gb253435746d9-dirty #767 [ 224.355974][ C0] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 224.361184][ C0] RIP: 0010:__tcp_transmit_skb+0x5e5/0xbf0 [ 224.364559][ C0] Code: 0f 84 33 05 00 00 4c 89 2c 24 49 89 c5 48 c7 40 10 00 00 00 00 e9 c0 fa ff ff 49 8b 46 30 41 0f b7 55 30 48 8b 80 b8 02 00 00 <65> 48 01 50 58 e9 8e fe ff ff 41 8b 86 fc 08 00 00 48 69 c0 e8 03 [ 224.375318][ C0] RSP: 0018:ffffc90000003d38 EFLAGS: 00010297 [ 224.378682][ C0] RAX: 6b6b6b6b6b6b6b6b RBX: 000000009e2a2659 RCX: ffff888104a39000 [ 224.383253][ C0] RDX: 0000000000000001 RSI: ffff8881008054e0 RDI: ffff888035340000 [ 224.387171][ C0] RBP: ffff888100805508 R08: 0000000000000000 R09: 0000000000000000 [ 224.389612][ C0] R10: ffff888104a39140 R11: 0000000000000000 R12: 0000000000000001 [ 224.392646][ C0] R13: ffff8881008054e0 R14: ffff888035340000 R15: 0000000000000020 [ 224.395626][ C0] FS: 0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 224.398662][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 224.400880][ C0] CR2: 000056264812f99c CR3: 000000000a58e000 CR4: 00000000000506f0 [ 224.403964][ C0] Call Trace: [ 224.405212][ C0] <IRQ> [ 224.406355][ C0] ? tcp_write_timer_handler+0x280/0x280 [ 224.408259][ C0] tcp_write_wakeup+0x112/0x160 [ 224.409932][ C0] ? ktime_get+0x1cb/0x260 [ 224.411636][ C0] tcp_send_probe0+0x13/0x150 [ 224.413393][ C0] tcp_write_timer_handler+0x248/0x280 [ 224.415433][ C0] tcp_write_timer+0xa5/0x110 [ 224.417040][ C0] ? tcp_write_timer_handler+0x280/0x280 [ 224.419142][ C0] call_timer_fn+0xa6/0x300 [ 224.420949][ C0] __run_timers.part.0+0x209/0x320 [ 224.422915][ C0] run_timer_softirq+0x2c/0x60 [ 224.424791][ C0] __do_softirq+0x174/0x53f [ 224.426462][ C0] __irq_exit_rcu+0xcb/0x120 [ 224.428188][ C0] irq_exit_rcu+0x5/0x20 [ 224.430176][ C0] sysvec_apic_timer_interrupt+0x8e/0xc0 [ 224.432301][ C0] </IRQ> [ 224.433394][ C0] <TASK> [ 224.434514][ C0] asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 224.436500][ C0] RIP: 0010:default_idle+0xb/0x10 [ 224.438220][ C0] Code: 8b 04 25 40 af 01 00 f0 80 60 02 df c3 0f ae f0 0f ae 38 0f ae f0 eb b9 0f 1f 80 00 00 00 00 eb 07 0f 00 2d e3 b6 56 00 fb f4 <c3> cc cc cc cc 53 48 89 fb e8 67 fb fe ff 48 8b 15 a0 91 4e 02 89 [ 224.444865][ C0] RSP: 0018:ffffffff83e03ea8 EFLAGS: 00000202 [ 224.447077][ C0] RAX: 00000000000223b5 RBX: ffffffff83e61a00 RCX: 0000000000000001 [ 224.449957][ C0] RDX: 0000000000000000 RSI: ffffffff832e9bf1 RDI: ffffffff83246666 [ 224.452916][ C0] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001 [ 224.455677][ C0] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 224.458458][ C0] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 224.461642][ C0] default_idle_call+0x54/0x90 [ 224.463888][ C0] do_idle+0x1f3/0x240 [ 224.465531][ C0] cpu_startup_entry+0x14/0x20 [ 224.467193][ C0] start_kernel+0x69c/0x6c1 [ 224.469040][ C0] secondary_startup_64_no_verify+0xc3/0xcb [ 224.471179][ C0] </TASK> [ 224.472438][ C0] Modules linked in: [ 224.474387][ C0] ---[ end trace 0000000000000000 ]--- [ 224.476521][ C0] RIP: 0010:__tcp_transmit_skb+0x5e5/0xbf0 [ 224.478893][ C0] Code: 0f 84 33 05 00 00 4c 89 2c 24 49 89 c5 48 c7 40 10 00 00 00 00 e9 c0 fa ff ff 49 8b 46 30 41 0f b7 55 30 48 8b 80 b8 02 00 00 <65> 48 01 50 58 e9 8e fe ff ff 41 8b 86 fc 08 00 00 48 69 c0 e8 03 [ 224.485948][ C0] RSP: 0018:ffffc90000003d38 EFLAGS: 00010297 [ 224.488110][ C0] RAX: 6b6b6b6b6b6b6b6b RBX: 000000009e2a2659 RCX: ffff888104a39000 [ 224.491186][ C0] RDX: 0000000000000001 RSI: ffff8881008054e0 RDI: ffff888035340000 [ 224.494378][ C0] RBP: ffff888100805508 R08: 0000000000000000 R09: 0000000000000000 [ 224.497576][ C0] R10: ffff888104a39140 R11: 0000000000000000 R12: 0000000000000001 [ 224.500600][ C0] R13: ffff8881008054e0 R14: ffff888035340000 R15: 0000000000000020 [ 224.503814][ C0] FS: 0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 224.507136][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 224.509421][ C0] CR2: 000056264812f99c CR3: 000000000a58e000 CR4: 00000000000506f0 [ 224.512699][ C0] Kernel panic - not syncing: Fatal exception in interrupt [ 224.515847][ C0] Kernel Offset: disabled [ 224.517636][ C0] Rebooting in 10 seconds.. ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] net: rds: acquire refcount on TCP sockets 2022-04-24 3:57 ` Tetsuo Handa @ 2022-05-01 15:29 ` Tetsuo Handa 2022-05-01 16:14 ` Eric Dumazet 0 siblings, 1 reply; 30+ messages in thread From: Tetsuo Handa @ 2022-05-01 15:29 UTC (permalink / raw) To: Santosh Shilimkar, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni Cc: syzbot, netdev, syzkaller-bugs, OFED mailing list syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], for TCP socket used by RDS is accessing sock_net() without acquiring a refcount on net namespace. Since TCP's retransmission can happen after a process which created net namespace terminated, we need to explicitly acquire a refcount. Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> --- net/rds/tcp.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 5327d130c4b5..8015d2695784 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -493,6 +493,15 @@ void rds_tcp_tune(struct socket *sock) struct net *net = sock_net(sk); struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); + /* TCP timer functions might access net namespace even after + * a process which created this net namespace terminated. + */ + if (!sk->sk_net_refcnt) { + sk->sk_net_refcnt = 1; + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); + sock_inuse_add(net, 1); + } + tcp_sock_set_nodelay(sock->sk); lock_sock(sk); if (rtn->sndbuf_size > 0) { -- 2.34.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] net: rds: acquire refcount on TCP sockets 2022-05-01 15:29 ` [PATCH] net: rds: acquire refcount on TCP sockets Tetsuo Handa @ 2022-05-01 16:14 ` Eric Dumazet 2022-05-02 1:40 ` [PATCH v2] " Tetsuo Handa 0 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2022-05-01 16:14 UTC (permalink / raw) To: Tetsuo Handa Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs, OFED mailing list On Sun, May 1, 2022 at 8:29 AM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > for TCP socket used by RDS is accessing sock_net() without acquiring a > refcount on net namespace. Since TCP's retransmission can happen after > a process which created net namespace terminated, we need to explicitly > acquire a refcount. > Please add a Fixes: tag > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > --- > net/rds/tcp.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > index 5327d130c4b5..8015d2695784 100644 > --- a/net/rds/tcp.c > +++ b/net/rds/tcp.c > @@ -493,6 +493,15 @@ void rds_tcp_tune(struct socket *sock) > struct net *net = sock_net(sk); > struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); > > + /* TCP timer functions might access net namespace even after > + * a process which created this net namespace terminated. > + */ Please move this after the lock_sock(sk) [1], so that we are protected correctly ? > + if (!sk->sk_net_refcnt) { > + sk->sk_net_refcnt = 1; > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > + sock_inuse_add(net, 1); > + } > + > tcp_sock_set_nodelay(sock->sk); > lock_sock(sk); [1] Here. > if (rtn->sndbuf_size > 0) { > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-01 16:14 ` Eric Dumazet @ 2022-05-02 1:40 ` Tetsuo Handa 2022-05-02 14:12 ` Haakon Bugge ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Tetsuo Handa @ 2022-05-02 1:40 UTC (permalink / raw) To: Eric Dumazet Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs, OFED mailing list syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], for TCP socket used by RDS is accessing sock_net() without acquiring a refcount on net namespace. Since TCP's retransmission can happen after a process which created net namespace terminated, we need to explicitly acquire a refcount. Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> --- Changes in v2: Add Fixes: tag. Move to inside lock_sock() section. I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport to RDS") was added to 2.6.32. net/rds/tcp.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 5327d130c4b5..2f638f8b7b1e 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) tcp_sock_set_nodelay(sock->sk); lock_sock(sk); + /* TCP timer functions might access net namespace even after + * a process which created this net namespace terminated. + */ + if (!sk->sk_net_refcnt) { + sk->sk_net_refcnt = 1; + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); + sock_inuse_add(net, 1); + } if (rtn->sndbuf_size > 0) { sk->sk_sndbuf = rtn->sndbuf_size; sk->sk_userlocks |= SOCK_SNDBUF_LOCK; -- 2.34.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-02 1:40 ` [PATCH v2] " Tetsuo Handa @ 2022-05-02 14:12 ` Haakon Bugge 2022-05-02 14:29 ` Tetsuo Handa 2022-05-03 9:02 ` Paolo Abeni 2022-05-03 11:40 ` patchwork-bot+netdevbpf 2 siblings, 1 reply; 30+ messages in thread From: Haakon Bugge @ 2022-05-02 14:12 UTC (permalink / raw) To: Tetsuo Handa Cc: Eric Dumazet, Santosh Shilimkar, David S. Miller, Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs, OFED mailing list > On 2 May 2022, at 03:40, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > for TCP socket used by RDS is accessing sock_net() without acquiring a > refcount on net namespace. Since TCP's retransmission can happen after > a process which created net namespace terminated, we need to explicitly > acquire a refcount. > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > --- > Changes in v2: > Add Fixes: tag. > Move to inside lock_sock() section. > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > to RDS") was added to 2.6.32. > > net/rds/tcp.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > index 5327d130c4b5..2f638f8b7b1e 100644 > --- a/net/rds/tcp.c > +++ b/net/rds/tcp.c > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > tcp_sock_set_nodelay(sock->sk); > lock_sock(sk); > + /* TCP timer functions might access net namespace even after > + * a process which created this net namespace terminated. > + */ > + if (!sk->sk_net_refcnt) { > + sk->sk_net_refcnt = 1; > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); Don't you need a corresponding put_net_track()? Thxs, Håkon > + sock_inuse_add(net, 1); > + } > if (rtn->sndbuf_size > 0) { > sk->sk_sndbuf = rtn->sndbuf_size; > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > -- > 2.34.1 > > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-02 14:12 ` Haakon Bugge @ 2022-05-02 14:29 ` Tetsuo Handa 0 siblings, 0 replies; 30+ messages in thread From: Tetsuo Handa @ 2022-05-02 14:29 UTC (permalink / raw) To: Haakon Bugge Cc: Eric Dumazet, Santosh Shilimkar, David S. Miller, Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs, OFED mailing list On 2022/05/02 23:12, Haakon Bugge wrote: >> + /* TCP timer functions might access net namespace even after >> + * a process which created this net namespace terminated. >> + */ >> + if (!sk->sk_net_refcnt) { >> + sk->sk_net_refcnt = 1; >> + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > > Don't you need a corresponding put_net_track()? __sk_free() and __sk_destruct() will do if sk->sk_net_refcnt is set. > >> + sock_inuse_add(net, 1); >> + } ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-02 1:40 ` [PATCH v2] " Tetsuo Handa 2022-05-02 14:12 ` Haakon Bugge @ 2022-05-03 9:02 ` Paolo Abeni 2022-05-03 9:56 ` Tetsuo Handa ` (2 more replies) 2022-05-03 11:40 ` patchwork-bot+netdevbpf 2 siblings, 3 replies; 30+ messages in thread From: Paolo Abeni @ 2022-05-03 9:02 UTC (permalink / raw) To: Tetsuo Handa, Eric Dumazet Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, OFED mailing list Hello, On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote: > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > for TCP socket used by RDS is accessing sock_net() without acquiring a > refcount on net namespace. Since TCP's retransmission can happen after > a process which created net namespace terminated, we need to explicitly > acquire a refcount. > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > --- > Changes in v2: > Add Fixes: tag. > Move to inside lock_sock() section. > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > to RDS") was added to 2.6.32. > > net/rds/tcp.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > index 5327d130c4b5..2f638f8b7b1e 100644 > --- a/net/rds/tcp.c > +++ b/net/rds/tcp.c > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > tcp_sock_set_nodelay(sock->sk); > lock_sock(sk); > + /* TCP timer functions might access net namespace even after > + * a process which created this net namespace terminated. > + */ > + if (!sk->sk_net_refcnt) { > + sk->sk_net_refcnt = 1; > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > + sock_inuse_add(net, 1); > + } > if (rtn->sndbuf_size > 0) { > sk->sk_sndbuf = rtn->sndbuf_size; > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; This looks equivalent to the fix presented here: https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ but the latter looks a more generic solution. @Tetsuo could you please test the above in your setup? Thanks! Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 9:02 ` Paolo Abeni @ 2022-05-03 9:56 ` Tetsuo Handa 2022-05-03 11:10 ` Paolo Abeni 2022-05-03 13:27 ` David Laight 2022-05-03 13:45 ` Eric Dumazet 2 siblings, 1 reply; 30+ messages in thread From: Tetsuo Handa @ 2022-05-03 9:56 UTC (permalink / raw) To: Paolo Abeni, Eric Dumazet Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, OFED mailing list On 2022/05/03 18:02, Paolo Abeni wrote: > This looks equivalent to the fix presented here: Not equivalent. > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > but the latter looks a more generic solution. @Tetsuo could you please > test the above in your setup? I already tested that fix, and the result was https://lore.kernel.org/all/78cdbf25-4511-a567-bb09-0c07edae8b50@I-love.SAKURA.ne.jp/ . ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 9:56 ` Tetsuo Handa @ 2022-05-03 11:10 ` Paolo Abeni 0 siblings, 0 replies; 30+ messages in thread From: Paolo Abeni @ 2022-05-03 11:10 UTC (permalink / raw) To: Tetsuo Handa, Eric Dumazet Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, OFED mailing list On Tue, 2022-05-03 at 18:56 +0900, Tetsuo Handa wrote: > On 2022/05/03 18:02, Paolo Abeni wrote: > > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > > > but the latter looks a more generic solution. @Tetsuo could you please > > test the above in your setup? > > I already tested that fix, and the result was > https://lore.kernel.org/all/78cdbf25-4511-a567-bb09-0c07edae8b50@I-love.SAKURA.ne.jp/ . Thanks, I somewhat missed that reply. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 9:02 ` Paolo Abeni 2022-05-03 9:56 ` Tetsuo Handa @ 2022-05-03 13:27 ` David Laight 2022-05-03 13:43 ` Eric Dumazet 2022-05-03 13:45 ` Eric Dumazet 2 siblings, 1 reply; 30+ messages in thread From: David Laight @ 2022-05-03 13:27 UTC (permalink / raw) To: 'Paolo Abeni', Tetsuo Handa, Eric Dumazet Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, OFED mailing list From: Paolo Abeni > Sent: 03 May 2022 10:03 > > Hello, > > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote: > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > refcount on net namespace. Since TCP's retransmission can happen after > > a process which created net namespace terminated, we need to explicitly > > acquire a refcount. > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a > kernel socket") > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > --- > > Changes in v2: > > Add Fixes: tag. > > Move to inside lock_sock() section. > > > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > > to RDS") was added to 2.6.32. > > > > net/rds/tcp.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > > index 5327d130c4b5..2f638f8b7b1e 100644 > > --- a/net/rds/tcp.c > > +++ b/net/rds/tcp.c > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > > > tcp_sock_set_nodelay(sock->sk); > > lock_sock(sk); > > + /* TCP timer functions might access net namespace even after > > + * a process which created this net namespace terminated. > > + */ > > + if (!sk->sk_net_refcnt) { > > + sk->sk_net_refcnt = 1; > > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > > + sock_inuse_add(net, 1); > > + } > > if (rtn->sndbuf_size > 0) { > > sk->sk_sndbuf = rtn->sndbuf_size; > > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > > This looks equivalent to the fix presented here: > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > but the latter looks a more generic solution. @Tetsuo could you please > test the above in your setup? Wouldn't a more generic solution be to add a flag to sock_create_kern() so that it acquires a reference to the namespace? This could be a bit on one of the existing parameters - like SOCK_NONBLOCK. I've a driver that uses __sock_create() in order to get that reference. I'm pretty sure the extra 'security' check will never fail. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 13:27 ` David Laight @ 2022-05-03 13:43 ` Eric Dumazet 2022-05-03 14:25 ` David Laight 0 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2022-05-03 13:43 UTC (permalink / raw) To: David Laight Cc: Paolo Abeni, Tetsuo Handa, Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, OFED mailing list On Tue, May 3, 2022 at 6:27 AM David Laight <David.Laight@aculab.com> wrote: > > From: Paolo Abeni > > Sent: 03 May 2022 10:03 > > > > Hello, > > > > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote: > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > > refcount on net namespace. Since TCP's retransmission can happen after > > > a process which created net namespace terminated, we need to explicitly > > > acquire a refcount. > > > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a > > kernel socket") > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > --- > > > Changes in v2: > > > Add Fixes: tag. > > > Move to inside lock_sock() section. > > > > > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > > > to RDS") was added to 2.6.32. > > > > > > net/rds/tcp.c | 8 ++++++++ > > > 1 file changed, 8 insertions(+) > > > > > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > > > index 5327d130c4b5..2f638f8b7b1e 100644 > > > --- a/net/rds/tcp.c > > > +++ b/net/rds/tcp.c > > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > > > > > tcp_sock_set_nodelay(sock->sk); > > > lock_sock(sk); > > > + /* TCP timer functions might access net namespace even after > > > + * a process which created this net namespace terminated. > > > + */ > > > + if (!sk->sk_net_refcnt) { > > > + sk->sk_net_refcnt = 1; > > > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > > > + sock_inuse_add(net, 1); > > > + } > > > if (rtn->sndbuf_size > 0) { > > > sk->sk_sndbuf = rtn->sndbuf_size; > > > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > > > > This looks equivalent to the fix presented here: > > > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > > > but the latter looks a more generic solution. @Tetsuo could you please > > test the above in your setup? > > Wouldn't a more generic solution be to add a flag to sock_create_kern() > so that it acquires a reference to the namespace? > This could be a bit on one of the existing parameters - like SOCK_NONBLOCK. > > I've a driver that uses __sock_create() in order to get that reference. > I'm pretty sure the extra 'security' check will never fail. > This would be silly really. Definition of a 'kernel socket' is that it does not hold a reference to the namespace. (otherwise a netns could not be destroyed by user space) A kernel layer using kernel sockets needs to properly dismantle them when a namespace is destroyed. In the RDS case, the socket was a user socket, or RDS lacked proper tracking of all the sockets so that they can be dismantled properly. ^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 13:43 ` Eric Dumazet @ 2022-05-03 14:25 ` David Laight 0 siblings, 0 replies; 30+ messages in thread From: David Laight @ 2022-05-03 14:25 UTC (permalink / raw) To: 'Eric Dumazet' Cc: Paolo Abeni, Tetsuo Handa, Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, OFED mailing list From: Eric Dumazet > Sent: 03 May 2022 14:43 > > On Tue, May 3, 2022 at 6:27 AM David Laight <David.Laight@aculab.com> wrote: > > > > From: Paolo Abeni > > > Sent: 03 May 2022 10:03 > > > > > > Hello, > > > > > > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote: > > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > > > refcount on net namespace. Since TCP's retransmission can happen after > > > > a process which created net namespace terminated, we need to explicitly > > > > acquire a refcount. > > > > > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel > sockets.") > > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a > > > kernel socket") > > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > --- > > > > Changes in v2: > > > > Add Fixes: tag. > > > > Move to inside lock_sock() section. > > > > > > > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > > > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > > > > to RDS") was added to 2.6.32. > > > > > > > > net/rds/tcp.c | 8 ++++++++ > > > > 1 file changed, 8 insertions(+) > > > > > > > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > > > > index 5327d130c4b5..2f638f8b7b1e 100644 > > > > --- a/net/rds/tcp.c > > > > +++ b/net/rds/tcp.c > > > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > > > > > > > tcp_sock_set_nodelay(sock->sk); > > > > lock_sock(sk); > > > > + /* TCP timer functions might access net namespace even after > > > > + * a process which created this net namespace terminated. > > > > + */ > > > > + if (!sk->sk_net_refcnt) { > > > > + sk->sk_net_refcnt = 1; > > > > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > > > > + sock_inuse_add(net, 1); > > > > + } > > > > if (rtn->sndbuf_size > 0) { > > > > sk->sk_sndbuf = rtn->sndbuf_size; > > > > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > > > > > > This looks equivalent to the fix presented here: > > > > > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > > > > > but the latter looks a more generic solution. @Tetsuo could you please > > > test the above in your setup? > > > > Wouldn't a more generic solution be to add a flag to sock_create_kern() > > so that it acquires a reference to the namespace? > > This could be a bit on one of the existing parameters - like SOCK_NONBLOCK. > > > > I've a driver that uses __sock_create() in order to get that reference. > > I'm pretty sure the extra 'security' check will never fail. > > > > This would be silly really. > > Definition of a 'kernel socket' is that it does not hold a reference > to the namespace. > (otherwise a netns could not be destroyed by user space) > > A kernel layer using kernel sockets needs to properly dismantle them > when a namespace is destroyed. I think it depends on why the driver is using a socket. If the driver is a 'user' of a TCP connection that happens to be is a kernel driver then holding the a reference to the namespace is no different to an application socket holding a reference. An example might be nfs/tcp - you need to unmount the filesystem before you can delete the namespace. OTOH if part of a protocol stack is using a socket for internal calls (I think I've seen routing sockets used that way) then the presence of the socket probably shouldn't stop the namespace being deleted. Listening sockets are a slight problem - probably for userspace as well. It would be nicer to be able to get TCP (etc) to error out listening sockets if they are the only thing stopping a namespace being deleted. > In the RDS case, the socket was a user socket, or RDS lacked proper > tracking of all the sockets > so that they can be dismantled properly. I think they probably are sockets created in order act on requests from applications. I think they should have the same effect on namespaces as a direct user socket - you can't delete the socket while the connection is active. Kill all the relevant processes, tell the driver to stop, and you can delete the namespace. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 9:02 ` Paolo Abeni 2022-05-03 9:56 ` Tetsuo Handa 2022-05-03 13:27 ` David Laight @ 2022-05-03 13:45 ` Eric Dumazet 2022-05-03 14:08 ` Tetsuo Handa 2 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2022-05-03 13:45 UTC (permalink / raw) To: Paolo Abeni Cc: Tetsuo Handa, Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, OFED mailing list On Tue, May 3, 2022 at 2:02 AM Paolo Abeni <pabeni@redhat.com> wrote: > > Hello, > > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote: > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > refcount on net namespace. Since TCP's retransmission can happen after > > a process which created net namespace terminated, we need to explicitly > > acquire a refcount. > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > --- > > Changes in v2: > > Add Fixes: tag. > > Move to inside lock_sock() section. > > > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > > to RDS") was added to 2.6.32. > > > > net/rds/tcp.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > > index 5327d130c4b5..2f638f8b7b1e 100644 > > --- a/net/rds/tcp.c > > +++ b/net/rds/tcp.c > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > > > tcp_sock_set_nodelay(sock->sk); > > lock_sock(sk); > > + /* TCP timer functions might access net namespace even after > > + * a process which created this net namespace terminated. > > + */ > > + if (!sk->sk_net_refcnt) { > > + sk->sk_net_refcnt = 1; > > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > > + sock_inuse_add(net, 1); > > + } > > if (rtn->sndbuf_size > 0) { > > sk->sk_sndbuf = rtn->sndbuf_size; > > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > > This looks equivalent to the fix presented here: > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ I think this is still needed for layers (NFS ?) that dismantle their TCP sockets whenever a netns is dismantled. But RDS case was different, only the listener is a kernel socket. > > but the latter looks a more generic solution. @Tetsuo could you please > test the above in your setup? > > Thanks! > > Paolo > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 13:45 ` Eric Dumazet @ 2022-05-03 14:08 ` Tetsuo Handa 0 siblings, 0 replies; 30+ messages in thread From: Tetsuo Handa @ 2022-05-03 14:08 UTC (permalink / raw) To: Eric Dumazet, Paolo Abeni Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, OFED mailing list On 2022/05/03 22:45, Eric Dumazet wrote: >> This looks equivalent to the fix presented here: >> >> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ I retested the fix above using unshare -n sh -c ' ip link set lo up iptables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP ip6tables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP telnet 127.0.0.1 16385 dmesg -c netstat -tanpe' < /dev/null as a test case, but it seems racy; sometimes timer function is called again and crashes. [ 426.086565][ C2] general protection fault, probably for non-canonical address 0x6b6af3ebcc3b6bc3: 0000 [#1] PREEMPT SMP KASAN [ 426.096339][ C2] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.18.0-rc5-dirty #807 [ 426.103769][ C2] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 426.111851][ C2] RIP: 0010:__tcp_transmit_skb+0xe72/0x1b80 [ 426.117512][ C2] Code: e8 b3 ea dc fd 48 8d 7d 30 45 0f b7 77 30 e8 95 ec dc fd 48 8b 5d 30 48 8d bb b8 02 00 00 e8 85 ec dc fd 48 8b 83 b8 02 00 00 <65> 4c 01 70 58 e9 67 fd ff ff e8 ef 56 ac fd 48 8d bd d0 09 00 00 [ 426.124692][ C2] RSP: 0018:ffff888060d09ac8 EFLAGS: 00010246 [ 426.126845][ C2] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8880145c8000 RCX: ffffffff838cc28b [ 426.129616][ C2] RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffff8880145c82b8 [ 426.132374][ C2] RBP: ffff8880129f8000 R08: 0000000000000000 R09: 0000000000000007 [ 426.135077][ C2] R10: ffffffff838cbfd4 R11: 0000000000000001 R12: ffff8880129f8760 [ 426.137793][ C2] R13: ffff88800f6e0118 R14: 0000000000000001 R15: ffff88800f6e00e8 [ 426.140489][ C2] FS: 0000000000000000(0000) GS:ffff888060d00000(0000) knlGS:0000000000000000 [ 426.143525][ C2] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 426.145792][ C2] CR2: 000055b5bb0adabc CR3: 000000000e003000 CR4: 00000000000506e0 [ 426.148509][ C2] Call Trace: [ 426.149442][ C2] <IRQ> [ 426.150183][ C2] ? __tcp_select_window+0x710/0x710 [ 426.151457][ C2] ? __sanitizer_cov_trace_cmp4+0x1c/0x70 [ 426.153007][ C2] ? tcp_current_mss+0x165/0x280 [ 426.154245][ C2] ? tcp_trim_head+0x300/0x300 [ 426.155396][ C2] ? find_held_lock+0x85/0xa0 [ 426.156734][ C2] ? mark_held_locks+0x65/0x90 [ 426.157967][ C2] tcp_write_wakeup+0x2e2/0x340 [ 426.159149][ C2] tcp_send_probe0+0x2a/0x2c0 [ 426.160368][ C2] tcp_write_timer_handler+0x5cb/0x670 [ 426.161740][ C2] tcp_write_timer+0x86/0x250 [ 426.162896][ C2] ? tcp_write_timer_handler+0x670/0x670 [ 426.164285][ C2] call_timer_fn+0x15d/0x5f0 [ 426.165481][ C2] ? add_timer_on+0x2e0/0x2e0 [ 426.166667][ C2] ? lock_downgrade+0x3c0/0x3c0 [ 426.167921][ C2] ? mark_held_locks+0x24/0x90 [ 426.169263][ C2] ? _raw_spin_unlock_irq+0x1f/0x40 [ 426.170564][ C2] ? tcp_write_timer_handler+0x670/0x670 [ 426.171920][ C2] __run_timers.part.0+0x523/0x740 [ 426.173181][ C2] ? call_timer_fn+0x5f0/0x5f0 [ 426.174321][ C2] ? pvclock_clocksource_read+0xdc/0x1a0 [ 426.175655][ C2] run_timer_softirq+0x66/0xe0 [ 426.176825][ C2] __do_softirq+0x1c2/0x670 [ 426.177944][ C2] __irq_exit_rcu+0xf8/0x140 [ 426.179120][ C2] irq_exit_rcu+0x5/0x20 [ 426.180150][ C2] sysvec_apic_timer_interrupt+0x8e/0xc0 [ 426.181486][ C2] </IRQ> [ 426.182180][ C2] <TASK> [ 426.182845][ C2] asm_sysvec_apic_timer_interrupt+0x12/0x20 > > I think this is still needed for layers (NFS ?) that dismantle their > TCP sockets whenever a netns > is dismantled. But RDS case was different, only the listener is a kernel socket. We can't apply the fix above. I think that the fundamental problem is that we use net->ns.count for both "avoiding use-after-free" purpose and "allowing dismantle from user event" purpose. Why not to use separated counters? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-02 1:40 ` [PATCH v2] " Tetsuo Handa 2022-05-02 14:12 ` Haakon Bugge 2022-05-03 9:02 ` Paolo Abeni @ 2022-05-03 11:40 ` patchwork-bot+netdevbpf 2022-05-03 21:17 ` Eric Dumazet 2 siblings, 1 reply; 30+ messages in thread From: patchwork-bot+netdevbpf @ 2022-05-03 11:40 UTC (permalink / raw) To: Tetsuo Handa Cc: edumazet, santosh.shilimkar, davem, kuba, pabeni, syzbot+694120e1002c117747ed, netdev, syzkaller-bugs, linux-rdma Hello: This patch was applied to netdev/net.git (master) by Paolo Abeni <pabeni@redhat.com>: On Mon, 2 May 2022 10:40:18 +0900 you wrote: > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > for TCP socket used by RDS is accessing sock_net() without acquiring a > refcount on net namespace. Since TCP's retransmission can happen after > a process which created net namespace terminated, we need to explicitly > acquire a refcount. > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > [...] Here is the summary with links: - [v2] net: rds: acquire refcount on TCP sockets https://git.kernel.org/netdev/net/c/3a58f13a881e You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 11:40 ` patchwork-bot+netdevbpf @ 2022-05-03 21:17 ` Eric Dumazet 2022-05-03 22:37 ` Eric Dumazet 2022-05-04 13:09 ` [PATCH v2] net: rds: acquire " Paolo Abeni 0 siblings, 2 replies; 30+ messages in thread From: Eric Dumazet @ 2022-05-03 21:17 UTC (permalink / raw) To: patchwork-bot+netdevbpf Cc: Tetsuo Handa, Santosh Shilimkar, David Miller, Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs, linux-rdma On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote: > > Hello: > > This patch was applied to netdev/net.git (master) > by Paolo Abeni <pabeni@redhat.com>: > > On Mon, 2 May 2022 10:40:18 +0900 you wrote: > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > refcount on net namespace. Since TCP's retransmission can happen after > > a process which created net namespace terminated, we need to explicitly > > acquire a refcount. > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > [...] > > Here is the summary with links: > - [v2] net: rds: acquire refcount on TCP sockets > https://git.kernel.org/netdev/net/c/3a58f13a881e > > You are awesome, thank you! > -- > Deet-doot-dot, I am a bot. > https://korg.docs.kernel.org/patchwork/pwbot.html > > I think we merged this patch too soon. My question is : What prevents rds_tcp_conn_path_connect(), and thus rds_tcp_tune() to be called after the netns refcount already reached 0 ? I guess we can wait for next syzbot report, but I think that get_net() should be replaced by maybe_get_net() ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 21:17 ` Eric Dumazet @ 2022-05-03 22:37 ` Eric Dumazet 2022-05-04 1:04 ` Tetsuo Handa 2022-05-04 13:09 ` [PATCH v2] net: rds: acquire " Paolo Abeni 1 sibling, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2022-05-03 22:37 UTC (permalink / raw) To: patchwork-bot+netdevbpf Cc: Tetsuo Handa, Santosh Shilimkar, David Miller, Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs, linux-rdma On Tue, May 3, 2022 at 2:17 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote: > > > > Hello: > > > > This patch was applied to netdev/net.git (master) > > by Paolo Abeni <pabeni@redhat.com>: > > > > On Mon, 2 May 2022 10:40:18 +0900 you wrote: > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > > refcount on net namespace. Since TCP's retransmission can happen after > > > a process which created net namespace terminated, we need to explicitly > > > acquire a refcount. > > > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > > > [...] > > > > Here is the summary with links: > > - [v2] net: rds: acquire refcount on TCP sockets > > https://git.kernel.org/netdev/net/c/3a58f13a881e > > > > You are awesome, thank you! > > -- > > Deet-doot-dot, I am a bot. > > https://korg.docs.kernel.org/patchwork/pwbot.html > > > > > > I think we merged this patch too soon. > > My question is : What prevents rds_tcp_conn_path_connect(), and thus > rds_tcp_tune() to be called > after the netns refcount already reached 0 ? > > I guess we can wait for next syzbot report, but I think that get_net() > should be replaced > by maybe_get_net() Yes, syzbot was fast to trigger this exact issue: HEAD commit: 3a58f13a net: rds: acquire refcount on TCP sockets git tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master ------------[ cut here ]------------ refcount_t: addition on 0; use-after-free. WARNING: CPU: 1 PID: 6934 at lib/refcount.c:25 refcount_warn_saturate+0x169/0x1e0 lib/refcount.c:25 Modules linked in: CPU: 1 PID: 6934 Comm: kworker/u4:17 Not tainted 5.18.0-rc4-syzkaller-00209-g3a58f13a881e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: krdsd rds_connect_worker RIP: 0010:refcount_warn_saturate+0x169/0x1e0 lib/refcount.c:25 Code: 09 31 ff 89 de e8 f7 b9 81 fd 84 db 0f 85 36 ff ff ff e8 0a b6 81 fd 48 c7 c7 40 eb 26 8a c6 05 75 1f ac 09 01 e8 56 75 2d 05 <0f> 0b e9 17 ff ff ff e8 eb b5 81 fd 0f b6 1d 5a 1f ac 09 31 ff 89 RSP: 0018:ffffc9000b5e7b80 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88807a948000 RSI: ffffffff81600c08 RDI: fffff520016bcf62 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff815fb5de R11: 0000000000000000 R12: ffff888021e69b80 R13: ffff88805bc82a00 R14: ffff888021e69ccc R15: ffff8880741a2900 FS: 0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2cb5c000 CR3: 000000005688f000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __refcount_add include/linux/refcount.h:199 [inline] __refcount_inc include/linux/refcount.h:250 [inline] refcount_inc include/linux/refcount.h:267 [inline] get_net include/net/net_namespace.h:248 [inline] get_net_track include/net/net_namespace.h:334 [inline] rds_tcp_tune+0x5a0/0x5f0 net/rds/tcp.c:503 rds_tcp_conn_path_connect+0x489/0x880 net/rds/tcp_connect.c:127 rds_connect_worker+0x1a5/0x2c0 net/rds/threads.c:176 process_one_work+0x996/0x1610 kernel/workqueue.c:2289 worker_thread+0x665/0x1080 kernel/workqueue.c:2436 kthread+0x2e9/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:298 </TASK> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 22:37 ` Eric Dumazet @ 2022-05-04 1:04 ` Tetsuo Handa 2022-05-04 3:09 ` Eric Dumazet 0 siblings, 1 reply; 30+ messages in thread From: Tetsuo Handa @ 2022-05-04 1:04 UTC (permalink / raw) To: Eric Dumazet, patchwork-bot+netdevbpf Cc: Santosh Shilimkar, David Miller, Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs, linux-rdma On 2022/05/04 7:37, Eric Dumazet wrote: >> I think we merged this patch too soon. >> >> My question is : What prevents rds_tcp_conn_path_connect(), and thus >> rds_tcp_tune() to be called >> after the netns refcount already reached 0 ? >> >> I guess we can wait for next syzbot report, but I think that get_net() >> should be replaced >> by maybe_get_net() > > Yes, syzbot was fast to trigger this exact issue: Does maybe_get_net() help? Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set(). rds_tcp_conn_path_connect() { sock_create_kern(net = rds_conn_net(conn)) { __sock_create(net = rds_conn_net(conn), kern = 1) { err = pf->create(net = rds_conn_net(conn), kern = 1) { // pf->create is either inet_create or inet6_create sk_alloc(net = rds_conn_net(conn), kern = 1) { sk->sk_net_refcnt = kern ? 0 : 1; if (likely(sk->sk_net_refcnt)) { get_net_track(net, &sk->ns_tracker, priority); sock_inuse_add(net, 1); } sock_net_set(sk, net); } } } } rds_tcp_tune() { if (!sk->sk_net_refcnt) { sk->sk_net_refcnt = 1; get_net_track(net, &sk->ns_tracker, GFP_KERNEL); sock_inuse_add(net, 1); } } } "struct rds_connection" needs to hold a ref in order to safely allow rds_tcp_tune() to call maybe_get_net(), which in turn makes pointless to use maybe_get_net() from rds_tcp_tune() because "struct rds_connection" must have a ref. Situation where we are protected by maybe_get_net() is quite limited if long-lived object is not holding a ref. Hmm, can we simply use &init_net instead of rds_conn_net(conn) ? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-04 1:04 ` Tetsuo Handa @ 2022-05-04 3:09 ` Eric Dumazet 2022-05-04 4:58 ` Tetsuo Handa 0 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2022-05-04 3:09 UTC (permalink / raw) To: Tetsuo Handa Cc: patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller, Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs, linux-rdma On Tue, May 3, 2022 at 6:04 PM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > On 2022/05/04 7:37, Eric Dumazet wrote: > >> I think we merged this patch too soon. > >> > >> My question is : What prevents rds_tcp_conn_path_connect(), and thus > >> rds_tcp_tune() to be called > >> after the netns refcount already reached 0 ? > >> > >> I guess we can wait for next syzbot report, but I think that get_net() > >> should be replaced > >> by maybe_get_net() > > > > Yes, syzbot was fast to trigger this exact issue: > > Does maybe_get_net() help? > > Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically > possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d > if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set(). Nope. RDS has an exit() handler called from cleanup_net() (struct pernet_operations)->exit() or exit_batch() : rds_tcp_exit_net() (rds_tcp_kill_sock()) This exit() handler _has_ to remove all known listeners, and definitely cancel work queues (synchronous operation) before the actual "struct net" free can happen later. > > rds_tcp_conn_path_connect() { > sock_create_kern(net = rds_conn_net(conn)) { > __sock_create(net = rds_conn_net(conn), kern = 1) { > err = pf->create(net = rds_conn_net(conn), kern = 1) { > // pf->create is either inet_create or inet6_create > sk_alloc(net = rds_conn_net(conn), kern = 1) { > sk->sk_net_refcnt = kern ? 0 : 1; > if (likely(sk->sk_net_refcnt)) { > get_net_track(net, &sk->ns_tracker, priority); > sock_inuse_add(net, 1); > } > sock_net_set(sk, net); > } > } > } > } > rds_tcp_tune() { > if (!sk->sk_net_refcnt) { > sk->sk_net_refcnt = 1; > get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > sock_inuse_add(net, 1); > } > } > } > > "struct rds_connection" needs to hold a ref in order to safely allow > rds_tcp_tune() to call maybe_get_net(), which in turn makes pointless > to use maybe_get_net() from rds_tcp_tune() because "struct rds_connection" > must have a ref. Situation where we are protected by maybe_get_net() is > quite limited if long-lived object is not holding a ref. > > Hmm, can we simply use &init_net instead of rds_conn_net(conn) ? Only if you plan making RDS unavailable for non init netns. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-04 3:09 ` Eric Dumazet @ 2022-05-04 4:58 ` Tetsuo Handa 2022-05-04 15:15 ` Tetsuo Handa 0 siblings, 1 reply; 30+ messages in thread From: Tetsuo Handa @ 2022-05-04 4:58 UTC (permalink / raw) To: Eric Dumazet Cc: patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller, Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs, linux-rdma On 2022/05/04 12:09, Eric Dumazet wrote: >> Does maybe_get_net() help? >> >> Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically >> possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d >> if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set(). > > Nope. RDS has an exit() handler called from cleanup_net() > > (struct pernet_operations)->exit() or exit_batch() : > rds_tcp_exit_net() (rds_tcp_kill_sock()) Hmm, when put_net() called __put_net(), this "struct net" is chained to cleanup_list. When cleanup_net() is called via net_cleanup_work, rds_tcp_exit_net() is called from ops_exit_list(). Therefore, we can call maybe_get_net() until rds_tcp_exit_net() returns. That's good. > > This exit() handler _has_ to remove all known listeners, and > definitely cancel work queues (synchronous operation) > before the actual "struct net" free can happen later. But in your report, rds_tcp_tune() is called from rds_tcp_conn_path_connect() from rds_connect_worker() via "struct rds_connection"->cp_conn_w work. I can see that rds_tcp_kill_sock() calls rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w), and rds_tcp_listen_stop() calls flush_workqueue(rds_wq) and flush_work(&rtn->rds_tcp_accept_w). But I can't see how rds_tcp_exit_net() synchronously cancels all works associated with "struct rds_conn_path". struct rds_conn_path { struct delayed_work cp_send_w; struct delayed_work cp_recv_w; struct delayed_work cp_conn_w; struct work_struct cp_down_w; } These works are queued to rds_wq, but flush_workqueue() waits for completion only if already queued. What if timer for queue_delayed_work() has not expired, or was about to call queue_delayed_work() ? Is flush_workqueue(rds_wq) sufficient? Anyway, if rds_tcp_kill_sock() can somehow guarantee that all works are completed or cancelled, the fix would look like something below? net/rds/tcp.c | 11 ++++++++--- net/rds/tcp.h | 2 +- net/rds/tcp_connect.c | 5 ++++- net/rds/tcp_listen.c | 5 ++++- 4 files changed, 17 insertions(+), 6 deletions(-) diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 2f638f8b7b1e..8e26bcf02044 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -487,11 +487,11 @@ struct rds_tcp_net { /* All module specific customizations to the RDS-TCP socket should be done in * rds_tcp_tune() and applied after socket creation. */ -void rds_tcp_tune(struct socket *sock) +bool rds_tcp_tune(struct socket *sock) { struct sock *sk = sock->sk; struct net *net = sock_net(sk); - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); + struct rds_tcp_net *rtn; tcp_sock_set_nodelay(sock->sk); lock_sock(sk); @@ -499,10 +499,14 @@ void rds_tcp_tune(struct socket *sock) * a process which created this net namespace terminated. */ if (!sk->sk_net_refcnt) { + if (!maybe_get_net(net)) { + release_sock(sk); + return false; + } sk->sk_net_refcnt = 1; - get_net_track(net, &sk->ns_tracker, GFP_KERNEL); sock_inuse_add(net, 1); } + rtn = net_generic(net, rds_tcp_netid); if (rtn->sndbuf_size > 0) { sk->sk_sndbuf = rtn->sndbuf_size; sk->sk_userlocks |= SOCK_SNDBUF_LOCK; @@ -512,6 +516,7 @@ void rds_tcp_tune(struct socket *sock) sk->sk_userlocks |= SOCK_RCVBUF_LOCK; } release_sock(sk); + return true; } static void rds_tcp_accept_worker(struct work_struct *work) diff --git a/net/rds/tcp.h b/net/rds/tcp.h index dc8d745d6857..f8b5930d7b34 100644 --- a/net/rds/tcp.h +++ b/net/rds/tcp.h @@ -49,7 +49,7 @@ struct rds_tcp_statistics { }; /* tcp.c */ -void rds_tcp_tune(struct socket *sock); +bool rds_tcp_tune(struct socket *sock); void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp); void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp); void rds_tcp_restore_callbacks(struct socket *sock, diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c index 5461d77fff4f..f0c477c5d1db 100644 --- a/net/rds/tcp_connect.c +++ b/net/rds/tcp_connect.c @@ -124,7 +124,10 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp) if (ret < 0) goto out; - rds_tcp_tune(sock); + if (!rds_tcp_tune(sock)) { + ret = -EINVAL; + goto out; + } if (isv6) { sin6.sin6_family = AF_INET6; diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c index 09cadd556d1e..7edf2e69d3fe 100644 --- a/net/rds/tcp_listen.c +++ b/net/rds/tcp_listen.c @@ -133,7 +133,10 @@ int rds_tcp_accept_one(struct socket *sock) __module_get(new_sock->ops->owner); rds_tcp_keepalive(new_sock); - rds_tcp_tune(new_sock); + if (!rds_tcp_tune(new_sock)) { + ret = -EINVAL; + goto out; + } inet = inet_sk(new_sock->sk); -- 2.34.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-04 4:58 ` Tetsuo Handa @ 2022-05-04 15:15 ` Tetsuo Handa 2022-05-05 0:45 ` [PATCH] net: rds: use maybe_get_net() when acquiring " Tetsuo Handa 0 siblings, 1 reply; 30+ messages in thread From: Tetsuo Handa @ 2022-05-04 15:15 UTC (permalink / raw) To: Eric Dumazet, Paolo Abeni Cc: patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, linux-rdma On 2022/05/04 13:58, Tetsuo Handa wrote: > On 2022/05/04 12:09, Eric Dumazet wrote: >> This exit() handler _has_ to remove all known listeners, and >> definitely cancel work queues (synchronous operation) >> before the actual "struct net" free can happen later. > > But in your report, rds_tcp_tune() is called from rds_tcp_conn_path_connect() from > rds_connect_worker() via "struct rds_connection"->cp_conn_w work. I can see that > rds_tcp_kill_sock() calls rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w), and > rds_tcp_listen_stop() calls flush_workqueue(rds_wq) and flush_work(&rtn->rds_tcp_accept_w). > > But I can't see how rds_tcp_exit_net() synchronously cancels all works associated > with "struct rds_conn_path". > > struct rds_conn_path { > struct delayed_work cp_send_w; > struct delayed_work cp_recv_w; > struct delayed_work cp_conn_w; > struct work_struct cp_down_w; > } > > These works are queued to rds_wq, but flush_workqueue() waits for completion only > if already queued. What if timer for queue_delayed_work() has not expired, or was > about to call queue_delayed_work() ? Is flush_workqueue(rds_wq) sufficient? rds_tcp_tune+0x5a0/0x5f0 net/rds/tcp.c:503 rds_tcp_conn_path_connect+0x489/0x880 net/rds/tcp_connect.c:127 rds_connect_worker+0x1a5/0x2c0 net/rds/threads.c:176 process_one_work+0x996/0x1610 kernel/workqueue.c:2289 rds_tcp_conn_path_connect is referenced by "struct rds_transport rds_tcp_transport"->conn_path_connect. It is invoked by ret = conn->c_trans->conn_path_connect(cp) in rds_connect_worker(). rds_connect_worker is referenced by "struct rds_conn_path"->cp_conn_w via INIT_DELAYED_WORK(). queue_delayed_work(rds_wq, &cp->cp_conn_w, *) is called by rds_queue_reconnect() or rds_conn_path_connect_if_down(). If rds_conn_path_connect_if_down() were called from rds_tcp_accept_one_path() from rds_tcp_accept_one(), rds_tcp_tune() from rds_tcp_accept_one() was already called before rds_tcp_tune() from rds_tcp_conn_path_connect() is called. Since the addition on 0 was not reported at rds_tcp_tune() from rds_tcp_accept_one(), what Eric is reporting cannot be from rds_tcp_accept_one() from rds_tcp_accept_worker(). Despite rds_tcp_kill_sock() sets rtn->rds_tcp_listen_sock = NULL and waits for rds_tcp_accept_one() from rds_tcp_accept_worker() to complete using flush_workqueue(rds_wq), what Eric is reporting is different from what syzbot+694120e1002c117747ed was reporting. > > Anyway, if rds_tcp_kill_sock() can somehow guarantee that all works are completed > or cancelled, the fix would look like something below? I think it is OK to apply below diff in order to avoid addition on 0 problem, but it is not proven that kmem_cache_free() is not yet called. What should we do? > > net/rds/tcp.c | 11 ++++++++--- > net/rds/tcp.h | 2 +- > net/rds/tcp_connect.c | 5 ++++- > net/rds/tcp_listen.c | 5 ++++- > 4 files changed, 17 insertions(+), 6 deletions(-) > ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets 2022-05-04 15:15 ` Tetsuo Handa @ 2022-05-05 0:45 ` Tetsuo Handa 2022-05-05 0:53 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Tetsuo Handa @ 2022-05-05 0:45 UTC (permalink / raw) To: Eric Dumazet, Paolo Abeni Cc: patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, linux-rdma Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for delayed works queued in rds_wq might be invoked after a net namespace's refcount already reached 0. Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq), it is guaranteed that we can instead use maybe_get_net() from delayed work functions until rds_tcp_exit_net() returns. Note that I'm not convinced that all works which might access a net namespace are already queued in rds_wq by the moment rds_tcp_exit_net() calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net() will fail to wait for work functions, and kmem_cache_free() could be called from net_free() before maybe_get_net() is called from rds_tcp_tune(). Reported-by: Eric Dumazet <edumazet@google.com> Fixes: 3a58f13a881ed351 ("net: rds: acquire refcount on TCP sockets") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> --- net/rds/tcp.c | 11 ++++++++--- net/rds/tcp.h | 2 +- net/rds/tcp_connect.c | 5 ++++- net/rds/tcp_listen.c | 5 ++++- 4 files changed, 17 insertions(+), 6 deletions(-) diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 2f638f8b7b1e..8e26bcf02044 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -487,11 +487,11 @@ struct rds_tcp_net { /* All module specific customizations to the RDS-TCP socket should be done in * rds_tcp_tune() and applied after socket creation. */ -void rds_tcp_tune(struct socket *sock) +bool rds_tcp_tune(struct socket *sock) { struct sock *sk = sock->sk; struct net *net = sock_net(sk); - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); + struct rds_tcp_net *rtn; tcp_sock_set_nodelay(sock->sk); lock_sock(sk); @@ -499,10 +499,14 @@ void rds_tcp_tune(struct socket *sock) * a process which created this net namespace terminated. */ if (!sk->sk_net_refcnt) { + if (!maybe_get_net(net)) { + release_sock(sk); + return false; + } sk->sk_net_refcnt = 1; - get_net_track(net, &sk->ns_tracker, GFP_KERNEL); sock_inuse_add(net, 1); } + rtn = net_generic(net, rds_tcp_netid); if (rtn->sndbuf_size > 0) { sk->sk_sndbuf = rtn->sndbuf_size; sk->sk_userlocks |= SOCK_SNDBUF_LOCK; @@ -512,6 +516,7 @@ void rds_tcp_tune(struct socket *sock) sk->sk_userlocks |= SOCK_RCVBUF_LOCK; } release_sock(sk); + return true; } static void rds_tcp_accept_worker(struct work_struct *work) diff --git a/net/rds/tcp.h b/net/rds/tcp.h index dc8d745d6857..f8b5930d7b34 100644 --- a/net/rds/tcp.h +++ b/net/rds/tcp.h @@ -49,7 +49,7 @@ struct rds_tcp_statistics { }; /* tcp.c */ -void rds_tcp_tune(struct socket *sock); +bool rds_tcp_tune(struct socket *sock); void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp); void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp); void rds_tcp_restore_callbacks(struct socket *sock, diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c index 5461d77fff4f..f0c477c5d1db 100644 --- a/net/rds/tcp_connect.c +++ b/net/rds/tcp_connect.c @@ -124,7 +124,10 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp) if (ret < 0) goto out; - rds_tcp_tune(sock); + if (!rds_tcp_tune(sock)) { + ret = -EINVAL; + goto out; + } if (isv6) { sin6.sin6_family = AF_INET6; diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c index 09cadd556d1e..7edf2e69d3fe 100644 --- a/net/rds/tcp_listen.c +++ b/net/rds/tcp_listen.c @@ -133,7 +133,10 @@ int rds_tcp_accept_one(struct socket *sock) __module_get(new_sock->ops->owner); rds_tcp_keepalive(new_sock); - rds_tcp_tune(new_sock); + if (!rds_tcp_tune(new_sock)) { + ret = -EINVAL; + goto out; + } inet = inet_sk(new_sock->sk); -- 2.34.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets 2022-05-05 0:45 ` [PATCH] net: rds: use maybe_get_net() when acquiring " Tetsuo Handa @ 2022-05-05 0:53 ` Eric Dumazet 2022-05-05 1:04 ` Jakub Kicinski 2022-05-05 1:53 ` [PATCH net v2] " Tetsuo Handa 2 siblings, 0 replies; 30+ messages in thread From: Eric Dumazet @ 2022-05-05 0:53 UTC (permalink / raw) To: Tetsuo Handa Cc: Paolo Abeni, patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, linux-rdma On Wed, May 4, 2022 at 5:45 PM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for > delayed works queued in rds_wq might be invoked after a net namespace's > refcount already reached 0. > > Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq), > it is guaranteed that we can instead use maybe_get_net() from delayed work > functions until rds_tcp_exit_net() returns. > > Note that I'm not convinced that all works which might access a net > namespace are already queued in rds_wq by the moment rds_tcp_exit_net() > calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net() > will fail to wait for work functions, and kmem_cache_free() could be > called from net_free() before maybe_get_net() is called from > rds_tcp_tune(). > > Reported-by: Eric Dumazet <edumazet@google.com> > Fixes: 3a58f13a881ed351 ("net: rds: acquire refcount on TCP sockets") > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > --- > net/rds/tcp.c | 11 ++++++++--- > net/rds/tcp.h | 2 +- > net/rds/tcp_connect.c | 5 ++++- > net/rds/tcp_listen.c | 5 ++++- > 4 files changed, 17 insertions(+), 6 deletions(-) > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > index 2f638f8b7b1e..8e26bcf02044 100644 > --- a/net/rds/tcp.c > +++ b/net/rds/tcp.c > @@ -487,11 +487,11 @@ struct rds_tcp_net { > /* All module specific customizations to the RDS-TCP socket should be done in > * rds_tcp_tune() and applied after socket creation. > */ > -void rds_tcp_tune(struct socket *sock) > +bool rds_tcp_tune(struct socket *sock) > { > struct sock *sk = sock->sk; > struct net *net = sock_net(sk); > - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); > + struct rds_tcp_net *rtn; > > tcp_sock_set_nodelay(sock->sk); > lock_sock(sk); > @@ -499,10 +499,14 @@ void rds_tcp_tune(struct socket *sock) > * a process which created this net namespace terminated. > */ > if (!sk->sk_net_refcnt) { > + if (!maybe_get_net(net)) { > + release_sock(sk); > + return false; > + } > sk->sk_net_refcnt = 1; > - get_net_track(net, &sk->ns_tracker, GFP_KERNEL); This could use: netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL); > sock_inuse_add(net, 1); > } > + rtn = net_generic(net, rds_tcp_netid); > if (rtn->sndbuf_size > 0) { > sk->sk_sndbuf = rtn->sndbuf_size; > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > @@ -512,6 +516,7 @@ void rds_tcp_tune(struct socket *sock) > sk->sk_userlocks |= SOCK_RCVBUF_LOCK; > } > release_sock(sk); > + return true; > } > Otherwise, patch looks good to me, thanks. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets 2022-05-05 0:45 ` [PATCH] net: rds: use maybe_get_net() when acquiring " Tetsuo Handa 2022-05-05 0:53 ` Eric Dumazet @ 2022-05-05 1:04 ` Jakub Kicinski 2022-05-05 1:53 ` [PATCH net v2] " Tetsuo Handa 2 siblings, 0 replies; 30+ messages in thread From: Jakub Kicinski @ 2022-05-05 1:04 UTC (permalink / raw) To: Tetsuo Handa Cc: Eric Dumazet, Paolo Abeni, patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller, syzbot, netdev, syzkaller-bugs, linux-rdma On Thu, 5 May 2022 09:45:49 +0900 Tetsuo Handa wrote: > Subject: [PATCH] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets Please tag the next version as [PATCH net v2], and make sure it applies cleanly on top of net/master, 'cause reportedly this one didn't? https://patchwork.kernel.org/project/netdevbpf/patch/63dab11e-2aeb-5608-6dcb-6ebc3e98056e@I-love.SAKURA.ne.jp/ ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH net v2] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets 2022-05-05 0:45 ` [PATCH] net: rds: use maybe_get_net() when acquiring " Tetsuo Handa 2022-05-05 0:53 ` Eric Dumazet 2022-05-05 1:04 ` Jakub Kicinski @ 2022-05-05 1:53 ` Tetsuo Handa 2022-05-05 19:13 ` Eric Dumazet 2022-05-06 1:20 ` patchwork-bot+netdevbpf 2 siblings, 2 replies; 30+ messages in thread From: Tetsuo Handa @ 2022-05-05 1:53 UTC (permalink / raw) To: Eric Dumazet, Paolo Abeni Cc: patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, linux-rdma Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for delayed works queued in rds_wq might be invoked after a net namespace's refcount already reached 0. Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq), it is guaranteed that we can instead use maybe_get_net() from delayed work functions until rds_tcp_exit_net() returns. Note that I'm not convinced that all works which might access a net namespace are already queued in rds_wq by the moment rds_tcp_exit_net() calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net() will fail to wait for work functions, and kmem_cache_free() could be called from net_free() before maybe_get_net() is called from rds_tcp_tune(). Reported-by: Eric Dumazet <edumazet@google.com> Fixes: 3a58f13a881ed351 ("net: rds: acquire refcount on TCP sockets") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> --- Changes in v2: Add netns_tracker_alloc(). net/rds/tcp.c | 12 +++++++++--- net/rds/tcp.h | 2 +- net/rds/tcp_connect.c | 5 ++++- net/rds/tcp_listen.c | 5 ++++- 4 files changed, 18 insertions(+), 6 deletions(-) diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 2f638f8b7b1e..73ee2771093d 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -487,11 +487,11 @@ struct rds_tcp_net { /* All module specific customizations to the RDS-TCP socket should be done in * rds_tcp_tune() and applied after socket creation. */ -void rds_tcp_tune(struct socket *sock) +bool rds_tcp_tune(struct socket *sock) { struct sock *sk = sock->sk; struct net *net = sock_net(sk); - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); + struct rds_tcp_net *rtn; tcp_sock_set_nodelay(sock->sk); lock_sock(sk); @@ -499,10 +499,15 @@ void rds_tcp_tune(struct socket *sock) * a process which created this net namespace terminated. */ if (!sk->sk_net_refcnt) { + if (!maybe_get_net(net)) { + release_sock(sk); + return false; + } sk->sk_net_refcnt = 1; - get_net_track(net, &sk->ns_tracker, GFP_KERNEL); + netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL); sock_inuse_add(net, 1); } + rtn = net_generic(net, rds_tcp_netid); if (rtn->sndbuf_size > 0) { sk->sk_sndbuf = rtn->sndbuf_size; sk->sk_userlocks |= SOCK_SNDBUF_LOCK; @@ -512,6 +517,7 @@ void rds_tcp_tune(struct socket *sock) sk->sk_userlocks |= SOCK_RCVBUF_LOCK; } release_sock(sk); + return true; } static void rds_tcp_accept_worker(struct work_struct *work) diff --git a/net/rds/tcp.h b/net/rds/tcp.h index dc8d745d6857..f8b5930d7b34 100644 --- a/net/rds/tcp.h +++ b/net/rds/tcp.h @@ -49,7 +49,7 @@ struct rds_tcp_statistics { }; /* tcp.c */ -void rds_tcp_tune(struct socket *sock); +bool rds_tcp_tune(struct socket *sock); void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp); void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp); void rds_tcp_restore_callbacks(struct socket *sock, diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c index 5461d77fff4f..f0c477c5d1db 100644 --- a/net/rds/tcp_connect.c +++ b/net/rds/tcp_connect.c @@ -124,7 +124,10 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp) if (ret < 0) goto out; - rds_tcp_tune(sock); + if (!rds_tcp_tune(sock)) { + ret = -EINVAL; + goto out; + } if (isv6) { sin6.sin6_family = AF_INET6; diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c index 09cadd556d1e..7edf2e69d3fe 100644 --- a/net/rds/tcp_listen.c +++ b/net/rds/tcp_listen.c @@ -133,7 +133,10 @@ int rds_tcp_accept_one(struct socket *sock) __module_get(new_sock->ops->owner); rds_tcp_keepalive(new_sock); - rds_tcp_tune(new_sock); + if (!rds_tcp_tune(new_sock)) { + ret = -EINVAL; + goto out; + } inet = inet_sk(new_sock->sk); -- 2.34.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH net v2] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets 2022-05-05 1:53 ` [PATCH net v2] " Tetsuo Handa @ 2022-05-05 19:13 ` Eric Dumazet 2022-05-06 1:20 ` patchwork-bot+netdevbpf 1 sibling, 0 replies; 30+ messages in thread From: Eric Dumazet @ 2022-05-05 19:13 UTC (permalink / raw) To: Tetsuo Handa Cc: Paolo Abeni, patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, linux-rdma On Wed, May 4, 2022 at 6:54 PM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for > delayed works queued in rds_wq might be invoked after a net namespace's > refcount already reached 0. > > Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq), > it is guaranteed that we can instead use maybe_get_net() from delayed work > functions until rds_tcp_exit_net() returns. > > Note that I'm not convinced that all works which might access a net > namespace are already queued in rds_wq by the moment rds_tcp_exit_net() > calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net() > will fail to wait for work functions, and kmem_cache_free() could be > called from net_free() before maybe_get_net() is called from > rds_tcp_tune(). > > Reported-by: Eric Dumazet <edumazet@google.com> > Fixes: 3a58f13a881ed351 ("net: rds: acquire refcount on TCP sockets") > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > --- > Reviewed-by: Eric Dumazet <edumazet@google.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net v2] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets 2022-05-05 1:53 ` [PATCH net v2] " Tetsuo Handa 2022-05-05 19:13 ` Eric Dumazet @ 2022-05-06 1:20 ` patchwork-bot+netdevbpf 1 sibling, 0 replies; 30+ messages in thread From: patchwork-bot+netdevbpf @ 2022-05-06 1:20 UTC (permalink / raw) To: Tetsuo Handa Cc: edumazet, pabeni, patchwork-bot+netdevbpf, santosh.shilimkar, davem, kuba, syzbot+694120e1002c117747ed, netdev, syzkaller-bugs, linux-rdma Hello: This patch was applied to netdev/net.git (master) by Jakub Kicinski <kuba@kernel.org>: On Thu, 5 May 2022 10:53:53 +0900 you wrote: > Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for > delayed works queued in rds_wq might be invoked after a net namespace's > refcount already reached 0. > > Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq), > it is guaranteed that we can instead use maybe_get_net() from delayed work > functions until rds_tcp_exit_net() returns. > > [...] Here is the summary with links: - [net,v2] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets https://git.kernel.org/netdev/net/c/6997fbd7a3da You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-03 21:17 ` Eric Dumazet 2022-05-03 22:37 ` Eric Dumazet @ 2022-05-04 13:09 ` Paolo Abeni 2022-05-04 13:25 ` Eric Dumazet 1 sibling, 1 reply; 30+ messages in thread From: Paolo Abeni @ 2022-05-04 13:09 UTC (permalink / raw) To: Eric Dumazet, patchwork-bot+netdevbpf Cc: Tetsuo Handa, Santosh Shilimkar, David Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, linux-rdma On Tue, 2022-05-03 at 14:17 -0700, Eric Dumazet wrote: > On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote: > > > > Hello: > > > > This patch was applied to netdev/net.git (master) > > by Paolo Abeni <pabeni@redhat.com>: > > > > On Mon, 2 May 2022 10:40:18 +0900 you wrote: > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > > refcount on net namespace. Since TCP's retransmission can happen after > > > a process which created net namespace terminated, we need to explicitly > > > acquire a refcount. > > > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > > > [...] > > > > Here is the summary with links: > > - [v2] net: rds: acquire refcount on TCP sockets > > https://git.kernel.org/netdev/net/c/3a58f13a881e > > > > You are awesome, thank you! > > -- > > Deet-doot-dot, I am a bot. > > https://korg.docs.kernel.org/patchwork/pwbot.html > > > > > > I think we merged this patch too soon. My fault. > My question is : What prevents rds_tcp_conn_path_connect(), and thus > rds_tcp_tune() to be called > after the netns refcount already reached 0 ? > > I guess we can wait for next syzbot report, but I think that get_net() > should be replaced > by maybe_get_net() > Should we revert this patch before the next pull request, if a suitable incremental fix is not available by then? It looks like the window of opportunity for the race is roughly the same? Thanks! Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets 2022-05-04 13:09 ` [PATCH v2] net: rds: acquire " Paolo Abeni @ 2022-05-04 13:25 ` Eric Dumazet 0 siblings, 0 replies; 30+ messages in thread From: Eric Dumazet @ 2022-05-04 13:25 UTC (permalink / raw) To: Paolo Abeni Cc: patchwork-bot+netdevbpf, Tetsuo Handa, Santosh Shilimkar, David Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs, linux-rdma On Wed, May 4, 2022 at 6:09 AM Paolo Abeni <pabeni@redhat.com> wrote: > > On Tue, 2022-05-03 at 14:17 -0700, Eric Dumazet wrote: > > On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote: > > > > > > Hello: > > > > > > This patch was applied to netdev/net.git (master) > > > by Paolo Abeni <pabeni@redhat.com>: > > > > > > On Mon, 2 May 2022 10:40:18 +0900 you wrote: > > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > > > refcount on net namespace. Since TCP's retransmission can happen after > > > > a process which created net namespace terminated, we need to explicitly > > > > acquire a refcount. > > > > > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > > > > > [...] > > > > > > Here is the summary with links: > > > - [v2] net: rds: acquire refcount on TCP sockets > > > https://git.kernel.org/netdev/net/c/3a58f13a881e > > > > > > You are awesome, thank you! > > > -- > > > Deet-doot-dot, I am a bot. > > > https://korg.docs.kernel.org/patchwork/pwbot.html > > > > > > > > > > I think we merged this patch too soon. > > My fault. > > > > My question is : What prevents rds_tcp_conn_path_connect(), and thus > > rds_tcp_tune() to be called > > after the netns refcount already reached 0 ? > > > > I guess we can wait for next syzbot report, but I think that get_net() > > should be replaced > > by maybe_get_net() > > > Should we revert this patch before the next pull request, if a suitable > incremental fix is not available by then? > > It looks like the window of opportunity for the race is roughly the > same? > No need to revert the patch, we certainly are in a better situation, as refcount_t helps here. We can refine the logic in a followup. Thanks. ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2022-05-06 1:20 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <00000000000045dc96059f4d7b02@google.com>
[not found] ` <000000000000f75af905d3ba0716@google.com>
[not found] ` <c389e47f-8f82-fd62-8c1d-d9481d2f71ff@I-love.SAKURA.ne.jp>
2022-04-22 14:40 ` [syzbot] KASAN: use-after-free Read in tcp_retransmit_timer (5) Tetsuo Handa
2022-04-24 3:57 ` Tetsuo Handa
2022-05-01 15:29 ` [PATCH] net: rds: acquire refcount on TCP sockets Tetsuo Handa
2022-05-01 16:14 ` Eric Dumazet
2022-05-02 1:40 ` [PATCH v2] " Tetsuo Handa
2022-05-02 14:12 ` Haakon Bugge
2022-05-02 14:29 ` Tetsuo Handa
2022-05-03 9:02 ` Paolo Abeni
2022-05-03 9:56 ` Tetsuo Handa
2022-05-03 11:10 ` Paolo Abeni
2022-05-03 13:27 ` David Laight
2022-05-03 13:43 ` Eric Dumazet
2022-05-03 14:25 ` David Laight
2022-05-03 13:45 ` Eric Dumazet
2022-05-03 14:08 ` Tetsuo Handa
2022-05-03 11:40 ` patchwork-bot+netdevbpf
2022-05-03 21:17 ` Eric Dumazet
2022-05-03 22:37 ` Eric Dumazet
2022-05-04 1:04 ` Tetsuo Handa
2022-05-04 3:09 ` Eric Dumazet
2022-05-04 4:58 ` Tetsuo Handa
2022-05-04 15:15 ` Tetsuo Handa
2022-05-05 0:45 ` [PATCH] net: rds: use maybe_get_net() when acquiring " Tetsuo Handa
2022-05-05 0:53 ` Eric Dumazet
2022-05-05 1:04 ` Jakub Kicinski
2022-05-05 1:53 ` [PATCH net v2] " Tetsuo Handa
2022-05-05 19:13 ` Eric Dumazet
2022-05-06 1:20 ` patchwork-bot+netdevbpf
2022-05-04 13:09 ` [PATCH v2] net: rds: acquire " Paolo Abeni
2022-05-04 13:25 ` Eric Dumazet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).