* Re: [syzbot] KASAN: use-after-free Read in tcp_retransmit_timer (5)
[not found] ` <c389e47f-8f82-fd62-8c1d-d9481d2f71ff@I-love.SAKURA.ne.jp>
@ 2022-04-22 14:40 ` Tetsuo Handa
2022-04-24 3:57 ` Tetsuo Handa
0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2022-04-22 14:40 UTC (permalink / raw)
To: Santosh Shilimkar, OFED mailing list
Cc: syzbot, andrii, andriin, ast, daniel, davem, dsahern, edumazet,
john.fastabend, kafai, kpsingh, kuba, kuznet, netdev,
songliubraving, syzkaller-bugs, tpa, yhs, yoshfuji, bpf
Hello, RDS developers.
I was thinking that BPF program is relevant with the TCP/IPv6 socket triggering
use-after-free access. But disassembling syzkaller-generated BPF program concluded
that what "char program[2053]" is doing is not important
( https://lkml.kernel.org/r/d21e278f-a3ff-8603-f6ba-b51a8cddafa8@I-love.SAKURA.ne.jp ).
Then, I realized that TCP/IPv6 port 16385 (which the reproducer is accessing) is
used by kernel RDS server, which can explain
"It seems that a socket with sk->sk_net_refcnt=0 is created by unshare(CLONE_NEWNET)"
at https://lkml.kernel.org/r/fa445f0e-32b7-5e0d-9326-94bc5adba4c1@I-love.SAKURA.ne.jp
because the kernel RDS server starts during boot procedure.
------------------------------------------------------------
root@fuzz:~# unshare -n netstat -tanpe
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name
tcp6 0 0 :::16385 :::* LISTEN 0 19627 -
------------------------------------------------------------
With the debug printk() patch shown below,
------------------------------------------------------------
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 0ec2f5906a27..20b3c42b4140 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -429,7 +429,8 @@ static void net_free(struct net *net)
{
if (refcount_dec_and_test(&net->passive)) {
kfree(rcu_access_pointer(net->gen));
- kmem_cache_free(net_cachep, net);
+ memset(net, POISON_FREE, sizeof(struct net));
+ //kmem_cache_free(net_cachep, net);
}
}
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 09cadd556d1e..5792fe3df8ac 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -146,10 +146,9 @@ int rds_tcp_accept_one(struct socket *sock)
my_addr = &saddr;
peer_addr = &daddr;
#endif
- rdsdebug("accepted family %d tcp %pI6c:%u -> %pI6c:%u\n",
- sock->sk->sk_family,
- my_addr, ntohs(inet->inet_sport),
- peer_addr, ntohs(inet->inet_dport));
+ pr_info("accepted family %d tcp %pI6c:%u -> %pI6c:%u refcnt=%d sock_net=%px init_net=%px\n",
+ sock->sk->sk_family, my_addr, ntohs(inet->inet_sport), peer_addr,
+ ntohs(inet->inet_dport), sock->sk->sk_net_refcnt, sock_net(sock->sk), &init_net);
#if IS_ENABLED(CONFIG_IPV6)
/* sk_bound_dev_if is not set if the peer address is not link local
------------------------------------------------------------
I get
accepted family 10 tcp ::ffff:127.0.0.1:16385 -> ::ffff:127.0.0.1:33086 refcnt=0 sock_net=ffffffff860d89c0 init_net=ffffffff860d89c0
if I do
# echo > /dev/tcp/127.0.0.1/16385
from init_net namespace, and I get
accepted family 10 tcp ::ffff:127.0.0.1:16385 -> ::ffff:127.0.0.1:33088 refcnt=0 sock_net=ffff88810a208000 init_net=ffffffff860d89c0
if I do
# echo > /dev/tcp/127.0.0.1/16385
from non-init_net namespace. Note that sock->sk->sk_net_refcnt is 0 in both cases.
Like commit 2303f994b3e18709 ("mptcp: Associate MPTCP context with TCP socket") says
/* kernel sockets do not by default acquire net ref, but TCP timer
* needs it.
*/
, I came to feel that e.g. rds_tcp_accept_one() is accessing sock_net(sock->sk) on
accepted sockets with sock->sk->sk_net_refcnt=0 (because the listening socket was
created by kernel) is causing this problem. Why not rds kernel server does
sock->sk->sk_net_refcnt = 1;
get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
sock_inuse_add(net, 1);
on accepted sockets like mptcp_subflow_create_socket() does?
For your testing, below is the latest reproducer.
You can try this reproducer with keep-memory-poisoned patch shown above.
------------------------------------------------------------
// https://syzkaller.appspot.com/bug?id=8f0e04b2beffcd42f044d46879cc224f6eb71a99
// autogenerated by syzkaller (https://github.com/google/syzkaller)
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <net/if.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#ifndef MSG_PROBE
#define MSG_PROBE 0x10
#endif
struct nlmsg {
char* pos;
int nesting;
struct nlattr* nested[8];
char buf[4096];
};
static void netlink_init(struct nlmsg* nlmsg, int typ, int flags,
const void* data, int size)
{
memset(nlmsg, 0, sizeof(*nlmsg));
struct nlmsghdr* hdr = (struct nlmsghdr*)nlmsg->buf;
hdr->nlmsg_type = typ;
hdr->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | flags;
memcpy(hdr + 1, data, size);
nlmsg->pos = (char*)(hdr + 1) + NLMSG_ALIGN(size);
}
static void netlink_attr(struct nlmsg* nlmsg, int typ, const void* data,
int size)
{
struct nlattr* attr = (struct nlattr*)nlmsg->pos;
attr->nla_len = sizeof(*attr) + size;
attr->nla_type = typ;
if (size > 0)
memcpy(attr + 1, data, size);
nlmsg->pos += NLMSG_ALIGN(attr->nla_len);
}
static int netlink_send_ext(struct nlmsg* nlmsg, int sock, uint16_t reply_type,
int* reply_len, bool dofail)
{
if (nlmsg->pos > nlmsg->buf + sizeof(nlmsg->buf) || nlmsg->nesting)
exit(1);
struct nlmsghdr* hdr = (struct nlmsghdr*)nlmsg->buf;
hdr->nlmsg_len = nlmsg->pos - nlmsg->buf;
struct sockaddr_nl addr;
memset(&addr, 0, sizeof(addr));
addr.nl_family = AF_NETLINK;
ssize_t n = sendto(sock, nlmsg->buf, hdr->nlmsg_len, 0,
(struct sockaddr*)&addr, sizeof(addr));
if (n != (ssize_t)hdr->nlmsg_len) {
if (dofail)
exit(1);
return -1;
}
n = recv(sock, nlmsg->buf, sizeof(nlmsg->buf), 0);
if (reply_len)
*reply_len = 0;
if (n < 0) {
if (dofail)
exit(1);
return -1;
}
if (n < (ssize_t)sizeof(struct nlmsghdr)) {
errno = EINVAL;
if (dofail)
exit(1);
return -1;
}
if (hdr->nlmsg_type == NLMSG_DONE)
return 0;
if (reply_len && hdr->nlmsg_type == reply_type) {
*reply_len = n;
return 0;
}
if (n < (ssize_t)(sizeof(struct nlmsghdr) + sizeof(struct nlmsgerr))) {
errno = EINVAL;
if (dofail)
exit(1);
return -1;
}
if (hdr->nlmsg_type != NLMSG_ERROR) {
errno = EINVAL;
if (dofail)
exit(1);
return -1;
}
errno = -((struct nlmsgerr*)(hdr + 1))->error;
return -errno;
}
static int netlink_send(struct nlmsg* nlmsg, int sock)
{
return netlink_send_ext(nlmsg, sock, 0, NULL, true);
}
static void netlink_device_change(int sock, const char* name, const void* mac, int macsize)
{
struct nlmsg nlmsg;
struct ifinfomsg hdr;
memset(&hdr, 0, sizeof(hdr));
hdr.ifi_flags = hdr.ifi_change = IFF_UP;
hdr.ifi_index = if_nametoindex(name);
netlink_init(&nlmsg, RTM_NEWLINK, 0, &hdr, sizeof(hdr));
netlink_attr(&nlmsg, IFLA_ADDRESS, mac, macsize);
netlink_send(&nlmsg, sock);
}
static void netlink_add_addr(int sock, const char* dev, const void* addr, int addrsize)
{
struct nlmsg nlmsg;
struct ifaddrmsg hdr;
memset(&hdr, 0, sizeof(hdr));
hdr.ifa_family = addrsize == 4 ? AF_INET : AF_INET6;
hdr.ifa_prefixlen = addrsize == 4 ? 24 : 120;
hdr.ifa_scope = RT_SCOPE_UNIVERSE;
hdr.ifa_index = if_nametoindex(dev);
netlink_init(&nlmsg, RTM_NEWADDR, NLM_F_CREATE | NLM_F_REPLACE, &hdr,
sizeof(hdr));
netlink_attr(&nlmsg, IFA_LOCAL, addr, addrsize);
netlink_attr(&nlmsg, IFA_ADDRESS, addr, addrsize);
netlink_send(&nlmsg, sock);
}
static void netlink_add_addr4(int sock, const char* dev, const char* addr)
{
struct in_addr in_addr;
inet_pton(AF_INET, addr, &in_addr);
netlink_add_addr(sock, dev, &in_addr, sizeof(in_addr));
}
static void netlink_add_addr6(int sock, const char* dev, const char* addr)
{
struct in6_addr in6_addr;
inet_pton(AF_INET6, addr, &in6_addr);
netlink_add_addr(sock, dev, &in6_addr, sizeof(in6_addr));
}
static void initialize_netdevices(void)
{
int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
uint64_t macaddr = 0x00aaaaaaaaaa;
if (fd == EOF)
exit(1);
netlink_add_addr4(fd, "lo", "127.0.0.1");
netlink_add_addr6(fd, "lo", "::1");
netlink_device_change(fd, "lo", &macaddr, ETH_ALEN);
close(fd);
}
#ifndef __NR_bpf
#define __NR_bpf 321
#endif
static void execute_one(void)
{
const union bpf_attr attr = {
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.insn_cnt = 2,
.insns = (unsigned long long) "\xb7\x00\x00\x00\x00\x00\x00\x00\x95\x00\x00\x00\x00\x00\x00\x00",
.license = (unsigned long long) "GPL",
};
struct sockaddr_in addr = {
.sin_family = AF_INET,
.sin_port = htons(0x4001), /* where kernel RDS TCPv6 socket is listening */
.sin_addr.s_addr = inet_addr("127.0.0.1")
};
const struct msghdr msg = {
.msg_name = &addr,
.msg_namelen = sizeof(addr),
};
const int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, 72);
const int sock_fd = socket(PF_INET, SOCK_STREAM, 0);
alarm(3);
while (1) {
sendmsg(sock_fd, &msg, MSG_OOB | MSG_PROBE | MSG_CONFIRM | MSG_FASTOPEN);
setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_BPF, &bpf_fd, sizeof(bpf_fd));
}
}
int main(int argc, char *argv[])
{
if (unshare(CLONE_NEWNET))
return 1;
initialize_netdevices();
execute_one();
return 0;
}
------------------------------------------------------------
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [syzbot] KASAN: use-after-free Read in tcp_retransmit_timer (5)
2022-04-22 14:40 ` [syzbot] KASAN: use-after-free Read in tcp_retransmit_timer (5) Tetsuo Handa
@ 2022-04-24 3:57 ` Tetsuo Handa
2022-05-01 15:29 ` [PATCH] net: rds: acquire refcount on TCP sockets Tetsuo Handa
0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2022-04-24 3:57 UTC (permalink / raw)
To: Santosh Shilimkar, OFED mailing list
Cc: syzbot, andrii, andriin, ast, daniel, davem, dsahern, edumazet,
john.fastabend, kafai, kpsingh, kuba, kuznet, netdev,
songliubraving, syzkaller-bugs, tpa, yhs, yoshfuji, bpf
OK. I succeeded to reproduce this problem without BPF program.
Just dropping TCP packets is sufficient. That is, this bug should be fixed in RDS code.
------------------------------------------------------------
root@fuzz:~# unshare -n sh -c '
ip link set lo up
iptables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP
ip6tables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP
telnet 127.0.0.1 16385
dmesg -c
netstat -tanpe' < /dev/null
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Connection closed by foreign host.
[ 54.922280] accepted family 10 tcp ::ffff:127.0.0.1:16385 -> ::ffff:127.0.0.1:58780 refcnt=0 sock_net=ffff888035c98000 init_net=ffffffff860d89c0
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name
tcp 0 1 127.0.0.1:58780 127.0.0.1:16385 FIN_WAIT1 0 0 -
tcp6 0 0 :::16385 :::* LISTEN 0 18301 -
tcp6 1 1 127.0.0.1:16385 127.0.0.1:58780 LAST_ACK 0 0 -
------------------------------------------------------------
------------------------------------------------------------
fuzz login: [ 54.849128][ T2718] ip (2718) used greatest stack depth: 11192 bytes left
[ 54.922280][ T764] accepted family 10 tcp ::ffff:127.0.0.1:16385 -> ::ffff:127.0.0.1:58780 refcnt=0 sock_net=ffff888035c98000 init_net=ffffffff860d89c0
[ 224.330990][ C0] general protection fault, probably for non-canonical address 0x6b6af3ebe92b6bc3: 0000 [#1] PREEMPT SMP
[ 224.344491][ C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.18.0-rc3-00016-gb253435746d9-dirty #767
[ 224.355974][ C0] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 224.361184][ C0] RIP: 0010:__tcp_transmit_skb+0x5e5/0xbf0
[ 224.364559][ C0] Code: 0f 84 33 05 00 00 4c 89 2c 24 49 89 c5 48 c7 40 10 00 00 00 00 e9 c0 fa ff ff 49 8b 46 30 41 0f b7 55 30 48 8b 80 b8 02 00 00 <65> 48 01 50 58 e9 8e fe ff ff 41 8b 86 fc 08 00 00 48 69 c0 e8 03
[ 224.375318][ C0] RSP: 0018:ffffc90000003d38 EFLAGS: 00010297
[ 224.378682][ C0] RAX: 6b6b6b6b6b6b6b6b RBX: 000000009e2a2659 RCX: ffff888104a39000
[ 224.383253][ C0] RDX: 0000000000000001 RSI: ffff8881008054e0 RDI: ffff888035340000
[ 224.387171][ C0] RBP: ffff888100805508 R08: 0000000000000000 R09: 0000000000000000
[ 224.389612][ C0] R10: ffff888104a39140 R11: 0000000000000000 R12: 0000000000000001
[ 224.392646][ C0] R13: ffff8881008054e0 R14: ffff888035340000 R15: 0000000000000020
[ 224.395626][ C0] FS: 0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[ 224.398662][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.400880][ C0] CR2: 000056264812f99c CR3: 000000000a58e000 CR4: 00000000000506f0
[ 224.403964][ C0] Call Trace:
[ 224.405212][ C0] <IRQ>
[ 224.406355][ C0] ? tcp_write_timer_handler+0x280/0x280
[ 224.408259][ C0] tcp_write_wakeup+0x112/0x160
[ 224.409932][ C0] ? ktime_get+0x1cb/0x260
[ 224.411636][ C0] tcp_send_probe0+0x13/0x150
[ 224.413393][ C0] tcp_write_timer_handler+0x248/0x280
[ 224.415433][ C0] tcp_write_timer+0xa5/0x110
[ 224.417040][ C0] ? tcp_write_timer_handler+0x280/0x280
[ 224.419142][ C0] call_timer_fn+0xa6/0x300
[ 224.420949][ C0] __run_timers.part.0+0x209/0x320
[ 224.422915][ C0] run_timer_softirq+0x2c/0x60
[ 224.424791][ C0] __do_softirq+0x174/0x53f
[ 224.426462][ C0] __irq_exit_rcu+0xcb/0x120
[ 224.428188][ C0] irq_exit_rcu+0x5/0x20
[ 224.430176][ C0] sysvec_apic_timer_interrupt+0x8e/0xc0
[ 224.432301][ C0] </IRQ>
[ 224.433394][ C0] <TASK>
[ 224.434514][ C0] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 224.436500][ C0] RIP: 0010:default_idle+0xb/0x10
[ 224.438220][ C0] Code: 8b 04 25 40 af 01 00 f0 80 60 02 df c3 0f ae f0 0f ae 38 0f ae f0 eb b9 0f 1f 80 00 00 00 00 eb 07 0f 00 2d e3 b6 56 00 fb f4 <c3> cc cc cc cc 53 48 89 fb e8 67 fb fe ff 48 8b 15 a0 91 4e 02 89
[ 224.444865][ C0] RSP: 0018:ffffffff83e03ea8 EFLAGS: 00000202
[ 224.447077][ C0] RAX: 00000000000223b5 RBX: ffffffff83e61a00 RCX: 0000000000000001
[ 224.449957][ C0] RDX: 0000000000000000 RSI: ffffffff832e9bf1 RDI: ffffffff83246666
[ 224.452916][ C0] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
[ 224.455677][ C0] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[ 224.458458][ C0] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 224.461642][ C0] default_idle_call+0x54/0x90
[ 224.463888][ C0] do_idle+0x1f3/0x240
[ 224.465531][ C0] cpu_startup_entry+0x14/0x20
[ 224.467193][ C0] start_kernel+0x69c/0x6c1
[ 224.469040][ C0] secondary_startup_64_no_verify+0xc3/0xcb
[ 224.471179][ C0] </TASK>
[ 224.472438][ C0] Modules linked in:
[ 224.474387][ C0] ---[ end trace 0000000000000000 ]---
[ 224.476521][ C0] RIP: 0010:__tcp_transmit_skb+0x5e5/0xbf0
[ 224.478893][ C0] Code: 0f 84 33 05 00 00 4c 89 2c 24 49 89 c5 48 c7 40 10 00 00 00 00 e9 c0 fa ff ff 49 8b 46 30 41 0f b7 55 30 48 8b 80 b8 02 00 00 <65> 48 01 50 58 e9 8e fe ff ff 41 8b 86 fc 08 00 00 48 69 c0 e8 03
[ 224.485948][ C0] RSP: 0018:ffffc90000003d38 EFLAGS: 00010297
[ 224.488110][ C0] RAX: 6b6b6b6b6b6b6b6b RBX: 000000009e2a2659 RCX: ffff888104a39000
[ 224.491186][ C0] RDX: 0000000000000001 RSI: ffff8881008054e0 RDI: ffff888035340000
[ 224.494378][ C0] RBP: ffff888100805508 R08: 0000000000000000 R09: 0000000000000000
[ 224.497576][ C0] R10: ffff888104a39140 R11: 0000000000000000 R12: 0000000000000001
[ 224.500600][ C0] R13: ffff8881008054e0 R14: ffff888035340000 R15: 0000000000000020
[ 224.503814][ C0] FS: 0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[ 224.507136][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.509421][ C0] CR2: 000056264812f99c CR3: 000000000a58e000 CR4: 00000000000506f0
[ 224.512699][ C0] Kernel panic - not syncing: Fatal exception in interrupt
[ 224.515847][ C0] Kernel Offset: disabled
[ 224.517636][ C0] Rebooting in 10 seconds..
------------------------------------------------------------
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] net: rds: acquire refcount on TCP sockets
2022-04-24 3:57 ` Tetsuo Handa
@ 2022-05-01 15:29 ` Tetsuo Handa
2022-05-01 16:14 ` Eric Dumazet
0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2022-05-01 15:29 UTC (permalink / raw)
To: Santosh Shilimkar, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: syzbot, netdev, syzkaller-bugs, OFED mailing list
syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
for TCP socket used by RDS is accessing sock_net() without acquiring a
refcount on net namespace. Since TCP's retransmission can happen after
a process which created net namespace terminated, we need to explicitly
acquire a refcount.
Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
---
net/rds/tcp.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 5327d130c4b5..8015d2695784 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -493,6 +493,15 @@ void rds_tcp_tune(struct socket *sock)
struct net *net = sock_net(sk);
struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+ /* TCP timer functions might access net namespace even after
+ * a process which created this net namespace terminated.
+ */
+ if (!sk->sk_net_refcnt) {
+ sk->sk_net_refcnt = 1;
+ get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
+ sock_inuse_add(net, 1);
+ }
+
tcp_sock_set_nodelay(sock->sk);
lock_sock(sk);
if (rtn->sndbuf_size > 0) {
--
2.34.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] net: rds: acquire refcount on TCP sockets
2022-05-01 15:29 ` [PATCH] net: rds: acquire refcount on TCP sockets Tetsuo Handa
@ 2022-05-01 16:14 ` Eric Dumazet
2022-05-02 1:40 ` [PATCH v2] " Tetsuo Handa
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2022-05-01 16:14 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, Paolo Abeni,
syzbot, netdev, syzkaller-bugs, OFED mailing list
On Sun, May 1, 2022 at 8:29 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> for TCP socket used by RDS is accessing sock_net() without acquiring a
> refcount on net namespace. Since TCP's retransmission can happen after
> a process which created net namespace terminated, we need to explicitly
> acquire a refcount.
>
Please add a Fixes: tag
> Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> ---
> net/rds/tcp.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 5327d130c4b5..8015d2695784 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -493,6 +493,15 @@ void rds_tcp_tune(struct socket *sock)
> struct net *net = sock_net(sk);
> struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
>
> + /* TCP timer functions might access net namespace even after
> + * a process which created this net namespace terminated.
> + */
Please move this after the lock_sock(sk) [1], so that we are protected
correctly ?
> + if (!sk->sk_net_refcnt) {
> + sk->sk_net_refcnt = 1;
> + get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> + sock_inuse_add(net, 1);
> + }
> +
> tcp_sock_set_nodelay(sock->sk);
> lock_sock(sk);
[1] Here.
> if (rtn->sndbuf_size > 0) {
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-01 16:14 ` Eric Dumazet
@ 2022-05-02 1:40 ` Tetsuo Handa
2022-05-02 14:12 ` Haakon Bugge
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: Tetsuo Handa @ 2022-05-02 1:40 UTC (permalink / raw)
To: Eric Dumazet
Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, Paolo Abeni,
syzbot, netdev, syzkaller-bugs, OFED mailing list
syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
for TCP socket used by RDS is accessing sock_net() without acquiring a
refcount on net namespace. Since TCP's retransmission can happen after
a process which created net namespace terminated, we need to explicitly
acquire a refcount.
Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
---
Changes in v2:
Add Fixes: tag.
Move to inside lock_sock() section.
I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
to RDS") was added to 2.6.32.
net/rds/tcp.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 5327d130c4b5..2f638f8b7b1e 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
tcp_sock_set_nodelay(sock->sk);
lock_sock(sk);
+ /* TCP timer functions might access net namespace even after
+ * a process which created this net namespace terminated.
+ */
+ if (!sk->sk_net_refcnt) {
+ sk->sk_net_refcnt = 1;
+ get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
+ sock_inuse_add(net, 1);
+ }
if (rtn->sndbuf_size > 0) {
sk->sk_sndbuf = rtn->sndbuf_size;
sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
--
2.34.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-02 1:40 ` [PATCH v2] " Tetsuo Handa
@ 2022-05-02 14:12 ` Haakon Bugge
2022-05-02 14:29 ` Tetsuo Handa
2022-05-03 9:02 ` Paolo Abeni
2022-05-03 11:40 ` patchwork-bot+netdevbpf
2 siblings, 1 reply; 30+ messages in thread
From: Haakon Bugge @ 2022-05-02 14:12 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Eric Dumazet, Santosh Shilimkar, David S. Miller, Jakub Kicinski,
Paolo Abeni, syzbot, netdev, syzkaller-bugs, OFED mailing list
> On 2 May 2022, at 03:40, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> for TCP socket used by RDS is accessing sock_net() without acquiring a
> refcount on net namespace. Since TCP's retransmission can happen after
> a process which created net namespace terminated, we need to explicitly
> acquire a refcount.
>
> Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> ---
> Changes in v2:
> Add Fixes: tag.
> Move to inside lock_sock() section.
>
> I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> to RDS") was added to 2.6.32.
>
> net/rds/tcp.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 5327d130c4b5..2f638f8b7b1e 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
>
> tcp_sock_set_nodelay(sock->sk);
> lock_sock(sk);
> + /* TCP timer functions might access net namespace even after
> + * a process which created this net namespace terminated.
> + */
> + if (!sk->sk_net_refcnt) {
> + sk->sk_net_refcnt = 1;
> + get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
Don't you need a corresponding put_net_track()?
Thxs, Håkon
> + sock_inuse_add(net, 1);
> + }
> if (rtn->sndbuf_size > 0) {
> sk->sk_sndbuf = rtn->sndbuf_size;
> sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> --
> 2.34.1
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-02 14:12 ` Haakon Bugge
@ 2022-05-02 14:29 ` Tetsuo Handa
0 siblings, 0 replies; 30+ messages in thread
From: Tetsuo Handa @ 2022-05-02 14:29 UTC (permalink / raw)
To: Haakon Bugge
Cc: Eric Dumazet, Santosh Shilimkar, David S. Miller, Jakub Kicinski,
Paolo Abeni, syzbot, netdev, syzkaller-bugs, OFED mailing list
On 2022/05/02 23:12, Haakon Bugge wrote:
>> + /* TCP timer functions might access net namespace even after
>> + * a process which created this net namespace terminated.
>> + */
>> + if (!sk->sk_net_refcnt) {
>> + sk->sk_net_refcnt = 1;
>> + get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
>
> Don't you need a corresponding put_net_track()?
__sk_free() and __sk_destruct() will do if sk->sk_net_refcnt is set.
>
>> + sock_inuse_add(net, 1);
>> + }
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-02 1:40 ` [PATCH v2] " Tetsuo Handa
2022-05-02 14:12 ` Haakon Bugge
@ 2022-05-03 9:02 ` Paolo Abeni
2022-05-03 9:56 ` Tetsuo Handa
` (2 more replies)
2022-05-03 11:40 ` patchwork-bot+netdevbpf
2 siblings, 3 replies; 30+ messages in thread
From: Paolo Abeni @ 2022-05-03 9:02 UTC (permalink / raw)
To: Tetsuo Handa, Eric Dumazet
Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot,
netdev, syzkaller-bugs, OFED mailing list
Hello,
On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote:
> syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> for TCP socket used by RDS is accessing sock_net() without acquiring a
> refcount on net namespace. Since TCP's retransmission can happen after
> a process which created net namespace terminated, we need to explicitly
> acquire a refcount.
>
> Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> ---
> Changes in v2:
> Add Fixes: tag.
> Move to inside lock_sock() section.
>
> I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> to RDS") was added to 2.6.32.
>
> net/rds/tcp.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 5327d130c4b5..2f638f8b7b1e 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
>
> tcp_sock_set_nodelay(sock->sk);
> lock_sock(sk);
> + /* TCP timer functions might access net namespace even after
> + * a process which created this net namespace terminated.
> + */
> + if (!sk->sk_net_refcnt) {
> + sk->sk_net_refcnt = 1;
> + get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> + sock_inuse_add(net, 1);
> + }
> if (rtn->sndbuf_size > 0) {
> sk->sk_sndbuf = rtn->sndbuf_size;
> sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
This looks equivalent to the fix presented here:
https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
but the latter looks a more generic solution. @Tetsuo could you please
test the above in your setup?
Thanks!
Paolo
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 9:02 ` Paolo Abeni
@ 2022-05-03 9:56 ` Tetsuo Handa
2022-05-03 11:10 ` Paolo Abeni
2022-05-03 13:27 ` David Laight
2022-05-03 13:45 ` Eric Dumazet
2 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2022-05-03 9:56 UTC (permalink / raw)
To: Paolo Abeni, Eric Dumazet
Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot,
netdev, syzkaller-bugs, OFED mailing list
On 2022/05/03 18:02, Paolo Abeni wrote:
> This looks equivalent to the fix presented here:
Not equivalent.
>
> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
>
> but the latter looks a more generic solution. @Tetsuo could you please
> test the above in your setup?
I already tested that fix, and the result was
https://lore.kernel.org/all/78cdbf25-4511-a567-bb09-0c07edae8b50@I-love.SAKURA.ne.jp/ .
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 9:56 ` Tetsuo Handa
@ 2022-05-03 11:10 ` Paolo Abeni
0 siblings, 0 replies; 30+ messages in thread
From: Paolo Abeni @ 2022-05-03 11:10 UTC (permalink / raw)
To: Tetsuo Handa, Eric Dumazet
Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot,
netdev, syzkaller-bugs, OFED mailing list
On Tue, 2022-05-03 at 18:56 +0900, Tetsuo Handa wrote:
> On 2022/05/03 18:02, Paolo Abeni wrote:
>
> > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
> >
> > but the latter looks a more generic solution. @Tetsuo could you please
> > test the above in your setup?
>
> I already tested that fix, and the result was
> https://lore.kernel.org/all/78cdbf25-4511-a567-bb09-0c07edae8b50@I-love.SAKURA.ne.jp/ .
Thanks, I somewhat missed that reply.
Paolo
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-02 1:40 ` [PATCH v2] " Tetsuo Handa
2022-05-02 14:12 ` Haakon Bugge
2022-05-03 9:02 ` Paolo Abeni
@ 2022-05-03 11:40 ` patchwork-bot+netdevbpf
2022-05-03 21:17 ` Eric Dumazet
2 siblings, 1 reply; 30+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-05-03 11:40 UTC (permalink / raw)
To: Tetsuo Handa
Cc: edumazet, santosh.shilimkar, davem, kuba, pabeni,
syzbot+694120e1002c117747ed, netdev, syzkaller-bugs, linux-rdma
Hello:
This patch was applied to netdev/net.git (master)
by Paolo Abeni <pabeni@redhat.com>:
On Mon, 2 May 2022 10:40:18 +0900 you wrote:
> syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> for TCP socket used by RDS is accessing sock_net() without acquiring a
> refcount on net namespace. Since TCP's retransmission can happen after
> a process which created net namespace terminated, we need to explicitly
> acquire a refcount.
>
> Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
>
> [...]
Here is the summary with links:
- [v2] net: rds: acquire refcount on TCP sockets
https://git.kernel.org/netdev/net/c/3a58f13a881e
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 9:02 ` Paolo Abeni
2022-05-03 9:56 ` Tetsuo Handa
@ 2022-05-03 13:27 ` David Laight
2022-05-03 13:43 ` Eric Dumazet
2022-05-03 13:45 ` Eric Dumazet
2 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2022-05-03 13:27 UTC (permalink / raw)
To: 'Paolo Abeni', Tetsuo Handa, Eric Dumazet
Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot,
netdev, syzkaller-bugs, OFED mailing list
From: Paolo Abeni
> Sent: 03 May 2022 10:03
>
> Hello,
>
> On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote:
> > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > refcount on net namespace. Since TCP's retransmission can happen after
> > a process which created net namespace terminated, we need to explicitly
> > acquire a refcount.
> >
> > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a
> kernel socket")
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > ---
> > Changes in v2:
> > Add Fixes: tag.
> > Move to inside lock_sock() section.
> >
> > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> > to RDS") was added to 2.6.32.
> >
> > net/rds/tcp.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> > index 5327d130c4b5..2f638f8b7b1e 100644
> > --- a/net/rds/tcp.c
> > +++ b/net/rds/tcp.c
> > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
> >
> > tcp_sock_set_nodelay(sock->sk);
> > lock_sock(sk);
> > + /* TCP timer functions might access net namespace even after
> > + * a process which created this net namespace terminated.
> > + */
> > + if (!sk->sk_net_refcnt) {
> > + sk->sk_net_refcnt = 1;
> > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> > + sock_inuse_add(net, 1);
> > + }
> > if (rtn->sndbuf_size > 0) {
> > sk->sk_sndbuf = rtn->sndbuf_size;
> > sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
>
> This looks equivalent to the fix presented here:
>
> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
>
> but the latter looks a more generic solution. @Tetsuo could you please
> test the above in your setup?
Wouldn't a more generic solution be to add a flag to sock_create_kern()
so that it acquires a reference to the namespace?
This could be a bit on one of the existing parameters - like SOCK_NONBLOCK.
I've a driver that uses __sock_create() in order to get that reference.
I'm pretty sure the extra 'security' check will never fail.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 13:27 ` David Laight
@ 2022-05-03 13:43 ` Eric Dumazet
2022-05-03 14:25 ` David Laight
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2022-05-03 13:43 UTC (permalink / raw)
To: David Laight
Cc: Paolo Abeni, Tetsuo Handa, Santosh Shilimkar, David S. Miller,
Jakub Kicinski, syzbot, netdev, syzkaller-bugs, OFED mailing list
On Tue, May 3, 2022 at 6:27 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Paolo Abeni
> > Sent: 03 May 2022 10:03
> >
> > Hello,
> >
> > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote:
> > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > > refcount on net namespace. Since TCP's retransmission can happen after
> > > a process which created net namespace terminated, we need to explicitly
> > > acquire a refcount.
> > >
> > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a
> > kernel socket")
> > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > ---
> > > Changes in v2:
> > > Add Fixes: tag.
> > > Move to inside lock_sock() section.
> > >
> > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> > > to RDS") was added to 2.6.32.
> > >
> > > net/rds/tcp.c | 8 ++++++++
> > > 1 file changed, 8 insertions(+)
> > >
> > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> > > index 5327d130c4b5..2f638f8b7b1e 100644
> > > --- a/net/rds/tcp.c
> > > +++ b/net/rds/tcp.c
> > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
> > >
> > > tcp_sock_set_nodelay(sock->sk);
> > > lock_sock(sk);
> > > + /* TCP timer functions might access net namespace even after
> > > + * a process which created this net namespace terminated.
> > > + */
> > > + if (!sk->sk_net_refcnt) {
> > > + sk->sk_net_refcnt = 1;
> > > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> > > + sock_inuse_add(net, 1);
> > > + }
> > > if (rtn->sndbuf_size > 0) {
> > > sk->sk_sndbuf = rtn->sndbuf_size;
> > > sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> >
> > This looks equivalent to the fix presented here:
> >
> > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
> >
> > but the latter looks a more generic solution. @Tetsuo could you please
> > test the above in your setup?
>
> Wouldn't a more generic solution be to add a flag to sock_create_kern()
> so that it acquires a reference to the namespace?
> This could be a bit on one of the existing parameters - like SOCK_NONBLOCK.
>
> I've a driver that uses __sock_create() in order to get that reference.
> I'm pretty sure the extra 'security' check will never fail.
>
This would be silly really.
Definition of a 'kernel socket' is that it does not hold a reference
to the namespace.
(otherwise a netns could not be destroyed by user space)
A kernel layer using kernel sockets needs to properly dismantle them
when a namespace is destroyed.
In the RDS case, the socket was a user socket, or RDS lacked proper
tracking of all the sockets
so that they can be dismantled properly.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 9:02 ` Paolo Abeni
2022-05-03 9:56 ` Tetsuo Handa
2022-05-03 13:27 ` David Laight
@ 2022-05-03 13:45 ` Eric Dumazet
2022-05-03 14:08 ` Tetsuo Handa
2 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2022-05-03 13:45 UTC (permalink / raw)
To: Paolo Abeni
Cc: Tetsuo Handa, Santosh Shilimkar, David S. Miller, Jakub Kicinski,
syzbot, netdev, syzkaller-bugs, OFED mailing list
On Tue, May 3, 2022 at 2:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hello,
>
> On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote:
> > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > refcount on net namespace. Since TCP's retransmission can happen after
> > a process which created net namespace terminated, we need to explicitly
> > acquire a refcount.
> >
> > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > ---
> > Changes in v2:
> > Add Fixes: tag.
> > Move to inside lock_sock() section.
> >
> > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> > to RDS") was added to 2.6.32.
> >
> > net/rds/tcp.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> > index 5327d130c4b5..2f638f8b7b1e 100644
> > --- a/net/rds/tcp.c
> > +++ b/net/rds/tcp.c
> > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
> >
> > tcp_sock_set_nodelay(sock->sk);
> > lock_sock(sk);
> > + /* TCP timer functions might access net namespace even after
> > + * a process which created this net namespace terminated.
> > + */
> > + if (!sk->sk_net_refcnt) {
> > + sk->sk_net_refcnt = 1;
> > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> > + sock_inuse_add(net, 1);
> > + }
> > if (rtn->sndbuf_size > 0) {
> > sk->sk_sndbuf = rtn->sndbuf_size;
> > sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
>
> This looks equivalent to the fix presented here:
>
> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
I think this is still needed for layers (NFS ?) that dismantle their
TCP sockets whenever a netns
is dismantled. But RDS case was different, only the listener is a kernel socket.
>
> but the latter looks a more generic solution. @Tetsuo could you please
> test the above in your setup?
>
> Thanks!
>
> Paolo
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 13:45 ` Eric Dumazet
@ 2022-05-03 14:08 ` Tetsuo Handa
0 siblings, 0 replies; 30+ messages in thread
From: Tetsuo Handa @ 2022-05-03 14:08 UTC (permalink / raw)
To: Eric Dumazet, Paolo Abeni
Cc: Santosh Shilimkar, David S. Miller, Jakub Kicinski, syzbot,
netdev, syzkaller-bugs, OFED mailing list
On 2022/05/03 22:45, Eric Dumazet wrote:
>> This looks equivalent to the fix presented here:
>>
>> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
I retested the fix above using
unshare -n sh -c '
ip link set lo up
iptables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP
ip6tables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP
telnet 127.0.0.1 16385
dmesg -c
netstat -tanpe' < /dev/null
as a test case, but it seems racy; sometimes timer function is called again and crashes.
[ 426.086565][ C2] general protection fault, probably for non-canonical address 0x6b6af3ebcc3b6bc3: 0000 [#1] PREEMPT SMP KASAN
[ 426.096339][ C2] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.18.0-rc5-dirty #807
[ 426.103769][ C2] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 426.111851][ C2] RIP: 0010:__tcp_transmit_skb+0xe72/0x1b80
[ 426.117512][ C2] Code: e8 b3 ea dc fd 48 8d 7d 30 45 0f b7 77 30 e8 95 ec dc fd 48 8b 5d 30 48 8d bb b8 02 00 00 e8 85 ec dc fd 48 8b 83 b8 02 00 00 <65> 4c 01 70 58 e9 67 fd ff ff e8 ef 56 ac fd 48 8d bd d0 09 00 00
[ 426.124692][ C2] RSP: 0018:ffff888060d09ac8 EFLAGS: 00010246
[ 426.126845][ C2] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8880145c8000 RCX: ffffffff838cc28b
[ 426.129616][ C2] RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffff8880145c82b8
[ 426.132374][ C2] RBP: ffff8880129f8000 R08: 0000000000000000 R09: 0000000000000007
[ 426.135077][ C2] R10: ffffffff838cbfd4 R11: 0000000000000001 R12: ffff8880129f8760
[ 426.137793][ C2] R13: ffff88800f6e0118 R14: 0000000000000001 R15: ffff88800f6e00e8
[ 426.140489][ C2] FS: 0000000000000000(0000) GS:ffff888060d00000(0000) knlGS:0000000000000000
[ 426.143525][ C2] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 426.145792][ C2] CR2: 000055b5bb0adabc CR3: 000000000e003000 CR4: 00000000000506e0
[ 426.148509][ C2] Call Trace:
[ 426.149442][ C2] <IRQ>
[ 426.150183][ C2] ? __tcp_select_window+0x710/0x710
[ 426.151457][ C2] ? __sanitizer_cov_trace_cmp4+0x1c/0x70
[ 426.153007][ C2] ? tcp_current_mss+0x165/0x280
[ 426.154245][ C2] ? tcp_trim_head+0x300/0x300
[ 426.155396][ C2] ? find_held_lock+0x85/0xa0
[ 426.156734][ C2] ? mark_held_locks+0x65/0x90
[ 426.157967][ C2] tcp_write_wakeup+0x2e2/0x340
[ 426.159149][ C2] tcp_send_probe0+0x2a/0x2c0
[ 426.160368][ C2] tcp_write_timer_handler+0x5cb/0x670
[ 426.161740][ C2] tcp_write_timer+0x86/0x250
[ 426.162896][ C2] ? tcp_write_timer_handler+0x670/0x670
[ 426.164285][ C2] call_timer_fn+0x15d/0x5f0
[ 426.165481][ C2] ? add_timer_on+0x2e0/0x2e0
[ 426.166667][ C2] ? lock_downgrade+0x3c0/0x3c0
[ 426.167921][ C2] ? mark_held_locks+0x24/0x90
[ 426.169263][ C2] ? _raw_spin_unlock_irq+0x1f/0x40
[ 426.170564][ C2] ? tcp_write_timer_handler+0x670/0x670
[ 426.171920][ C2] __run_timers.part.0+0x523/0x740
[ 426.173181][ C2] ? call_timer_fn+0x5f0/0x5f0
[ 426.174321][ C2] ? pvclock_clocksource_read+0xdc/0x1a0
[ 426.175655][ C2] run_timer_softirq+0x66/0xe0
[ 426.176825][ C2] __do_softirq+0x1c2/0x670
[ 426.177944][ C2] __irq_exit_rcu+0xf8/0x140
[ 426.179120][ C2] irq_exit_rcu+0x5/0x20
[ 426.180150][ C2] sysvec_apic_timer_interrupt+0x8e/0xc0
[ 426.181486][ C2] </IRQ>
[ 426.182180][ C2] <TASK>
[ 426.182845][ C2] asm_sysvec_apic_timer_interrupt+0x12/0x20
>
> I think this is still needed for layers (NFS ?) that dismantle their
> TCP sockets whenever a netns
> is dismantled. But RDS case was different, only the listener is a kernel socket.
We can't apply the fix above.
I think that the fundamental problem is that we use net->ns.count for both
"avoiding use-after-free" purpose and "allowing dismantle from user event" purpose.
Why not to use separated counters?
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 13:43 ` Eric Dumazet
@ 2022-05-03 14:25 ` David Laight
0 siblings, 0 replies; 30+ messages in thread
From: David Laight @ 2022-05-03 14:25 UTC (permalink / raw)
To: 'Eric Dumazet'
Cc: Paolo Abeni, Tetsuo Handa, Santosh Shilimkar, David S. Miller,
Jakub Kicinski, syzbot, netdev, syzkaller-bugs, OFED mailing list
From: Eric Dumazet
> Sent: 03 May 2022 14:43
>
> On Tue, May 3, 2022 at 6:27 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Paolo Abeni
> > > Sent: 03 May 2022 10:03
> > >
> > > Hello,
> > >
> > > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote:
> > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > > > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > > > refcount on net namespace. Since TCP's retransmission can happen after
> > > > a process which created net namespace terminated, we need to explicitly
> > > > acquire a refcount.
> > > >
> > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel
> sockets.")
> > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a
> > > kernel socket")
> > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > > ---
> > > > Changes in v2:
> > > > Add Fixes: tag.
> > > > Move to inside lock_sock() section.
> > > >
> > > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> > > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> > > > to RDS") was added to 2.6.32.
> > > >
> > > > net/rds/tcp.c | 8 ++++++++
> > > > 1 file changed, 8 insertions(+)
> > > >
> > > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> > > > index 5327d130c4b5..2f638f8b7b1e 100644
> > > > --- a/net/rds/tcp.c
> > > > +++ b/net/rds/tcp.c
> > > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
> > > >
> > > > tcp_sock_set_nodelay(sock->sk);
> > > > lock_sock(sk);
> > > > + /* TCP timer functions might access net namespace even after
> > > > + * a process which created this net namespace terminated.
> > > > + */
> > > > + if (!sk->sk_net_refcnt) {
> > > > + sk->sk_net_refcnt = 1;
> > > > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> > > > + sock_inuse_add(net, 1);
> > > > + }
> > > > if (rtn->sndbuf_size > 0) {
> > > > sk->sk_sndbuf = rtn->sndbuf_size;
> > > > sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> > >
> > > This looks equivalent to the fix presented here:
> > >
> > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
> > >
> > > but the latter looks a more generic solution. @Tetsuo could you please
> > > test the above in your setup?
> >
> > Wouldn't a more generic solution be to add a flag to sock_create_kern()
> > so that it acquires a reference to the namespace?
> > This could be a bit on one of the existing parameters - like SOCK_NONBLOCK.
> >
> > I've a driver that uses __sock_create() in order to get that reference.
> > I'm pretty sure the extra 'security' check will never fail.
> >
>
> This would be silly really.
>
> Definition of a 'kernel socket' is that it does not hold a reference
> to the namespace.
> (otherwise a netns could not be destroyed by user space)
>
> A kernel layer using kernel sockets needs to properly dismantle them
> when a namespace is destroyed.
I think it depends on why the driver is using a socket.
If the driver is a 'user' of a TCP connection that happens to
be is a kernel driver then holding the a reference to the namespace
is no different to an application socket holding a reference.
An example might be nfs/tcp - you need to unmount the filesystem
before you can delete the namespace.
OTOH if part of a protocol stack is using a socket for internal
calls (I think I've seen routing sockets used that way) then the
presence of the socket probably shouldn't stop the namespace
being deleted.
Listening sockets are a slight problem - probably for userspace as well.
It would be nicer to be able to get TCP (etc) to error out listening
sockets if they are the only thing stopping a namespace being deleted.
> In the RDS case, the socket was a user socket, or RDS lacked proper
> tracking of all the sockets
> so that they can be dismantled properly.
I think they probably are sockets created in order act on requests
from applications.
I think they should have the same effect on namespaces as a direct
user socket - you can't delete the socket while the connection is
active.
Kill all the relevant processes, tell the driver to stop, and you
can delete the namespace.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 11:40 ` patchwork-bot+netdevbpf
@ 2022-05-03 21:17 ` Eric Dumazet
2022-05-03 22:37 ` Eric Dumazet
2022-05-04 13:09 ` [PATCH v2] net: rds: acquire " Paolo Abeni
0 siblings, 2 replies; 30+ messages in thread
From: Eric Dumazet @ 2022-05-03 21:17 UTC (permalink / raw)
To: patchwork-bot+netdevbpf
Cc: Tetsuo Handa, Santosh Shilimkar, David Miller, Jakub Kicinski,
Paolo Abeni, syzbot, netdev, syzkaller-bugs, linux-rdma
On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
>
> Hello:
>
> This patch was applied to netdev/net.git (master)
> by Paolo Abeni <pabeni@redhat.com>:
>
> On Mon, 2 May 2022 10:40:18 +0900 you wrote:
> > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > refcount on net namespace. Since TCP's retransmission can happen after
> > a process which created net namespace terminated, we need to explicitly
> > acquire a refcount.
> >
> > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> >
> > [...]
>
> Here is the summary with links:
> - [v2] net: rds: acquire refcount on TCP sockets
> https://git.kernel.org/netdev/net/c/3a58f13a881e
>
> You are awesome, thank you!
> --
> Deet-doot-dot, I am a bot.
> https://korg.docs.kernel.org/patchwork/pwbot.html
>
>
I think we merged this patch too soon.
My question is : What prevents rds_tcp_conn_path_connect(), and thus
rds_tcp_tune() to be called
after the netns refcount already reached 0 ?
I guess we can wait for next syzbot report, but I think that get_net()
should be replaced
by maybe_get_net()
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 21:17 ` Eric Dumazet
@ 2022-05-03 22:37 ` Eric Dumazet
2022-05-04 1:04 ` Tetsuo Handa
2022-05-04 13:09 ` [PATCH v2] net: rds: acquire " Paolo Abeni
1 sibling, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2022-05-03 22:37 UTC (permalink / raw)
To: patchwork-bot+netdevbpf
Cc: Tetsuo Handa, Santosh Shilimkar, David Miller, Jakub Kicinski,
Paolo Abeni, syzbot, netdev, syzkaller-bugs, linux-rdma
On Tue, May 3, 2022 at 2:17 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
> >
> > Hello:
> >
> > This patch was applied to netdev/net.git (master)
> > by Paolo Abeni <pabeni@redhat.com>:
> >
> > On Mon, 2 May 2022 10:40:18 +0900 you wrote:
> > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > > refcount on net namespace. Since TCP's retransmission can happen after
> > > a process which created net namespace terminated, we need to explicitly
> > > acquire a refcount.
> > >
> > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > >
> > > [...]
> >
> > Here is the summary with links:
> > - [v2] net: rds: acquire refcount on TCP sockets
> > https://git.kernel.org/netdev/net/c/3a58f13a881e
> >
> > You are awesome, thank you!
> > --
> > Deet-doot-dot, I am a bot.
> > https://korg.docs.kernel.org/patchwork/pwbot.html
> >
> >
>
> I think we merged this patch too soon.
>
> My question is : What prevents rds_tcp_conn_path_connect(), and thus
> rds_tcp_tune() to be called
> after the netns refcount already reached 0 ?
>
> I guess we can wait for next syzbot report, but I think that get_net()
> should be replaced
> by maybe_get_net()
Yes, syzbot was fast to trigger this exact issue:
HEAD commit: 3a58f13a net: rds: acquire refcount on TCP sockets
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master
------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 1 PID: 6934 at lib/refcount.c:25
refcount_warn_saturate+0x169/0x1e0 lib/refcount.c:25
Modules linked in:
CPU: 1 PID: 6934 Comm: kworker/u4:17 Not tainted
5.18.0-rc4-syzkaller-00209-g3a58f13a881e #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Workqueue: krdsd rds_connect_worker
RIP: 0010:refcount_warn_saturate+0x169/0x1e0 lib/refcount.c:25
Code: 09 31 ff 89 de e8 f7 b9 81 fd 84 db 0f 85 36 ff ff ff e8 0a b6
81 fd 48 c7 c7 40 eb 26 8a c6 05 75 1f ac 09 01 e8 56 75 2d 05 <0f> 0b
e9 17 ff ff ff e8 eb b5 81 fd 0f b6 1d 5a 1f ac 09 31 ff 89
RSP: 0018:ffffc9000b5e7b80 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88807a948000 RSI: ffffffff81600c08 RDI: fffff520016bcf62
RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff815fb5de R11: 0000000000000000 R12: ffff888021e69b80
R13: ffff88805bc82a00 R14: ffff888021e69ccc R15: ffff8880741a2900
FS: 0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2cb5c000 CR3: 000000005688f000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__refcount_add include/linux/refcount.h:199 [inline]
__refcount_inc include/linux/refcount.h:250 [inline]
refcount_inc include/linux/refcount.h:267 [inline]
get_net include/net/net_namespace.h:248 [inline]
get_net_track include/net/net_namespace.h:334 [inline]
rds_tcp_tune+0x5a0/0x5f0 net/rds/tcp.c:503
rds_tcp_conn_path_connect+0x489/0x880 net/rds/tcp_connect.c:127
rds_connect_worker+0x1a5/0x2c0 net/rds/threads.c:176
process_one_work+0x996/0x1610 kernel/workqueue.c:2289
worker_thread+0x665/0x1080 kernel/workqueue.c:2436
kthread+0x2e9/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:298
</TASK>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 22:37 ` Eric Dumazet
@ 2022-05-04 1:04 ` Tetsuo Handa
2022-05-04 3:09 ` Eric Dumazet
0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2022-05-04 1:04 UTC (permalink / raw)
To: Eric Dumazet, patchwork-bot+netdevbpf
Cc: Santosh Shilimkar, David Miller, Jakub Kicinski, Paolo Abeni,
syzbot, netdev, syzkaller-bugs, linux-rdma
On 2022/05/04 7:37, Eric Dumazet wrote:
>> I think we merged this patch too soon.
>>
>> My question is : What prevents rds_tcp_conn_path_connect(), and thus
>> rds_tcp_tune() to be called
>> after the netns refcount already reached 0 ?
>>
>> I guess we can wait for next syzbot report, but I think that get_net()
>> should be replaced
>> by maybe_get_net()
>
> Yes, syzbot was fast to trigger this exact issue:
Does maybe_get_net() help?
Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically
possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d
if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set().
rds_tcp_conn_path_connect() {
sock_create_kern(net = rds_conn_net(conn)) {
__sock_create(net = rds_conn_net(conn), kern = 1) {
err = pf->create(net = rds_conn_net(conn), kern = 1) {
// pf->create is either inet_create or inet6_create
sk_alloc(net = rds_conn_net(conn), kern = 1) {
sk->sk_net_refcnt = kern ? 0 : 1;
if (likely(sk->sk_net_refcnt)) {
get_net_track(net, &sk->ns_tracker, priority);
sock_inuse_add(net, 1);
}
sock_net_set(sk, net);
}
}
}
}
rds_tcp_tune() {
if (!sk->sk_net_refcnt) {
sk->sk_net_refcnt = 1;
get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
sock_inuse_add(net, 1);
}
}
}
"struct rds_connection" needs to hold a ref in order to safely allow
rds_tcp_tune() to call maybe_get_net(), which in turn makes pointless
to use maybe_get_net() from rds_tcp_tune() because "struct rds_connection"
must have a ref. Situation where we are protected by maybe_get_net() is
quite limited if long-lived object is not holding a ref.
Hmm, can we simply use &init_net instead of rds_conn_net(conn) ?
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-04 1:04 ` Tetsuo Handa
@ 2022-05-04 3:09 ` Eric Dumazet
2022-05-04 4:58 ` Tetsuo Handa
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2022-05-04 3:09 UTC (permalink / raw)
To: Tetsuo Handa
Cc: patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller,
Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs,
linux-rdma
On Tue, May 3, 2022 at 6:04 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2022/05/04 7:37, Eric Dumazet wrote:
> >> I think we merged this patch too soon.
> >>
> >> My question is : What prevents rds_tcp_conn_path_connect(), and thus
> >> rds_tcp_tune() to be called
> >> after the netns refcount already reached 0 ?
> >>
> >> I guess we can wait for next syzbot report, but I think that get_net()
> >> should be replaced
> >> by maybe_get_net()
> >
> > Yes, syzbot was fast to trigger this exact issue:
>
> Does maybe_get_net() help?
>
> Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically
> possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d
> if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set().
Nope. RDS has an exit() handler called from cleanup_net()
(struct pernet_operations)->exit() or exit_batch() :
rds_tcp_exit_net() (rds_tcp_kill_sock())
This exit() handler _has_ to remove all known listeners, and
definitely cancel work queues (synchronous operation)
before the actual "struct net" free can happen later.
>
> rds_tcp_conn_path_connect() {
> sock_create_kern(net = rds_conn_net(conn)) {
> __sock_create(net = rds_conn_net(conn), kern = 1) {
> err = pf->create(net = rds_conn_net(conn), kern = 1) {
> // pf->create is either inet_create or inet6_create
> sk_alloc(net = rds_conn_net(conn), kern = 1) {
> sk->sk_net_refcnt = kern ? 0 : 1;
> if (likely(sk->sk_net_refcnt)) {
> get_net_track(net, &sk->ns_tracker, priority);
> sock_inuse_add(net, 1);
> }
> sock_net_set(sk, net);
> }
> }
> }
> }
> rds_tcp_tune() {
> if (!sk->sk_net_refcnt) {
> sk->sk_net_refcnt = 1;
> get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> sock_inuse_add(net, 1);
> }
> }
> }
>
> "struct rds_connection" needs to hold a ref in order to safely allow
> rds_tcp_tune() to call maybe_get_net(), which in turn makes pointless
> to use maybe_get_net() from rds_tcp_tune() because "struct rds_connection"
> must have a ref. Situation where we are protected by maybe_get_net() is
> quite limited if long-lived object is not holding a ref.
>
> Hmm, can we simply use &init_net instead of rds_conn_net(conn) ?
Only if you plan making RDS unavailable for non init netns.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-04 3:09 ` Eric Dumazet
@ 2022-05-04 4:58 ` Tetsuo Handa
2022-05-04 15:15 ` Tetsuo Handa
0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2022-05-04 4:58 UTC (permalink / raw)
To: Eric Dumazet
Cc: patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller,
Jakub Kicinski, Paolo Abeni, syzbot, netdev, syzkaller-bugs,
linux-rdma
On 2022/05/04 12:09, Eric Dumazet wrote:
>> Does maybe_get_net() help?
>>
>> Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically
>> possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d
>> if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set().
>
> Nope. RDS has an exit() handler called from cleanup_net()
>
> (struct pernet_operations)->exit() or exit_batch() :
> rds_tcp_exit_net() (rds_tcp_kill_sock())
Hmm, when put_net() called __put_net(), this "struct net" is chained to cleanup_list.
When cleanup_net() is called via net_cleanup_work, rds_tcp_exit_net() is called from
ops_exit_list(). Therefore, we can call maybe_get_net() until rds_tcp_exit_net() returns.
That's good.
>
> This exit() handler _has_ to remove all known listeners, and
> definitely cancel work queues (synchronous operation)
> before the actual "struct net" free can happen later.
But in your report, rds_tcp_tune() is called from rds_tcp_conn_path_connect() from
rds_connect_worker() via "struct rds_connection"->cp_conn_w work. I can see that
rds_tcp_kill_sock() calls rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w), and
rds_tcp_listen_stop() calls flush_workqueue(rds_wq) and flush_work(&rtn->rds_tcp_accept_w).
But I can't see how rds_tcp_exit_net() synchronously cancels all works associated
with "struct rds_conn_path".
struct rds_conn_path {
struct delayed_work cp_send_w;
struct delayed_work cp_recv_w;
struct delayed_work cp_conn_w;
struct work_struct cp_down_w;
}
These works are queued to rds_wq, but flush_workqueue() waits for completion only
if already queued. What if timer for queue_delayed_work() has not expired, or was
about to call queue_delayed_work() ? Is flush_workqueue(rds_wq) sufficient?
Anyway, if rds_tcp_kill_sock() can somehow guarantee that all works are completed
or cancelled, the fix would look like something below?
net/rds/tcp.c | 11 ++++++++---
net/rds/tcp.h | 2 +-
net/rds/tcp_connect.c | 5 ++++-
net/rds/tcp_listen.c | 5 ++++-
4 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 2f638f8b7b1e..8e26bcf02044 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -487,11 +487,11 @@ struct rds_tcp_net {
/* All module specific customizations to the RDS-TCP socket should be done in
* rds_tcp_tune() and applied after socket creation.
*/
-void rds_tcp_tune(struct socket *sock)
+bool rds_tcp_tune(struct socket *sock)
{
struct sock *sk = sock->sk;
struct net *net = sock_net(sk);
- struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+ struct rds_tcp_net *rtn;
tcp_sock_set_nodelay(sock->sk);
lock_sock(sk);
@@ -499,10 +499,14 @@ void rds_tcp_tune(struct socket *sock)
* a process which created this net namespace terminated.
*/
if (!sk->sk_net_refcnt) {
+ if (!maybe_get_net(net)) {
+ release_sock(sk);
+ return false;
+ }
sk->sk_net_refcnt = 1;
- get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
sock_inuse_add(net, 1);
}
+ rtn = net_generic(net, rds_tcp_netid);
if (rtn->sndbuf_size > 0) {
sk->sk_sndbuf = rtn->sndbuf_size;
sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
@@ -512,6 +516,7 @@ void rds_tcp_tune(struct socket *sock)
sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
}
release_sock(sk);
+ return true;
}
static void rds_tcp_accept_worker(struct work_struct *work)
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index dc8d745d6857..f8b5930d7b34 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -49,7 +49,7 @@ struct rds_tcp_statistics {
};
/* tcp.c */
-void rds_tcp_tune(struct socket *sock);
+bool rds_tcp_tune(struct socket *sock);
void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp);
void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp);
void rds_tcp_restore_callbacks(struct socket *sock,
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index 5461d77fff4f..f0c477c5d1db 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -124,7 +124,10 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
if (ret < 0)
goto out;
- rds_tcp_tune(sock);
+ if (!rds_tcp_tune(sock)) {
+ ret = -EINVAL;
+ goto out;
+ }
if (isv6) {
sin6.sin6_family = AF_INET6;
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 09cadd556d1e..7edf2e69d3fe 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -133,7 +133,10 @@ int rds_tcp_accept_one(struct socket *sock)
__module_get(new_sock->ops->owner);
rds_tcp_keepalive(new_sock);
- rds_tcp_tune(new_sock);
+ if (!rds_tcp_tune(new_sock)) {
+ ret = -EINVAL;
+ goto out;
+ }
inet = inet_sk(new_sock->sk);
--
2.34.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-03 21:17 ` Eric Dumazet
2022-05-03 22:37 ` Eric Dumazet
@ 2022-05-04 13:09 ` Paolo Abeni
2022-05-04 13:25 ` Eric Dumazet
1 sibling, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2022-05-04 13:09 UTC (permalink / raw)
To: Eric Dumazet, patchwork-bot+netdevbpf
Cc: Tetsuo Handa, Santosh Shilimkar, David Miller, Jakub Kicinski,
syzbot, netdev, syzkaller-bugs, linux-rdma
On Tue, 2022-05-03 at 14:17 -0700, Eric Dumazet wrote:
> On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
> >
> > Hello:
> >
> > This patch was applied to netdev/net.git (master)
> > by Paolo Abeni <pabeni@redhat.com>:
> >
> > On Mon, 2 May 2022 10:40:18 +0900 you wrote:
> > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > > refcount on net namespace. Since TCP's retransmission can happen after
> > > a process which created net namespace terminated, we need to explicitly
> > > acquire a refcount.
> > >
> > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > >
> > > [...]
> >
> > Here is the summary with links:
> > - [v2] net: rds: acquire refcount on TCP sockets
> > https://git.kernel.org/netdev/net/c/3a58f13a881e
> >
> > You are awesome, thank you!
> > --
> > Deet-doot-dot, I am a bot.
> > https://korg.docs.kernel.org/patchwork/pwbot.html
> >
> >
>
> I think we merged this patch too soon.
My fault.
> My question is : What prevents rds_tcp_conn_path_connect(), and thus
> rds_tcp_tune() to be called
> after the netns refcount already reached 0 ?
>
> I guess we can wait for next syzbot report, but I think that get_net()
> should be replaced
> by maybe_get_net()
>
Should we revert this patch before the next pull request, if a suitable
incremental fix is not available by then?
It looks like the window of opportunity for the race is roughly the
same?
Thanks!
Paolo
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-04 13:09 ` [PATCH v2] net: rds: acquire " Paolo Abeni
@ 2022-05-04 13:25 ` Eric Dumazet
0 siblings, 0 replies; 30+ messages in thread
From: Eric Dumazet @ 2022-05-04 13:25 UTC (permalink / raw)
To: Paolo Abeni
Cc: patchwork-bot+netdevbpf, Tetsuo Handa, Santosh Shilimkar,
David Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs,
linux-rdma
On Wed, May 4, 2022 at 6:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Tue, 2022-05-03 at 14:17 -0700, Eric Dumazet wrote:
> > On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
> > >
> > > Hello:
> > >
> > > This patch was applied to netdev/net.git (master)
> > > by Paolo Abeni <pabeni@redhat.com>:
> > >
> > > On Mon, 2 May 2022 10:40:18 +0900 you wrote:
> > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > > > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > > > refcount on net namespace. Since TCP's retransmission can happen after
> > > > a process which created net namespace terminated, we need to explicitly
> > > > acquire a refcount.
> > > >
> > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > >
> > > > [...]
> > >
> > > Here is the summary with links:
> > > - [v2] net: rds: acquire refcount on TCP sockets
> > > https://git.kernel.org/netdev/net/c/3a58f13a881e
> > >
> > > You are awesome, thank you!
> > > --
> > > Deet-doot-dot, I am a bot.
> > > https://korg.docs.kernel.org/patchwork/pwbot.html
> > >
> > >
> >
> > I think we merged this patch too soon.
>
> My fault.
>
>
> > My question is : What prevents rds_tcp_conn_path_connect(), and thus
> > rds_tcp_tune() to be called
> > after the netns refcount already reached 0 ?
> >
> > I guess we can wait for next syzbot report, but I think that get_net()
> > should be replaced
> > by maybe_get_net()
> >
> Should we revert this patch before the next pull request, if a suitable
> incremental fix is not available by then?
>
> It looks like the window of opportunity for the race is roughly the
> same?
>
No need to revert the patch, we certainly are in a better situation,
as refcount_t helps here.
We can refine the logic in a followup.
Thanks.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] net: rds: acquire refcount on TCP sockets
2022-05-04 4:58 ` Tetsuo Handa
@ 2022-05-04 15:15 ` Tetsuo Handa
2022-05-05 0:45 ` [PATCH] net: rds: use maybe_get_net() when acquiring " Tetsuo Handa
0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2022-05-04 15:15 UTC (permalink / raw)
To: Eric Dumazet, Paolo Abeni
Cc: patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller,
Jakub Kicinski, syzbot, netdev, syzkaller-bugs, linux-rdma
On 2022/05/04 13:58, Tetsuo Handa wrote:
> On 2022/05/04 12:09, Eric Dumazet wrote:
>> This exit() handler _has_ to remove all known listeners, and
>> definitely cancel work queues (synchronous operation)
>> before the actual "struct net" free can happen later.
>
> But in your report, rds_tcp_tune() is called from rds_tcp_conn_path_connect() from
> rds_connect_worker() via "struct rds_connection"->cp_conn_w work. I can see that
> rds_tcp_kill_sock() calls rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w), and
> rds_tcp_listen_stop() calls flush_workqueue(rds_wq) and flush_work(&rtn->rds_tcp_accept_w).
>
> But I can't see how rds_tcp_exit_net() synchronously cancels all works associated
> with "struct rds_conn_path".
>
> struct rds_conn_path {
> struct delayed_work cp_send_w;
> struct delayed_work cp_recv_w;
> struct delayed_work cp_conn_w;
> struct work_struct cp_down_w;
> }
>
> These works are queued to rds_wq, but flush_workqueue() waits for completion only
> if already queued. What if timer for queue_delayed_work() has not expired, or was
> about to call queue_delayed_work() ? Is flush_workqueue(rds_wq) sufficient?
rds_tcp_tune+0x5a0/0x5f0 net/rds/tcp.c:503
rds_tcp_conn_path_connect+0x489/0x880 net/rds/tcp_connect.c:127
rds_connect_worker+0x1a5/0x2c0 net/rds/threads.c:176
process_one_work+0x996/0x1610 kernel/workqueue.c:2289
rds_tcp_conn_path_connect is referenced by
"struct rds_transport rds_tcp_transport"->conn_path_connect.
It is invoked by
ret = conn->c_trans->conn_path_connect(cp)
in rds_connect_worker().
rds_connect_worker is referenced by "struct rds_conn_path"->cp_conn_w
via INIT_DELAYED_WORK().
queue_delayed_work(rds_wq, &cp->cp_conn_w, *) is called by
rds_queue_reconnect() or rds_conn_path_connect_if_down().
If rds_conn_path_connect_if_down() were called from
rds_tcp_accept_one_path() from rds_tcp_accept_one(),
rds_tcp_tune() from rds_tcp_accept_one() was already called
before rds_tcp_tune() from rds_tcp_conn_path_connect() is called.
Since the addition on 0 was not reported at rds_tcp_tune() from
rds_tcp_accept_one(), what Eric is reporting cannot be from
rds_tcp_accept_one() from rds_tcp_accept_worker().
Despite rds_tcp_kill_sock() sets rtn->rds_tcp_listen_sock = NULL and
waits for rds_tcp_accept_one() from rds_tcp_accept_worker() to complete
using flush_workqueue(rds_wq), what Eric is reporting is different from
what syzbot+694120e1002c117747ed was reporting.
>
> Anyway, if rds_tcp_kill_sock() can somehow guarantee that all works are completed
> or cancelled, the fix would look like something below?
I think it is OK to apply below diff in order to avoid addition on 0 problem, but
it is not proven that kmem_cache_free() is not yet called. What should we do?
>
> net/rds/tcp.c | 11 ++++++++---
> net/rds/tcp.h | 2 +-
> net/rds/tcp_connect.c | 5 ++++-
> net/rds/tcp_listen.c | 5 ++++-
> 4 files changed, 17 insertions(+), 6 deletions(-)
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets
2022-05-04 15:15 ` Tetsuo Handa
@ 2022-05-05 0:45 ` Tetsuo Handa
2022-05-05 0:53 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: Tetsuo Handa @ 2022-05-05 0:45 UTC (permalink / raw)
To: Eric Dumazet, Paolo Abeni
Cc: patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller,
Jakub Kicinski, syzbot, netdev, syzkaller-bugs, linux-rdma
Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for
delayed works queued in rds_wq might be invoked after a net namespace's
refcount already reached 0.
Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
it is guaranteed that we can instead use maybe_get_net() from delayed work
functions until rds_tcp_exit_net() returns.
Note that I'm not convinced that all works which might access a net
namespace are already queued in rds_wq by the moment rds_tcp_exit_net()
calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net()
will fail to wait for work functions, and kmem_cache_free() could be
called from net_free() before maybe_get_net() is called from
rds_tcp_tune().
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 3a58f13a881ed351 ("net: rds: acquire refcount on TCP sockets")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
net/rds/tcp.c | 11 ++++++++---
net/rds/tcp.h | 2 +-
net/rds/tcp_connect.c | 5 ++++-
net/rds/tcp_listen.c | 5 ++++-
4 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 2f638f8b7b1e..8e26bcf02044 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -487,11 +487,11 @@ struct rds_tcp_net {
/* All module specific customizations to the RDS-TCP socket should be done in
* rds_tcp_tune() and applied after socket creation.
*/
-void rds_tcp_tune(struct socket *sock)
+bool rds_tcp_tune(struct socket *sock)
{
struct sock *sk = sock->sk;
struct net *net = sock_net(sk);
- struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+ struct rds_tcp_net *rtn;
tcp_sock_set_nodelay(sock->sk);
lock_sock(sk);
@@ -499,10 +499,14 @@ void rds_tcp_tune(struct socket *sock)
* a process which created this net namespace terminated.
*/
if (!sk->sk_net_refcnt) {
+ if (!maybe_get_net(net)) {
+ release_sock(sk);
+ return false;
+ }
sk->sk_net_refcnt = 1;
- get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
sock_inuse_add(net, 1);
}
+ rtn = net_generic(net, rds_tcp_netid);
if (rtn->sndbuf_size > 0) {
sk->sk_sndbuf = rtn->sndbuf_size;
sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
@@ -512,6 +516,7 @@ void rds_tcp_tune(struct socket *sock)
sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
}
release_sock(sk);
+ return true;
}
static void rds_tcp_accept_worker(struct work_struct *work)
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index dc8d745d6857..f8b5930d7b34 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -49,7 +49,7 @@ struct rds_tcp_statistics {
};
/* tcp.c */
-void rds_tcp_tune(struct socket *sock);
+bool rds_tcp_tune(struct socket *sock);
void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp);
void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp);
void rds_tcp_restore_callbacks(struct socket *sock,
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index 5461d77fff4f..f0c477c5d1db 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -124,7 +124,10 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
if (ret < 0)
goto out;
- rds_tcp_tune(sock);
+ if (!rds_tcp_tune(sock)) {
+ ret = -EINVAL;
+ goto out;
+ }
if (isv6) {
sin6.sin6_family = AF_INET6;
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 09cadd556d1e..7edf2e69d3fe 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -133,7 +133,10 @@ int rds_tcp_accept_one(struct socket *sock)
__module_get(new_sock->ops->owner);
rds_tcp_keepalive(new_sock);
- rds_tcp_tune(new_sock);
+ if (!rds_tcp_tune(new_sock)) {
+ ret = -EINVAL;
+ goto out;
+ }
inet = inet_sk(new_sock->sk);
--
2.34.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets
2022-05-05 0:45 ` [PATCH] net: rds: use maybe_get_net() when acquiring " Tetsuo Handa
@ 2022-05-05 0:53 ` Eric Dumazet
2022-05-05 1:04 ` Jakub Kicinski
2022-05-05 1:53 ` [PATCH net v2] " Tetsuo Handa
2 siblings, 0 replies; 30+ messages in thread
From: Eric Dumazet @ 2022-05-05 0:53 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Paolo Abeni, patchwork-bot+netdevbpf, Santosh Shilimkar,
David Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs,
linux-rdma
On Wed, May 4, 2022 at 5:45 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for
> delayed works queued in rds_wq might be invoked after a net namespace's
> refcount already reached 0.
>
> Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
> it is guaranteed that we can instead use maybe_get_net() from delayed work
> functions until rds_tcp_exit_net() returns.
>
> Note that I'm not convinced that all works which might access a net
> namespace are already queued in rds_wq by the moment rds_tcp_exit_net()
> calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net()
> will fail to wait for work functions, and kmem_cache_free() could be
> called from net_free() before maybe_get_net() is called from
> rds_tcp_tune().
>
> Reported-by: Eric Dumazet <edumazet@google.com>
> Fixes: 3a58f13a881ed351 ("net: rds: acquire refcount on TCP sockets")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
> net/rds/tcp.c | 11 ++++++++---
> net/rds/tcp.h | 2 +-
> net/rds/tcp_connect.c | 5 ++++-
> net/rds/tcp_listen.c | 5 ++++-
> 4 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 2f638f8b7b1e..8e26bcf02044 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -487,11 +487,11 @@ struct rds_tcp_net {
> /* All module specific customizations to the RDS-TCP socket should be done in
> * rds_tcp_tune() and applied after socket creation.
> */
> -void rds_tcp_tune(struct socket *sock)
> +bool rds_tcp_tune(struct socket *sock)
> {
> struct sock *sk = sock->sk;
> struct net *net = sock_net(sk);
> - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> + struct rds_tcp_net *rtn;
>
> tcp_sock_set_nodelay(sock->sk);
> lock_sock(sk);
> @@ -499,10 +499,14 @@ void rds_tcp_tune(struct socket *sock)
> * a process which created this net namespace terminated.
> */
> if (!sk->sk_net_refcnt) {
> + if (!maybe_get_net(net)) {
> + release_sock(sk);
> + return false;
> + }
> sk->sk_net_refcnt = 1;
> - get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
This could use:
netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL);
> sock_inuse_add(net, 1);
> }
> + rtn = net_generic(net, rds_tcp_netid);
> if (rtn->sndbuf_size > 0) {
> sk->sk_sndbuf = rtn->sndbuf_size;
> sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> @@ -512,6 +516,7 @@ void rds_tcp_tune(struct socket *sock)
> sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> }
> release_sock(sk);
> + return true;
> }
>
Otherwise, patch looks good to me, thanks.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets
2022-05-05 0:45 ` [PATCH] net: rds: use maybe_get_net() when acquiring " Tetsuo Handa
2022-05-05 0:53 ` Eric Dumazet
@ 2022-05-05 1:04 ` Jakub Kicinski
2022-05-05 1:53 ` [PATCH net v2] " Tetsuo Handa
2 siblings, 0 replies; 30+ messages in thread
From: Jakub Kicinski @ 2022-05-05 1:04 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Eric Dumazet, Paolo Abeni, patchwork-bot+netdevbpf,
Santosh Shilimkar, David Miller, syzbot, netdev, syzkaller-bugs,
linux-rdma
On Thu, 5 May 2022 09:45:49 +0900 Tetsuo Handa wrote:
> Subject: [PATCH] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets
Please tag the next version as [PATCH net v2], and make sure it applies
cleanly on top of net/master, 'cause reportedly this one didn't?
https://patchwork.kernel.org/project/netdevbpf/patch/63dab11e-2aeb-5608-6dcb-6ebc3e98056e@I-love.SAKURA.ne.jp/
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH net v2] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets
2022-05-05 0:45 ` [PATCH] net: rds: use maybe_get_net() when acquiring " Tetsuo Handa
2022-05-05 0:53 ` Eric Dumazet
2022-05-05 1:04 ` Jakub Kicinski
@ 2022-05-05 1:53 ` Tetsuo Handa
2022-05-05 19:13 ` Eric Dumazet
2022-05-06 1:20 ` patchwork-bot+netdevbpf
2 siblings, 2 replies; 30+ messages in thread
From: Tetsuo Handa @ 2022-05-05 1:53 UTC (permalink / raw)
To: Eric Dumazet, Paolo Abeni
Cc: patchwork-bot+netdevbpf, Santosh Shilimkar, David Miller,
Jakub Kicinski, syzbot, netdev, syzkaller-bugs, linux-rdma
Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for
delayed works queued in rds_wq might be invoked after a net namespace's
refcount already reached 0.
Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
it is guaranteed that we can instead use maybe_get_net() from delayed work
functions until rds_tcp_exit_net() returns.
Note that I'm not convinced that all works which might access a net
namespace are already queued in rds_wq by the moment rds_tcp_exit_net()
calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net()
will fail to wait for work functions, and kmem_cache_free() could be
called from net_free() before maybe_get_net() is called from
rds_tcp_tune().
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 3a58f13a881ed351 ("net: rds: acquire refcount on TCP sockets")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
Changes in v2:
Add netns_tracker_alloc().
net/rds/tcp.c | 12 +++++++++---
net/rds/tcp.h | 2 +-
net/rds/tcp_connect.c | 5 ++++-
net/rds/tcp_listen.c | 5 ++++-
4 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 2f638f8b7b1e..73ee2771093d 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -487,11 +487,11 @@ struct rds_tcp_net {
/* All module specific customizations to the RDS-TCP socket should be done in
* rds_tcp_tune() and applied after socket creation.
*/
-void rds_tcp_tune(struct socket *sock)
+bool rds_tcp_tune(struct socket *sock)
{
struct sock *sk = sock->sk;
struct net *net = sock_net(sk);
- struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+ struct rds_tcp_net *rtn;
tcp_sock_set_nodelay(sock->sk);
lock_sock(sk);
@@ -499,10 +499,15 @@ void rds_tcp_tune(struct socket *sock)
* a process which created this net namespace terminated.
*/
if (!sk->sk_net_refcnt) {
+ if (!maybe_get_net(net)) {
+ release_sock(sk);
+ return false;
+ }
sk->sk_net_refcnt = 1;
- get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
+ netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL);
sock_inuse_add(net, 1);
}
+ rtn = net_generic(net, rds_tcp_netid);
if (rtn->sndbuf_size > 0) {
sk->sk_sndbuf = rtn->sndbuf_size;
sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
@@ -512,6 +517,7 @@ void rds_tcp_tune(struct socket *sock)
sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
}
release_sock(sk);
+ return true;
}
static void rds_tcp_accept_worker(struct work_struct *work)
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index dc8d745d6857..f8b5930d7b34 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -49,7 +49,7 @@ struct rds_tcp_statistics {
};
/* tcp.c */
-void rds_tcp_tune(struct socket *sock);
+bool rds_tcp_tune(struct socket *sock);
void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp);
void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp);
void rds_tcp_restore_callbacks(struct socket *sock,
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index 5461d77fff4f..f0c477c5d1db 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -124,7 +124,10 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
if (ret < 0)
goto out;
- rds_tcp_tune(sock);
+ if (!rds_tcp_tune(sock)) {
+ ret = -EINVAL;
+ goto out;
+ }
if (isv6) {
sin6.sin6_family = AF_INET6;
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 09cadd556d1e..7edf2e69d3fe 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -133,7 +133,10 @@ int rds_tcp_accept_one(struct socket *sock)
__module_get(new_sock->ops->owner);
rds_tcp_keepalive(new_sock);
- rds_tcp_tune(new_sock);
+ if (!rds_tcp_tune(new_sock)) {
+ ret = -EINVAL;
+ goto out;
+ }
inet = inet_sk(new_sock->sk);
--
2.34.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH net v2] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets
2022-05-05 1:53 ` [PATCH net v2] " Tetsuo Handa
@ 2022-05-05 19:13 ` Eric Dumazet
2022-05-06 1:20 ` patchwork-bot+netdevbpf
1 sibling, 0 replies; 30+ messages in thread
From: Eric Dumazet @ 2022-05-05 19:13 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Paolo Abeni, patchwork-bot+netdevbpf, Santosh Shilimkar,
David Miller, Jakub Kicinski, syzbot, netdev, syzkaller-bugs,
linux-rdma
On Wed, May 4, 2022 at 6:54 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for
> delayed works queued in rds_wq might be invoked after a net namespace's
> refcount already reached 0.
>
> Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
> it is guaranteed that we can instead use maybe_get_net() from delayed work
> functions until rds_tcp_exit_net() returns.
>
> Note that I'm not convinced that all works which might access a net
> namespace are already queued in rds_wq by the moment rds_tcp_exit_net()
> calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net()
> will fail to wait for work functions, and kmem_cache_free() could be
> called from net_free() before maybe_get_net() is called from
> rds_tcp_tune().
>
> Reported-by: Eric Dumazet <edumazet@google.com>
> Fixes: 3a58f13a881ed351 ("net: rds: acquire refcount on TCP sockets")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net v2] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets
2022-05-05 1:53 ` [PATCH net v2] " Tetsuo Handa
2022-05-05 19:13 ` Eric Dumazet
@ 2022-05-06 1:20 ` patchwork-bot+netdevbpf
1 sibling, 0 replies; 30+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-05-06 1:20 UTC (permalink / raw)
To: Tetsuo Handa
Cc: edumazet, pabeni, patchwork-bot+netdevbpf, santosh.shilimkar,
davem, kuba, syzbot+694120e1002c117747ed, netdev, syzkaller-bugs,
linux-rdma
Hello:
This patch was applied to netdev/net.git (master)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 5 May 2022 10:53:53 +0900 you wrote:
> Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for
> delayed works queued in rds_wq might be invoked after a net namespace's
> refcount already reached 0.
>
> Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
> it is guaranteed that we can instead use maybe_get_net() from delayed work
> functions until rds_tcp_exit_net() returns.
>
> [...]
Here is the summary with links:
- [net,v2] net: rds: use maybe_get_net() when acquiring refcount on TCP sockets
https://git.kernel.org/netdev/net/c/6997fbd7a3da
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2022-05-06 1:20 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <00000000000045dc96059f4d7b02@google.com>
[not found] ` <000000000000f75af905d3ba0716@google.com>
[not found] ` <c389e47f-8f82-fd62-8c1d-d9481d2f71ff@I-love.SAKURA.ne.jp>
2022-04-22 14:40 ` [syzbot] KASAN: use-after-free Read in tcp_retransmit_timer (5) Tetsuo Handa
2022-04-24 3:57 ` Tetsuo Handa
2022-05-01 15:29 ` [PATCH] net: rds: acquire refcount on TCP sockets Tetsuo Handa
2022-05-01 16:14 ` Eric Dumazet
2022-05-02 1:40 ` [PATCH v2] " Tetsuo Handa
2022-05-02 14:12 ` Haakon Bugge
2022-05-02 14:29 ` Tetsuo Handa
2022-05-03 9:02 ` Paolo Abeni
2022-05-03 9:56 ` Tetsuo Handa
2022-05-03 11:10 ` Paolo Abeni
2022-05-03 13:27 ` David Laight
2022-05-03 13:43 ` Eric Dumazet
2022-05-03 14:25 ` David Laight
2022-05-03 13:45 ` Eric Dumazet
2022-05-03 14:08 ` Tetsuo Handa
2022-05-03 11:40 ` patchwork-bot+netdevbpf
2022-05-03 21:17 ` Eric Dumazet
2022-05-03 22:37 ` Eric Dumazet
2022-05-04 1:04 ` Tetsuo Handa
2022-05-04 3:09 ` Eric Dumazet
2022-05-04 4:58 ` Tetsuo Handa
2022-05-04 15:15 ` Tetsuo Handa
2022-05-05 0:45 ` [PATCH] net: rds: use maybe_get_net() when acquiring " Tetsuo Handa
2022-05-05 0:53 ` Eric Dumazet
2022-05-05 1:04 ` Jakub Kicinski
2022-05-05 1:53 ` [PATCH net v2] " Tetsuo Handa
2022-05-05 19:13 ` Eric Dumazet
2022-05-06 1:20 ` patchwork-bot+netdevbpf
2022-05-04 13:09 ` [PATCH v2] net: rds: acquire " Paolo Abeni
2022-05-04 13:25 ` Eric Dumazet
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).