* [PATCH net-next 0/8] net: more data-races fixes and lockless socket options
From: Eric Dumazet @ 2023-09-21 20:28 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet
This is yet another round of data-races fixes,
and lockless socket options.
Eric Dumazet (8):
net: implement lockless SO_PRIORITY
net: lockless SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC
net: lockless SO_{TYPE|PROTOCOL|DOMAIN|ERROR } setsockopt()
net: lockless implementation of SO_BUSY_POLL, SO_PREFER_BUSY_POLL,
SO_BUSY_POLL_BUDGET
net: implement lockless SO_MAX_PACING_RATE
net: lockless implementation of SO_TXREHASH
net: annotate data-races around sk->sk_tx_queue_mapping
net: annotate data-races around sk->sk_dst_pending_confirm
drivers/net/ppp/pppoe.c | 2 +-
include/net/bluetooth/bluetooth.h | 2 +-
include/net/sock.h | 26 +++--
include/trace/events/mptcp.h | 2 +-
net/appletalk/aarp.c | 2 +-
net/ax25/af_ax25.c | 2 +-
net/bluetooth/l2cap_sock.c | 2 +-
net/can/j1939/socket.c | 2 +-
net/can/raw.c | 2 +-
net/core/sock.c | 163 ++++++++++++++----------------
net/dccp/ipv6.c | 2 +-
net/ipv4/inet_diag.c | 2 +-
net/ipv4/ip_output.c | 2 +-
net/ipv4/tcp_bbr.c | 13 +--
net/ipv4/tcp_input.c | 4 +-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_minisocks.c | 2 +-
net/ipv4/tcp_output.c | 11 +-
net/ipv6/inet6_connection_sock.c | 2 +-
net/ipv6/ip6_output.c | 2 +-
net/ipv6/tcp_ipv6.c | 4 +-
net/mptcp/sockopt.c | 2 +-
net/netrom/af_netrom.c | 2 +-
net/rose/af_rose.c | 2 +-
net/sched/em_meta.c | 2 +-
net/sched/sch_fq.c | 2 +-
net/sctp/ipv6.c | 2 +-
net/smc/af_smc.c | 2 +-
net/x25/af_x25.c | 2 +-
net/xdp/xsk.c | 2 +-
30 files changed, 138 insertions(+), 131 deletions(-)
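Across the series the structural change is the same: options whose handling reduces
to a single annotated store are dispatched in sk_setsockopt() before
sockopt_lock_sock() and return early, so the socket lock is never taken for them.
A rough standalone analogue follows (userspace C, not kernel code; all names below
are made up for illustration):

#include <pthread.h>
#include <stdatomic.h>

struct fake_sock {
    pthread_mutex_t lock;     /* stands in for lock_sock()/release_sock() */
    _Atomic int priority;     /* stands in for a READ_ONCE()/WRITE_ONCE() field */
    int field_needing_lock;
};

enum { OPT_PRIORITY, OPT_NEEDS_LOCK };

static int fake_setsockopt(struct fake_sock *sk, int optname, int val)
{
    /* handle options which do not require locking the socket */
    switch (optname) {
    case OPT_PRIORITY:
        atomic_store_explicit(&sk->priority, val, memory_order_relaxed);
        return 0;
    }

    pthread_mutex_lock(&sk->lock);
    switch (optname) {
    case OPT_NEEDS_LOCK:
        sk->field_needing_lock = val;
        break;
    }
    pthread_mutex_unlock(&sk->lock);
    return 0;
}

int main(void)
{
    struct fake_sock sk = { .lock = PTHREAD_MUTEX_INITIALIZER };

    return fake_setsockopt(&sk, OPT_PRIORITY, 4); /* never touches sk.lock */
}
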
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH net-next 1/8] net: implement lockless SO_PRIORITY
From: Eric Dumazet @ 2023-09-21 20:28 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet
This is a followup of 8bf43be799d4 ("net: annotate data-races
around sk->sk_priority").
sk->sk_priority can be read and written without holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
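For reference, a minimal userspace sketch (not part of the patch) of the call whose
kernel-side handling becomes lockless here; priorities 0..6 are accepted without
CAP_NET_RAW or CAP_NET_ADMIN:

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int prio = 4;   /* 0..6 allowed for unprivileged callers */

    if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio)) < 0)
        perror("setsockopt(SO_PRIORITY)");
    return 0;
}
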
drivers/net/ppp/pppoe.c | 2 +-
include/net/bluetooth/bluetooth.h | 2 +-
net/appletalk/aarp.c | 2 +-
net/ax25/af_ax25.c | 2 +-
net/bluetooth/l2cap_sock.c | 2 +-
net/can/j1939/socket.c | 2 +-
net/can/raw.c | 2 +-
net/core/sock.c | 23 ++++++++++++-----------
net/dccp/ipv6.c | 2 +-
net/ipv4/inet_diag.c | 2 +-
net/ipv4/ip_output.c | 2 +-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_minisocks.c | 2 +-
net/ipv6/inet6_connection_sock.c | 2 +-
net/ipv6/ip6_output.c | 2 +-
net/ipv6/tcp_ipv6.c | 4 ++--
net/mptcp/sockopt.c | 2 +-
net/netrom/af_netrom.c | 2 +-
net/rose/af_rose.c | 2 +-
net/sched/em_meta.c | 2 +-
net/sctp/ipv6.c | 2 +-
net/smc/af_smc.c | 2 +-
net/x25/af_x25.c | 2 +-
net/xdp/xsk.c | 2 +-
24 files changed, 36 insertions(+), 35 deletions(-)
diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index ba8b6bd8233cad65dc96c94fb461a6eb31d85fa1..8e7238e97d0a71708ebcddda9b1e1a50ab28c17d 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -877,7 +877,7 @@ static int pppoe_sendmsg(struct socket *sock, struct msghdr *m,
skb->dev = dev;
- skb->priority = sk->sk_priority;
+ skb->priority = READ_ONCE(sk->sk_priority);
skb->protocol = cpu_to_be16(ETH_P_PPP_SES);
ph = skb_put(skb, total_len + sizeof(struct pppoe_hdr));
diff --git a/include/net/bluetooth/bluetooth.h b/include/net/bluetooth/bluetooth.h
index aa90adc3b2a4d7b8dab5759bd5392c164e238d37..7ffa8c192c3f2eecea9dac1073af4853f59fadc7 100644
--- a/include/net/bluetooth/bluetooth.h
+++ b/include/net/bluetooth/bluetooth.h
@@ -541,7 +541,7 @@ static inline struct sk_buff *bt_skb_sendmsg(struct sock *sk,
return ERR_PTR(-EFAULT);
}
- skb->priority = sk->sk_priority;
+ skb->priority = READ_ONCE(sk->sk_priority);
return skb;
}
diff --git a/net/appletalk/aarp.c b/net/appletalk/aarp.c
index c7236daa24152a10cec6c7c9a34f8a86367ebd21..9fa0b246902bef97e07349475fe71ca0cacaf85e 100644
--- a/net/appletalk/aarp.c
+++ b/net/appletalk/aarp.c
@@ -664,7 +664,7 @@ int aarp_send_ddp(struct net_device *dev, struct sk_buff *skb,
sendit:
if (skb->sk)
- skb->priority = skb->sk->sk_priority;
+ skb->priority = READ_ONCE(skb->sk->sk_priority);
if (dev_queue_xmit(skb))
goto drop;
sent:
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index 5db805d5f74d73902071e04802d658e2abef95b6..558e158c98d01075b7614b754a256124c3700a84 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -939,7 +939,7 @@ struct sock *ax25_make_new(struct sock *osk, struct ax25_dev *ax25_dev)
sock_init_data(NULL, sk);
sk->sk_type = osk->sk_type;
- sk->sk_priority = osk->sk_priority;
+ sk->sk_priority = READ_ONCE(osk->sk_priority);
sk->sk_protocol = osk->sk_protocol;
sk->sk_rcvbuf = osk->sk_rcvbuf;
sk->sk_sndbuf = osk->sk_sndbuf;
diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index 3bdfc3f1e73d0f5e24ca30aaed038e45efab437e..e50d3d102078ec4c82ebc844eba913cc19a00c1e 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1615,7 +1615,7 @@ static struct sk_buff *l2cap_sock_alloc_skb_cb(struct l2cap_chan *chan,
return ERR_PTR(-ENOTCONN);
}
- skb->priority = sk->sk_priority;
+ skb->priority = READ_ONCE(sk->sk_priority);
bt_cb(skb)->l2cap.chan = chan;
diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c
index b28c976f52a0a16e13e23aee7c6fe4a7a8c844af..14c43166323393541bc102f47a311c79199a2acd 100644
--- a/net/can/j1939/socket.c
+++ b/net/can/j1939/socket.c
@@ -884,7 +884,7 @@ static struct sk_buff *j1939_sk_alloc_skb(struct net_device *ndev,
skcb = j1939_skb_to_cb(skb);
memset(skcb, 0, sizeof(*skcb));
skcb->addr = jsk->addr;
- skcb->priority = j1939_prio(sk->sk_priority);
+ skcb->priority = j1939_prio(READ_ONCE(sk->sk_priority));
if (msg->msg_name) {
struct sockaddr_can *addr = msg->msg_name;
diff --git a/net/can/raw.c b/net/can/raw.c
index d50c3f3d892f9382a547b7e8dddcab327623ff88..73468d2ebd51effd2a91e660ce3937ddf9c7b39f 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -881,7 +881,7 @@ static int raw_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
}
skb->dev = dev;
- skb->priority = sk->sk_priority;
+ skb->priority = READ_ONCE(sk->sk_priority);
skb->mark = READ_ONCE(sk->sk_mark);
skb->tstamp = sockc.transmit_time;
diff --git a/net/core/sock.c b/net/core/sock.c
index a5995750c5c542d33e8c8c36a701ee9a9e17783d..1fdc0a0d8ff2fb2342618677c3adef2b485c6776 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -806,9 +806,7 @@ EXPORT_SYMBOL(sock_no_linger);
void sock_set_priority(struct sock *sk, u32 priority)
{
- lock_sock(sk);
WRITE_ONCE(sk->sk_priority, priority);
- release_sock(sk);
}
EXPORT_SYMBOL(sock_set_priority);
@@ -1118,6 +1116,18 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
valbool = val ? 1 : 0;
+ /* handle options which do not require locking the socket. */
+ switch (optname) {
+ case SO_PRIORITY:
+ if ((val >= 0 && val <= 6) ||
+ sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) ||
+ sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
+ sock_set_priority(sk, val);
+ return 0;
+ }
+ return -EPERM;
+ }
+
sockopt_lock_sock(sk);
switch (optname) {
@@ -1213,15 +1223,6 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
sk->sk_no_check_tx = valbool;
break;
- case SO_PRIORITY:
- if ((val >= 0 && val <= 6) ||
- sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) ||
- sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
- WRITE_ONCE(sk->sk_priority, val);
- else
- ret = -EPERM;
- break;
-
case SO_LINGER:
if (optlen < sizeof(ling)) {
ret = -EINVAL; /* 1003.1g */
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 80b956b392529dbbe0bf8a04f515118e2ad858ff..8d344b219f84ae391f640d9a2d09700883123dce 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -239,7 +239,7 @@ static int dccp_v6_send_response(const struct sock *sk, struct request_sock *req
if (!opt)
opt = rcu_dereference(np->opt);
err = ip6_xmit(sk, skb, &fl6, READ_ONCE(sk->sk_mark), opt,
- np->tclass, sk->sk_priority);
+ np->tclass, READ_ONCE(sk->sk_priority));
rcu_read_unlock();
err = net_xmit_eval(err);
}
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index e13a84433413ed88088435ff8e11efeb30fc3cca..9f0bd518901a7fd037037d05465d5d9be66f42a7 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -165,7 +165,7 @@ int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
* For cgroup2 classid is always zero.
*/
if (!classid)
- classid = sk->sk_priority;
+ classid = READ_ONCE(sk->sk_priority);
if (nla_put_u32(skb, INET_DIAG_CLASS_ID, classid))
goto errout;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4ab877cf6d35f229761986d5c6a17eb2a3ad4043..6b14097e80ad35e42b9a7d5da977f5f0a7ea2c78 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1449,7 +1449,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
ip_options_build(skb, opt, cork->addr, rt);
}
- skb->priority = (cork->tos != -1) ? cork->priority: sk->sk_priority;
+ skb->priority = (cork->tos != -1) ? cork->priority: READ_ONCE(sk->sk_priority);
skb->mark = cork->mark;
skb->tstamp = cork->transmit_time;
/*
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f13eb7e23d03f3681055257e6ebea0612ae3f9b3..95e972be0c05c17138a293ed891a896ba6ea411e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -828,7 +828,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
inet_twsk(sk)->tw_mark : sk->sk_mark;
ctl_sk->sk_priority = (sk->sk_state == TCP_TIME_WAIT) ?
- inet_twsk(sk)->tw_priority : sk->sk_priority;
+ inet_twsk(sk)->tw_priority : READ_ONCE(sk->sk_priority);
transmit_time = tcp_transmit_time(sk);
xfrm_sk_clone_policy(ctl_sk, sk);
txhash = (sk->sk_state == TCP_TIME_WAIT) ?
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index eee8ab1bfa0e4fecde0cd1ff5d480d11c6741049..3f87611077ef21edb61f3d6c751c88c515bb4b5b 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -292,7 +292,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
tw->tw_transparent = inet_test_bit(TRANSPARENT, sk);
tw->tw_mark = sk->sk_mark;
- tw->tw_priority = sk->sk_priority;
+ tw->tw_priority = READ_ONCE(sk->sk_priority);
tw->tw_rcv_wscale = tp->rx_opt.rcv_wscale;
tcptw->tw_rcv_nxt = tp->rcv_nxt;
tcptw->tw_snd_nxt = tp->snd_nxt;
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 0c50dcd35fe8c7179e8ea0d86c49f891a26fe59e..80043e46117c51b720ace671d8b2edafd022841c 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -133,7 +133,7 @@ int inet6_csk_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl_unused
fl6.daddr = sk->sk_v6_daddr;
res = ip6_xmit(sk, skb, &fl6, sk->sk_mark, rcu_dereference(np->opt),
- np->tclass, sk->sk_priority);
+ np->tclass, READ_ONCE(sk->sk_priority));
rcu_read_unlock();
return res;
}
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 951ba8089b5b44c589f1b497e645ffc15a86c7c8..cdaa9275e99053488c684bb19c4ed651101c2b1c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1984,7 +1984,7 @@ struct sk_buff *__ip6_make_skb(struct sock *sk,
hdr->saddr = fl6->saddr;
hdr->daddr = *final_dst;
- skb->priority = sk->sk_priority;
+ skb->priority = READ_ONCE(sk->sk_priority);
skb->mark = cork->base.mark;
skb->tstamp = cork->base.transmit_time;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 94afb8d0f2d0e4974c3dbe4e3301f0152b5cb9e1..8a6e2e97f673d774f7917d7040bc9dde7c33cbd3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -565,7 +565,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
if (!opt)
opt = rcu_dereference(np->opt);
err = ip6_xmit(sk, skb, fl6, skb->mark ? : READ_ONCE(sk->sk_mark),
- opt, tclass, sk->sk_priority);
+ opt, tclass, READ_ONCE(sk->sk_priority));
rcu_read_unlock();
err = net_xmit_eval(err);
}
@@ -1058,7 +1058,7 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
trace_tcp_send_reset(sk, skb);
if (inet6_test_bit(REPFLOW, sk))
label = ip6_flowlabel(ipv6h);
- priority = sk->sk_priority;
+ priority = READ_ONCE(sk->sk_priority);
txhash = sk->sk_txhash;
}
if (sk->sk_state == TCP_TIME_WAIT) {
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 8260202c00669fd7d2eed2f94a3c2cf225a0d89c..f3485a6b35e706a3da52bb98ac17f1eeaa455b2e 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -89,7 +89,7 @@ static void mptcp_sol_socket_sync_intval(struct mptcp_sock *msk, int optname, in
sock_valbool_flag(ssk, SOCK_KEEPOPEN, !!val);
break;
case SO_PRIORITY:
- ssk->sk_priority = val;
+ WRITE_ONCE(ssk->sk_priority, val);
break;
case SO_SNDBUF:
case SO_SNDBUFFORCE:
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index 96e91ab71573cf391da1627af675f3e6004e94b5..0eed00184adf454d2e06bb44330c079a402a959e 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -487,7 +487,7 @@ static struct sock *nr_make_new(struct sock *osk)
sock_init_data(NULL, sk);
sk->sk_type = osk->sk_type;
- sk->sk_priority = osk->sk_priority;
+ sk->sk_priority = READ_ONCE(osk->sk_priority);
sk->sk_protocol = osk->sk_protocol;
sk->sk_rcvbuf = osk->sk_rcvbuf;
sk->sk_sndbuf = osk->sk_sndbuf;
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index 49dafe9ac72f010c56a5546926ee1a360fa767b7..0cc5a4e19900e10b31172433f36f5835101908ed 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -583,7 +583,7 @@ static struct sock *rose_make_new(struct sock *osk)
#endif
sk->sk_type = osk->sk_type;
- sk->sk_priority = osk->sk_priority;
+ sk->sk_priority = READ_ONCE(osk->sk_priority);
sk->sk_protocol = osk->sk_protocol;
sk->sk_rcvbuf = osk->sk_rcvbuf;
sk->sk_sndbuf = osk->sk_sndbuf;
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index da34fd4c92695f453f1d6547c6e4e8d3afe7a116..09d8afd04a2a78ac55b0ddd1b424ddcb28b9ba83 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -546,7 +546,7 @@ META_COLLECTOR(int_sk_prio)
*err = -1;
return;
}
- dst->value = sk->sk_priority;
+ dst->value = READ_ONCE(sk->sk_priority);
}
META_COLLECTOR(int_sk_rcvlowat)
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 5c0ed5909d85a1fc137e8652e32df75d8bef28ac..24368f755ab19a07e6e6ed4be99043fd41b99421 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -247,7 +247,7 @@ static int sctp_v6_xmit(struct sk_buff *skb, struct sctp_transport *t)
rcu_read_lock();
res = ip6_xmit(sk, skb, fl6, sk->sk_mark,
rcu_dereference(np->opt),
- tclass, sk->sk_priority);
+ tclass, READ_ONCE(sk->sk_priority));
rcu_read_unlock();
return res;
}
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index bacdd971615e43b9bdabcd1395caccd5320e549f..29768160141467515903d994864b16cf0fb19a71 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -493,7 +493,7 @@ static void smc_copy_sock_settings(struct sock *nsk, struct sock *osk,
nsk->sk_sndtimeo = osk->sk_sndtimeo;
nsk->sk_rcvtimeo = osk->sk_rcvtimeo;
nsk->sk_mark = READ_ONCE(osk->sk_mark);
- nsk->sk_priority = osk->sk_priority;
+ nsk->sk_priority = READ_ONCE(osk->sk_priority);
nsk->sk_rcvlowat = osk->sk_rcvlowat;
nsk->sk_bound_dev_if = osk->sk_bound_dev_if;
nsk->sk_err = osk->sk_err;
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 0fb5143bec7ac45374f6b2e1c6133072c8e8145c..aad8ffeaee0415ca907c116016d326e43a3018f2 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -598,7 +598,7 @@ static struct sock *x25_make_new(struct sock *osk)
x25 = x25_sk(sk);
sk->sk_type = osk->sk_type;
- sk->sk_priority = osk->sk_priority;
+ sk->sk_priority = READ_ONCE(osk->sk_priority);
sk->sk_protocol = osk->sk_protocol;
sk->sk_rcvbuf = osk->sk_rcvbuf;
sk->sk_sndbuf = osk->sk_sndbuf;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 7482d0aca5046ed637fcbeca7f5e403ed60eec08..f5e96e0d6e01d4c0121201bc74f40e0785762b2c 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -684,7 +684,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
}
skb->dev = dev;
- skb->priority = xs->sk.sk_priority;
+ skb->priority = READ_ONCE(xs->sk.sk_priority);
skb->mark = READ_ONCE(xs->sk.sk_mark);
skb->destructor = xsk_destruct_skb;
xsk_set_destructor_arg(skb);
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH net-next 2/8] net: lockless SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC
From: Eric Dumazet @ 2023-09-21 20:28 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet
Operations on sock->flags are atomic, so there is no need to hold the socket
lock in sk_setsockopt() for SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
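For reference, a minimal userspace sketch (not part of the patch); SO_PASSCRED is
typically enabled on AF_UNIX sockets so that SCM_CREDENTIALS ancillary data is
delivered with received messages:

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &one, sizeof(one)) < 0)
        perror("setsockopt(SO_PASSCRED)");
    return 0;
}
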
net/core/sock.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index 1fdc0a0d8ff2fb2342618677c3adef2b485c6776..f01c757245683452fd6c30c51b885d09427ef697 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1126,6 +1126,15 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
return 0;
}
return -EPERM;
+ case SO_PASSSEC:
+ assign_bit(SOCK_PASSSEC, &sock->flags, valbool);
+ return 0;
+ case SO_PASSCRED:
+ assign_bit(SOCK_PASSCRED, &sock->flags, valbool);
+ return 0;
+ case SO_PASSPIDFD:
+ assign_bit(SOCK_PASSPIDFD, &sock->flags, valbool);
+ return 0;
}
sockopt_lock_sock(sk);
@@ -1248,14 +1257,6 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
case SO_BSDCOMPAT:
break;
- case SO_PASSCRED:
- assign_bit(SOCK_PASSCRED, &sock->flags, valbool);
- break;
-
- case SO_PASSPIDFD:
- assign_bit(SOCK_PASSPIDFD, &sock->flags, valbool);
- break;
-
case SO_TIMESTAMP_OLD:
case SO_TIMESTAMP_NEW:
case SO_TIMESTAMPNS_OLD:
@@ -1361,9 +1362,6 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
sock_valbool_flag(sk, SOCK_FILTER_LOCKED, valbool);
break;
- case SO_PASSSEC:
- assign_bit(SOCK_PASSSEC, &sock->flags, valbool);
- break;
case SO_MARK:
if (!sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
!sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH net-next 3/8] net: lockless SO_{TYPE|PROTOCOL|DOMAIN|ERROR } setsockopt()
From: Eric Dumazet @ 2023-09-21 20:28 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet
These options cannot be set and simply return -ENOPROTOOPT,
so there is no need to acquire the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
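For reference, a minimal userspace sketch (not part of the patch): these four
options are read-only, so getsockopt() works while setsockopt() fails with
ENOPROTOOPT, now without ever taking the socket lock:

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int type = 0;
    socklen_t len = sizeof(type);

    getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len);
    printf("SO_TYPE = %d\n", type);                     /* SOCK_STREAM */

    if (setsockopt(fd, SOL_SOCKET, SO_TYPE, &type, sizeof(type)) < 0)
        perror("setsockopt(SO_TYPE)");                  /* ENOPROTOOPT */
    return 0;
}
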
net/core/sock.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index f01c757245683452fd6c30c51b885d09427ef697..4d20b74a93cb57bba58447f37e87b677167b8425 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1135,6 +1135,11 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
case SO_PASSPIDFD:
assign_bit(SOCK_PASSPIDFD, &sock->flags, valbool);
return 0;
+ case SO_TYPE:
+ case SO_PROTOCOL:
+ case SO_DOMAIN:
+ case SO_ERROR:
+ return -ENOPROTOOPT;
}
sockopt_lock_sock(sk);
@@ -1152,12 +1157,6 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
case SO_REUSEPORT:
sk->sk_reuseport = valbool;
break;
- case SO_TYPE:
- case SO_PROTOCOL:
- case SO_DOMAIN:
- case SO_ERROR:
- ret = -ENOPROTOOPT;
- break;
case SO_DONTROUTE:
sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
sk_dst_reset(sk);
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH net-next 4/8] net: lockless implementation of SO_BUSY_POLL, SO_PREFER_BUSY_POLL, SO_BUSY_POLL_BUDGET
From: Eric Dumazet @ 2023-09-21 20:28 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet
Setting sk->sk_ll_usec, sk_prefer_busy_poll and sk_busy_poll_budget
does not require the socket lock; readers are lockless anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
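For reference, a minimal userspace sketch (not part of the patch). Enabling
SO_PREFER_BUSY_POLL and raising SO_BUSY_POLL_BUDGET above its current value still
require CAP_NET_ADMIN; the fallback option values below are assumptions taken from
recent uapi headers:

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69   /* assumed value, see recent asm-generic/socket.h */
#define SO_BUSY_POLL_BUDGET 70
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int usec = 50;       /* busy poll for up to 50 usec per syscall */
    int prefer = 1;
    int budget = 16;     /* packets per napi poll while busy polling */

    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec)) < 0)
        perror("SO_BUSY_POLL");
    if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &prefer, sizeof(prefer)) < 0)
        perror("SO_PREFER_BUSY_POLL");
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget, sizeof(budget)) < 0)
        perror("SO_BUSY_POLL_BUDGET");
    return 0;
}
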
net/core/sock.c | 44 ++++++++++++++++++++------------------------
1 file changed, 20 insertions(+), 24 deletions(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index 4d20b74a93cb57bba58447f37e87b677167b8425..408081549bd777811058d5de3e9df0f459e6e999 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1140,6 +1140,26 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
case SO_DOMAIN:
case SO_ERROR:
return -ENOPROTOOPT;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ case SO_BUSY_POLL:
+ if (val < 0)
+ return -EINVAL;
+ WRITE_ONCE(sk->sk_ll_usec, val);
+ return 0;
+ case SO_PREFER_BUSY_POLL:
+ if (valbool && !sockopt_capable(CAP_NET_ADMIN))
+ return -EPERM;
+ WRITE_ONCE(sk->sk_prefer_busy_poll, valbool);
+ return 0;
+ case SO_BUSY_POLL_BUDGET:
+ if (val > READ_ONCE(sk->sk_busy_poll_budget) &&
+ !sockopt_capable(CAP_NET_ADMIN))
+ return -EPERM;
+ if (val < 0 || val > U16_MAX)
+ return -EINVAL;
+ WRITE_ONCE(sk->sk_busy_poll_budget, val);
+ return 0;
+#endif
}
sockopt_lock_sock(sk);
@@ -1402,30 +1422,6 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
sock_valbool_flag(sk, SOCK_SELECT_ERR_QUEUE, valbool);
break;
-#ifdef CONFIG_NET_RX_BUSY_POLL
- case SO_BUSY_POLL:
- if (val < 0)
- ret = -EINVAL;
- else
- WRITE_ONCE(sk->sk_ll_usec, val);
- break;
- case SO_PREFER_BUSY_POLL:
- if (valbool && !sockopt_capable(CAP_NET_ADMIN))
- ret = -EPERM;
- else
- WRITE_ONCE(sk->sk_prefer_busy_poll, valbool);
- break;
- case SO_BUSY_POLL_BUDGET:
- if (val > READ_ONCE(sk->sk_busy_poll_budget) && !sockopt_capable(CAP_NET_ADMIN)) {
- ret = -EPERM;
- } else {
- if (val < 0 || val > U16_MAX)
- ret = -EINVAL;
- else
- WRITE_ONCE(sk->sk_busy_poll_budget, val);
- }
- break;
-#endif
case SO_MAX_PACING_RATE:
{
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH net-next 5/8] net: implement lockless SO_MAX_PACING_RATE
From: Eric Dumazet @ 2023-09-21 20:28 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet
SO_MAX_PACING_RATE setsockopt() does not need to hold
the socket lock: once READ_ONCE() accessors are added,
sk->sk_pacing_rate readers can safely run concurrently
with writes from other threads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
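For reference, a minimal userspace sketch (not part of the patch); the value is a
rate in bytes per second, ~0 means unlimited, and the setsockopt() path accepts
either a 32-bit or a 64-bit value:

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    unsigned int rate = 1250000;   /* ~10 Mbit/s expressed in bytes per second */

    if (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE, &rate, sizeof(rate)) < 0)
        perror("setsockopt(SO_MAX_PACING_RATE)");
    return 0;
}
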
include/trace/events/mptcp.h | 2 +-
net/core/sock.c | 40 +++++++++++++++++++-----------------
net/ipv4/tcp_bbr.c | 13 ++++++------
net/ipv4/tcp_input.c | 4 ++--
net/ipv4/tcp_output.c | 9 ++++----
net/sched/sch_fq.c | 2 +-
6 files changed, 37 insertions(+), 33 deletions(-)
diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
index 563e48617374d3f68dd86b78c13fe6bc28bf6947..09e72215b9f9bb53ec363d7690e9b87a09d172cb 100644
--- a/include/trace/events/mptcp.h
+++ b/include/trace/events/mptcp.h
@@ -44,7 +44,7 @@ TRACE_EVENT(mptcp_subflow_get_send,
ssk = mptcp_subflow_tcp_sock(subflow);
if (ssk && sk_fullsock(ssk)) {
__entry->snd_wnd = tcp_sk(ssk)->snd_wnd;
- __entry->pace = ssk->sk_pacing_rate;
+ __entry->pace = READ_ONCE(ssk->sk_pacing_rate);
} else {
__entry->snd_wnd = 0;
__entry->pace = 0;
diff --git a/net/core/sock.c b/net/core/sock.c
index 408081549bd777811058d5de3e9df0f459e6e999..4254ed0e4817d60cb2bf9d8e62ffcd98a90f7ec6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1160,6 +1160,27 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
WRITE_ONCE(sk->sk_busy_poll_budget, val);
return 0;
#endif
+ case SO_MAX_PACING_RATE:
+ {
+ unsigned long ulval = (val == ~0U) ? ~0UL : (unsigned int)val;
+ unsigned long pacing_rate;
+
+ if (sizeof(ulval) != sizeof(val) &&
+ optlen >= sizeof(ulval) &&
+ copy_from_sockptr(&ulval, optval, sizeof(ulval))) {
+ return -EFAULT;
+ }
+ if (ulval != ~0UL)
+ cmpxchg(&sk->sk_pacing_status,
+ SK_PACING_NONE,
+ SK_PACING_NEEDED);
+ /* Pairs with READ_ONCE() from sk_getsockopt() */
+ WRITE_ONCE(sk->sk_max_pacing_rate, ulval);
+ pacing_rate = READ_ONCE(sk->sk_pacing_rate);
+ if (ulval < pacing_rate)
+ WRITE_ONCE(sk->sk_pacing_rate, ulval);
+ return 0;
+ }
}
sockopt_lock_sock(sk);
@@ -1423,25 +1444,6 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
break;
- case SO_MAX_PACING_RATE:
- {
- unsigned long ulval = (val == ~0U) ? ~0UL : (unsigned int)val;
-
- if (sizeof(ulval) != sizeof(val) &&
- optlen >= sizeof(ulval) &&
- copy_from_sockptr(&ulval, optval, sizeof(ulval))) {
- ret = -EFAULT;
- break;
- }
- if (ulval != ~0UL)
- cmpxchg(&sk->sk_pacing_status,
- SK_PACING_NONE,
- SK_PACING_NEEDED);
- /* Pairs with READ_ONCE() from sk_getsockopt() */
- WRITE_ONCE(sk->sk_max_pacing_rate, ulval);
- sk->sk_pacing_rate = min(sk->sk_pacing_rate, ulval);
- break;
- }
case SO_INCOMING_CPU:
reuseport_update_incoming_cpu(sk, val);
break;
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 146792cd26fed4e61cd72a5d85263b2c7c7b2636..22358032dd484b081d30686fbd03b01fbb9c4214 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -258,7 +258,7 @@ static unsigned long bbr_bw_to_pacing_rate(struct sock *sk, u32 bw, int gain)
u64 rate = bw;
rate = bbr_rate_bytes_per_sec(sk, rate, gain);
- rate = min_t(u64, rate, sk->sk_max_pacing_rate);
+ rate = min_t(u64, rate, READ_ONCE(sk->sk_max_pacing_rate));
return rate;
}
@@ -278,7 +278,8 @@ static void bbr_init_pacing_rate_from_rtt(struct sock *sk)
}
bw = (u64)tcp_snd_cwnd(tp) * BW_UNIT;
do_div(bw, rtt_us);
- sk->sk_pacing_rate = bbr_bw_to_pacing_rate(sk, bw, bbr_high_gain);
+ WRITE_ONCE(sk->sk_pacing_rate,
+ bbr_bw_to_pacing_rate(sk, bw, bbr_high_gain));
}
/* Pace using current bw estimate and a gain factor. */
@@ -290,14 +291,14 @@ static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
if (unlikely(!bbr->has_seen_rtt && tp->srtt_us))
bbr_init_pacing_rate_from_rtt(sk);
- if (bbr_full_bw_reached(sk) || rate > sk->sk_pacing_rate)
- sk->sk_pacing_rate = rate;
+ if (bbr_full_bw_reached(sk) || rate > READ_ONCE(sk->sk_pacing_rate))
+ WRITE_ONCE(sk->sk_pacing_rate, rate);
}
/* override sysctl_tcp_min_tso_segs */
__bpf_kfunc static u32 bbr_min_tso_segs(struct sock *sk)
{
- return sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2;
+ return READ_ONCE(sk->sk_pacing_rate) < (bbr_min_tso_rate >> 3) ? 1 : 2;
}
static u32 bbr_tso_segs_goal(struct sock *sk)
@@ -309,7 +310,7 @@ static u32 bbr_tso_segs_goal(struct sock *sk)
* driver provided sk_gso_max_size.
*/
bytes = min_t(unsigned long,
- sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift),
+ READ_ONCE(sk->sk_pacing_rate) >> READ_ONCE(sk->sk_pacing_shift),
GSO_LEGACY_MAX_SIZE - 1 - MAX_TCP_HEADER);
segs = max_t(u32, bytes / tp->mss_cache, bbr_min_tso_segs(sk));
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 584825ddd0a09a2037aea7869b137c3ac64a1534..22c2a7c2e65ee749a61b5dc74459e0c7db9f4628 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -927,8 +927,8 @@ static void tcp_update_pacing_rate(struct sock *sk)
* without any lock. We want to make sure compiler wont store
* intermediate values in this location.
*/
- WRITE_ONCE(sk->sk_pacing_rate, min_t(u64, rate,
- sk->sk_max_pacing_rate));
+ WRITE_ONCE(sk->sk_pacing_rate,
+ min_t(u64, rate, READ_ONCE(sk->sk_max_pacing_rate)));
}
/* Calculate rto without backoff. This is the second half of Van Jacobson's
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1fc1f879cfd6c28cd655bb8f02eff6624eec2ffc..696dfd64c8c5ffaef43f0f33c9402df2f673dcd3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1201,7 +1201,7 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb,
struct tcp_sock *tp = tcp_sk(sk);
if (sk->sk_pacing_status != SK_PACING_NONE) {
- unsigned long rate = sk->sk_pacing_rate;
+ unsigned long rate = READ_ONCE(sk->sk_pacing_rate);
/* Original sch_fq does not pace first 10 MSS
* Note that tp->data_segs_out overflows after 2^32 packets,
@@ -1973,7 +1973,7 @@ static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
unsigned long bytes;
u32 r;
- bytes = sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift);
+ bytes = READ_ONCE(sk->sk_pacing_rate) >> READ_ONCE(sk->sk_pacing_shift);
r = tcp_min_rtt(tcp_sk(sk)) >> READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tso_rtt_log);
if (r < BITS_PER_TYPE(sk->sk_gso_max_size))
@@ -2553,7 +2553,7 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
limit = max_t(unsigned long,
2 * skb->truesize,
- sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift));
+ READ_ONCE(sk->sk_pacing_rate) >> READ_ONCE(sk->sk_pacing_shift));
if (sk->sk_pacing_status == SK_PACING_NONE)
limit = min_t(unsigned long, limit,
READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_limit_output_bytes));
@@ -2561,7 +2561,8 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
if (static_branch_unlikely(&tcp_tx_delay_enabled) &&
tcp_sk(sk)->tcp_tx_delay) {
- u64 extra_bytes = (u64)sk->sk_pacing_rate * tcp_sk(sk)->tcp_tx_delay;
+ u64 extra_bytes = (u64)READ_ONCE(sk->sk_pacing_rate) *
+ tcp_sk(sk)->tcp_tx_delay;
/* TSQ is based on skb truesize sum (sk_wmem_alloc), so we
* approximate our needs assuming an ~100% skb->truesize overhead.
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index f59a2cb2c803d79bd1f0eb1806464a0220824f9e..1a616bdeaf9ba8ba6413aaae8e6c642174a7196a 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -607,7 +607,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
*/
if (!skb->tstamp) {
if (skb->sk)
- rate = min(skb->sk->sk_pacing_rate, rate);
+ rate = min(READ_ONCE(skb->sk->sk_pacing_rate), rate);
if (rate <= q->low_rate_threshold) {
f->credit = 0;
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH net-next 6/8] net: lockless implementation of SO_TXREHASH
From: Eric Dumazet @ 2023-09-21 20:28 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet
sk->sk_txrehash readers are already safe against
concurrent change of this field.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
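For reference, a minimal userspace sketch (not part of the patch); valid values are
-1 (follow the net.core.txrehash sysctl), 0 (disable) and 1 (enable). The fallback
option value is an assumption taken from recent uapi headers:

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_TXREHASH
#define SO_TXREHASH 74   /* assumed value, see recent asm-generic/socket.h */
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int val = 1;   /* 1: enable, 0: disable, -1: net.core.txrehash default */

    if (setsockopt(fd, SOL_SOCKET, SO_TXREHASH, &val, sizeof(val)) < 0)
        perror("setsockopt(SO_TXREHASH)");
    return 0;
}
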
net/core/sock.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index 4254ed0e4817d60cb2bf9d8e62ffcd98a90f7ec6..f0930f858714b6efdb5b4168d7eb5135f65aded4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1181,6 +1181,16 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
WRITE_ONCE(sk->sk_pacing_rate, ulval);
return 0;
}
+ case SO_TXREHASH:
+ if (val < -1 || val > 1)
+ return -EINVAL;
+ if ((u8)val == SOCK_TXREHASH_DEFAULT)
+ val = READ_ONCE(sock_net(sk)->core.sysctl_txrehash);
+ /* Paired with READ_ONCE() in tcp_rtx_synack()
+ * and sk_getsockopt().
+ */
+ WRITE_ONCE(sk->sk_txrehash, (u8)val);
+ return 0;
}
sockopt_lock_sock(sk);
@@ -1528,19 +1538,6 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
break;
}
- case SO_TXREHASH:
- if (val < -1 || val > 1) {
- ret = -EINVAL;
- break;
- }
- if ((u8)val == SOCK_TXREHASH_DEFAULT)
- val = READ_ONCE(sock_net(sk)->core.sysctl_txrehash);
- /* Paired with READ_ONCE() in tcp_rtx_synack()
- * and sk_getsockopt().
- */
- WRITE_ONCE(sk->sk_txrehash, (u8)val);
- break;
-
default:
ret = -ENOPROTOOPT;
break;
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH net-next 7/8] net: annotate data-races around sk->sk_tx_queue_mapping
From: Eric Dumazet @ 2023-09-21 20:28 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet
This field can be read or written without the socket lock being held.
Add annotations to avoid load-store tearing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
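To illustrate what these annotations guard against (a generic sketch, not this
patch): a plain C access lets the compiler tear, re-load or cache the value, which
is only safe while every access happens under the socket lock. A rough userspace
analogue of the kernel macros:

#include <stdio.h>

/* Simplified analogues of the kernel's READ_ONCE()/WRITE_ONCE():
 * the volatile access forces a single load/store of the whole value
 * instead of letting the compiler split, repeat or cache it.
 */
#define WRITE_ONCE(x, val)  (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)        (*(volatile __typeof__(x) *)&(x))

struct fake_sock {
    unsigned short tx_queue_mapping;
};

int main(void)
{
    struct fake_sock sk = { .tx_queue_mapping = 0xffff };

    WRITE_ONCE(sk.tx_queue_mapping, 3);                       /* writer, cf. sk_tx_queue_set() */
    printf("queue %d\n", READ_ONCE(sk.tx_queue_mapping));     /* lockless reader */
    return 0;
}
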
include/net/sock.h | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 56ac1abadea59e6734396a7ef2e22518a0ba80a1..f33e733167df8c2da9240f4af5ed7d715f347394 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2007,21 +2007,33 @@ static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
/* sk_tx_queue_mapping accept only upto a 16-bit value */
if (WARN_ON_ONCE((unsigned short)tx_queue >= USHRT_MAX))
return;
- sk->sk_tx_queue_mapping = tx_queue;
+ /* Paired with READ_ONCE() in sk_tx_queue_get() and
+ * other WRITE_ONCE() because socket lock might be not held.
+ */
+ WRITE_ONCE(sk->sk_tx_queue_mapping, tx_queue);
}
#define NO_QUEUE_MAPPING USHRT_MAX
static inline void sk_tx_queue_clear(struct sock *sk)
{
- sk->sk_tx_queue_mapping = NO_QUEUE_MAPPING;
+ /* Paired with READ_ONCE() in sk_tx_queue_get() and
+ * other WRITE_ONCE() because socket lock might be not held.
+ */
+ WRITE_ONCE(sk->sk_tx_queue_mapping, NO_QUEUE_MAPPING);
}
static inline int sk_tx_queue_get(const struct sock *sk)
{
- if (sk && sk->sk_tx_queue_mapping != NO_QUEUE_MAPPING)
- return sk->sk_tx_queue_mapping;
+ if (sk) {
+ /* Paired with WRITE_ONCE() in sk_tx_queue_clear()
+ * and sk_tx_queue_set().
+ */
+ int val = READ_ONCE(sk->sk_tx_queue_mapping);
+ if (val != NO_QUEUE_MAPPING)
+ return val;
+ }
return -1;
}
--
2.42.0.515.g380fc7ccd1-goog
* [PATCH net-next 8/8] net: annotate data-races around sk->sk_dst_pending_confirm
From: Eric Dumazet @ 2023-09-21 20:28 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet
This field can be read or written without the socket lock being held.
Add annotations to avoid load-store tearing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sock.h | 6 +++---
net/core/sock.c | 2 +-
net/ipv4/tcp_output.c | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index f33e733167df8c2da9240f4af5ed7d715f347394..e70afdb4d29b680aa1081f2b57bab60700b56f5f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2182,7 +2182,7 @@ static inline void __dst_negative_advice(struct sock *sk)
if (ndst != dst) {
rcu_assign_pointer(sk->sk_dst_cache, ndst);
sk_tx_queue_clear(sk);
- sk->sk_dst_pending_confirm = 0;
+ WRITE_ONCE(sk->sk_dst_pending_confirm, 0);
}
}
}
@@ -2199,7 +2199,7 @@ __sk_dst_set(struct sock *sk, struct dst_entry *dst)
struct dst_entry *old_dst;
sk_tx_queue_clear(sk);
- sk->sk_dst_pending_confirm = 0;
+ WRITE_ONCE(sk->sk_dst_pending_confirm, 0);
old_dst = rcu_dereference_protected(sk->sk_dst_cache,
lockdep_sock_is_held(sk));
rcu_assign_pointer(sk->sk_dst_cache, dst);
@@ -2212,7 +2212,7 @@ sk_dst_set(struct sock *sk, struct dst_entry *dst)
struct dst_entry *old_dst;
sk_tx_queue_clear(sk);
- sk->sk_dst_pending_confirm = 0;
+ WRITE_ONCE(sk->sk_dst_pending_confirm, 0);
old_dst = xchg((__force struct dst_entry **)&sk->sk_dst_cache, dst);
dst_release(old_dst);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index f0930f858714b6efdb5b4168d7eb5135f65aded4..290165954379292782a484d378a865cc52ca6753 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -600,7 +600,7 @@ struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, ipv4_dst_check,
dst, cookie) == NULL) {
sk_tx_queue_clear(sk);
- sk->sk_dst_pending_confirm = 0;
+ WRITE_ONCE(sk->sk_dst_pending_confirm, 0);
RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
dst_release(dst);
return NULL;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 696dfd64c8c5ffaef43f0f33c9402df2f673dcd3..a13779b24a6c18419836651f82352b324f1dec57 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1325,7 +1325,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
skb->destructor = skb_is_tcp_pure_ack(skb) ? __sock_wfree : tcp_wfree;
refcount_add(skb->truesize, &sk->sk_wmem_alloc);
- skb_set_dst_pending_confirm(skb, sk->sk_dst_pending_confirm);
+ skb_set_dst_pending_confirm(skb, READ_ONCE(sk->sk_dst_pending_confirm));
/* Build TCP header and checksum it. */
th = (struct tcphdr *)skb->data;
--
2.42.0.515.g380fc7ccd1-goog
* Re: [PATCH net-next 1/8] net: implement lockless SO_PRIORITY
From: Wenjia Zhang @ 2023-09-21 23:37 UTC
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet
On 21.09.23 22:28, Eric Dumazet wrote:
> This is a followup of 8bf43be799d4 ("net: annotate data-races
> around sk->sk_priority").
>
> sk->sk_priority can be read and written without holding the socket lock.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> drivers/net/ppp/pppoe.c | 2 +-
> include/net/bluetooth/bluetooth.h | 2 +-
> net/appletalk/aarp.c | 2 +-
> net/ax25/af_ax25.c | 2 +-
> net/bluetooth/l2cap_sock.c | 2 +-
> net/can/j1939/socket.c | 2 +-
> net/can/raw.c | 2 +-
> net/core/sock.c | 23 ++++++++++++-----------
> net/dccp/ipv6.c | 2 +-
> net/ipv4/inet_diag.c | 2 +-
> net/ipv4/ip_output.c | 2 +-
> net/ipv4/tcp_ipv4.c | 2 +-
> net/ipv4/tcp_minisocks.c | 2 +-
> net/ipv6/inet6_connection_sock.c | 2 +-
> net/ipv6/ip6_output.c | 2 +-
> net/ipv6/tcp_ipv6.c | 4 ++--
> net/mptcp/sockopt.c | 2 +-
> net/netrom/af_netrom.c | 2 +-
> net/rose/af_rose.c | 2 +-
> net/sched/em_meta.c | 2 +-
> net/sctp/ipv6.c | 2 +-
> net/smc/af_smc.c | 2 +-
> net/x25/af_x25.c | 2 +-
> net/xdp/xsk.c | 2 +-
> 24 files changed, 36 insertions(+), 35 deletions(-)
>
Thank you, Eric, for the fix!
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
* Re: [PATCH net-next 0/8] net: more data-races fixes and lockless socket options
From: Simon Horman @ 2023-09-22 16:56 UTC
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
eric.dumazet
On Thu, Sep 21, 2023 at 08:28:10PM +0000, Eric Dumazet wrote:
> This is yet another round of data-races fixes,
> and lockless socket options.
>
> Eric Dumazet (8):
> net: implement lockless SO_PRIORITY
> net: lockless SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC
> net: lockless SO_{TYPE|PROTOCOL|DOMAIN|ERROR } setsockopt()
> net: lockless implementation of SO_BUSY_POLL, SO_PREFER_BUSY_POLL,
> SO_BUSY_POLL_BUDGET
> net: implement lockless SO_MAX_PACING_RATE
> net: lockless implementation of SO_TXREHASH
> net: annotate data-races around sk->sk_tx_queue_mapping
> net: annotate data-races around sk->sk_dst_pending_confirm
For series,
Reviewed-by: Simon Horman <horms@kernel.org>
* Re: [PATCH net-next 0/8] net: more data-races fixes and lockless socket options
From: patchwork-bot+netdevbpf @ 2023-10-01 18:20 UTC
To: Eric Dumazet; +Cc: davem, kuba, pabeni, netdev, eric.dumazet
Hello:
This series was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:
On Thu, 21 Sep 2023 20:28:10 +0000 you wrote:
> This is yet another round of data-races fixes,
> and lockless socket options.
>
> Eric Dumazet (8):
> net: implement lockless SO_PRIORITY
> net: lockless SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC
> net: lockless SO_{TYPE|PROTOCOL|DOMAIN|ERROR } setsockopt()
> net: lockless implementation of SO_BUSY_POLL, SO_PREFER_BUSY_POLL,
> SO_BUSY_POLL_BUDGET
> net: implement lockless SO_MAX_PACING_RATE
> net: lockless implementation of SO_TXREHASH
> net: annotate data-races around sk->sk_tx_queue_mapping
> net: annotate data-races around sk->sk_dst_pending_confirm
>
> [...]
Here is the summary with links:
- [net-next,1/8] net: implement lockless SO_PRIORITY
https://git.kernel.org/netdev/net-next/c/10bbf1652c1c
- [net-next,2/8] net: lockless SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC
https://git.kernel.org/netdev/net-next/c/8ebfb6db5a01
- [net-next,3/8] net: lockless SO_{TYPE|PROTOCOL|DOMAIN|ERROR } setsockopt()
https://git.kernel.org/netdev/net-next/c/b120251590a9
- [net-next,4/8] net: lockless implementation of SO_BUSY_POLL, SO_PREFER_BUSY_POLL, SO_BUSY_POLL_BUDGET
https://git.kernel.org/netdev/net-next/c/2a4319cf3c83
- [net-next,5/8] net: implement lockless SO_MAX_PACING_RATE
https://git.kernel.org/netdev/net-next/c/28b24f90020f
- [net-next,6/8] net: lockless implementation of SO_TXREHASH
https://git.kernel.org/netdev/net-next/c/5eef0b8de1be
- [net-next,7/8] net: annotate data-races around sk->sk_tx_queue_mapping
https://git.kernel.org/netdev/net-next/c/0bb4d124d340
- [net-next,8/8] net: annotate data-races around sk->sk_dst_pending_confirm
https://git.kernel.org/netdev/net-next/c/eb44ad4e6351
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html