[PATCH net-next v2 0/3] Fix netlink rcvbuf wraparound

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next v2 0/3] Fix netlink rcvbuf wraparound
@ 2025-06-18 23:13 Jason Baron
  2025-06-18 23:13 ` [PATCH net-next v2 1/3] net: add sock_rcvbuf_has_space() helper Jason Baron
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jason Baron @ 2025-06-18 23:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, kuba, pabeni, horms, kuniyu

The sk->sk_rmem_alloc field of a netlink socket can wraparound as a
signed int when comparing to sk->sk_rcvbuf, when sk->sk_rcvbuf approaches
INT_MAX. This can be reproduced by forcing sk->sk_rcvbuf to INT_MAX and
this can exhaust all of memory.

I've added a sock_rcvbuf_has_space() helper function to generalize the
fix as a similar approach has already been implemented for udp sockets.

v2:
-add Fixes:
-add sock_rcvbuf_has_space() helper
-use helper functions for udp netlink
-remove excessive parentheses

Jason Baron (3):
  net: add sock_rcvbuf_has_space() helper
  udp: use __sock_rcvbuf_has_space() helper
  netlink: Fix wraparound of sk->sk_rmem_alloc

 include/net/sock.h       | 38 ++++++++++++++++++++++++++++++++++++++
 net/ipv4/udp.c           | 13 ++-----------
 net/netlink/af_netlink.c | 35 +++++++++++++++++++++--------------
 3 files changed, 61 insertions(+), 25 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH net-next v2 1/3] net: add sock_rcvbuf_has_space() helper
  2025-06-18 23:13 [PATCH net-next v2 0/3] Fix netlink rcvbuf wraparound Jason Baron
@ 2025-06-18 23:13 ` Jason Baron
  2025-06-18 23:13 ` [PATCH net-next v2 2/3] udp: use __sock_rcvbuf_has_space() helper Jason Baron
  2025-06-18 23:13 ` [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc Jason Baron
  2 siblings, 0 replies; 12+ messages in thread
From: Jason Baron @ 2025-06-18 23:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, kuba, pabeni, horms, kuniyu

Let's add helper function that abstract the sk->sk_rcvbuf limits check
for wraparound that Kuniyuki Iwashima introduce for udp receive path:

commit 5a465a0da13e ("udp: Fix multiple wraparounds of sk->sk_rmem_alloc.")

Note that I copied Kuniyuki Iwashima's comments into
sock_rcvbuf_has_space(). Subsequent patches will use this for udp and
netlink sockets.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 include/net/sock.h | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 70c0b570a21f..5f4fdf96bba6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2484,6 +2484,44 @@ static inline unsigned long sock_wspace(struct sock *sk)
 	return amt;
 }
 
+static inline bool __sock_rcvbuf_has_space(unsigned int rmem,
+					   unsigned int rcvbuf,
+					   unsigned int size)
+{
+	/* Immediately drop when the receive queue is full. */
+	if (rmem + size > rcvbuf) {
+		if (rcvbuf > INT_MAX >> 1)
+			return false;
+
+		/* Always allow at least one packet for small buffer. */
+		if (rmem > rcvbuf)
+			return false;
+	}
+	return true;
+}
+
+/**
+ * sock_rcvbuf_has_space - check if sk->sk_rcvbuf has space
+ * @sk: socket
+ * @skb: socket buffer
+ *
+ * Can skb->truesize bytes be added to the socket receive buffer
+ * while respecting the sk->sk_rcvbuf limit. Note that rcvbuf and
+ * rmem are assigned to unsigned int to avoid wraparound.
+ *
+ * Return: true if there is space, false otherwise
+ */
+static inline bool sock_rcvbuf_has_space(struct sock *sk, struct sk_buff *skb)
+{
+	unsigned int rmem, rcvbuf;
+
+	/* Cast to unsigned int performs the boundary check for INT_MAX. */
+	rmem = atomic_read(&sk->sk_rmem_alloc);
+	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
+
+	return __sock_rcvbuf_has_space(rmem, rcvbuf, skb->truesize);
+}
+
 /* Note:
  *  We use sk->sk_wq_raw, from contexts knowing this
  *  pointer is not NULL and cannot disappear/change.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net-next v2 2/3] udp: use __sock_rcvbuf_has_space() helper
  2025-06-18 23:13 [PATCH net-next v2 0/3] Fix netlink rcvbuf wraparound Jason Baron
  2025-06-18 23:13 ` [PATCH net-next v2 1/3] net: add sock_rcvbuf_has_space() helper Jason Baron
@ 2025-06-18 23:13 ` Jason Baron
  2025-06-18 23:13 ` [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc Jason Baron
  2 siblings, 0 replies; 12+ messages in thread
From: Jason Baron @ 2025-06-18 23:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, kuba, pabeni, horms, kuniyu

Make use of __sock_rcvbuf_has_space() in __udp_enqueue_schedule_skb().

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 net/ipv4/udp.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index dde52b8050b8..64e6c34a8fb8 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1733,17 +1733,8 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
 	size = skb->truesize;
 
-	/* Immediately drop when the receive queue is full.
-	 * Cast to unsigned int performs the boundary check for INT_MAX.
-	 */
-	if (rmem + size > rcvbuf) {
-		if (rcvbuf > INT_MAX >> 1)
-			goto drop;
-
-		/* Always allow at least one packet for small buffer. */
-		if (rmem > rcvbuf)
-			goto drop;
-	}
+	if (!__sock_rcvbuf_has_space(rmem, rcvbuf, size))
+		goto drop;
 
 	/* Under mem pressure, it might be helpful to help udp_recvmsg()
 	 * having linear skbs :
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc
  2025-06-18 23:13 [PATCH net-next v2 0/3] Fix netlink rcvbuf wraparound Jason Baron
  2025-06-18 23:13 ` [PATCH net-next v2 1/3] net: add sock_rcvbuf_has_space() helper Jason Baron
  2025-06-18 23:13 ` [PATCH net-next v2 2/3] udp: use __sock_rcvbuf_has_space() helper Jason Baron
@ 2025-06-18 23:13 ` Jason Baron
  2025-06-19  6:13   ` Kuniyuki Iwashima
  2 siblings, 1 reply; 12+ messages in thread
From: Jason Baron @ 2025-06-18 23:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, kuba, pabeni, horms, kuniyu

For netlink sockets, when comparing allocated rmem memory with the
rcvbuf limit, the comparison is done using signed values. This means
that if rcvbuf is near INT_MAX, then sk->sk_rmem_alloc may become
negative in the comparison with rcvbuf which will yield incorrect
results.

This can be reproduced by using the program from SOCK_DIAG(7) with
some slight modifications. First, setting sk->sk_rcvbuf to INT_MAX
using SO_RCVBUFFORCE and then secondly running the "send_query()"
in a loop while not calling "receive_responses()". In this case,
the value of sk->sk_rmem_alloc will continuously wrap around
and thus more memory is allocated than the sk->sk_rcvbuf limit.
This will eventually fill all of memory leading to an out of memory
condition with skbs filling up the slab.

Let's fix this in a similar manner to:
commit 5a465a0da13e ("udp: Fix multiple wraparounds of sk->sk_rmem_alloc.")

As noted in that fix, if there are multiple threads writing to a
netlink socket it's possible to slightly exceed rcvbuf value. But as
noted this avoids an expensive 'atomic_add_return()' for the common
case. I've confirmed that with the fix the modified program from
SOCK_DIAG(7) can no longer fill memory and the sk->sk_rcvbuf limit
is enforced.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
---
 net/netlink/af_netlink.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e8972a857e51..a770b90abe7d 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1216,8 +1216,8 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 
 	nlk = nlk_sk(sk);
 
-	if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	     test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
+	if (!sock_rcvbuf_has_space(sk, skb) ||
+	    test_bit(NETLINK_S_CONGESTED, &nlk->state)) {
 		DECLARE_WAITQUEUE(wait, current);
 		if (!*timeo) {
 			if (!ssk || netlink_is_kernel(ssk))
@@ -1230,7 +1230,7 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 		__set_current_state(TASK_INTERRUPTIBLE);
 		add_wait_queue(&nlk->wait, &wait);
 
-		if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+		if ((!sock_rcvbuf_has_space(sk, skb) ||
 		     test_bit(NETLINK_S_CONGESTED, &nlk->state)) &&
 		    !sock_flag(sk, SOCK_DEAD))
 			*timeo = schedule_timeout(*timeo);
@@ -1383,12 +1383,15 @@ EXPORT_SYMBOL_GPL(netlink_strict_get_check);
 static int netlink_broadcast_deliver(struct sock *sk, struct sk_buff *skb)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);
+	unsigned int rmem, rcvbuf;
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+	if (sock_rcvbuf_has_space(sk, skb) &&
 	    !test_bit(NETLINK_S_CONGESTED, &nlk->state)) {
 		netlink_skb_set_owner_r(skb, sk);
 		__netlink_sendskb(sk, skb);
-		return atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1);
+		rmem = atomic_read(&sk->sk_rmem_alloc);
+		rcvbuf = READ_ONCE(sk->sk_rcvbuf);
+		return rmem > (rcvbuf >> 1);
 	}
 	return -1;
 }
@@ -1895,6 +1898,7 @@ static int netlink_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 	struct netlink_sock *nlk = nlk_sk(sk);
 	size_t copied, max_recvmsg_len;
 	struct sk_buff *skb, *data_skb;
+	unsigned int rmem, rcvbuf;
 	int err, ret;
 
 	if (flags & MSG_OOB)
@@ -1960,12 +1964,15 @@ static int netlink_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 
 	skb_free_datagram(sk, skb);
 
-	if (READ_ONCE(nlk->cb_running) &&
-	    atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) {
-		ret = netlink_dump(sk, false);
-		if (ret) {
-			WRITE_ONCE(sk->sk_err, -ret);
-			sk_error_report(sk);
+	if (READ_ONCE(nlk->cb_running)) {
+		rmem = atomic_read(&sk->sk_rmem_alloc);
+		rcvbuf = READ_ONCE(sk->sk_rcvbuf);
+		if (rmem <= (rcvbuf >> 1)) {
+			ret = netlink_dump(sk, false);
+			if (ret) {
+				WRITE_ONCE(sk->sk_err, -ret);
+				sk_error_report(sk);
+			}
 		}
 	}
 
@@ -2258,9 +2265,6 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
 		goto errout_skb;
 	}
 
-	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
-		goto errout_skb;
-
 	/* NLMSG_GOODSIZE is small to avoid high order allocations being
 	 * required, but it makes sense to _attempt_ a 32KiB allocation
 	 * to reduce number of system calls on dump operations, if user
@@ -2283,6 +2287,9 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
 	if (!skb)
 		goto errout_skb;
 
+	if (!sock_rcvbuf_has_space(sk, skb))
+		goto errout_skb;
+
 	/* Trim skb to allocated size. User is expected to provide buffer as
 	 * large as max(min_dump_alloc, 32KiB (max_recvmsg_len capped at
 	 * netlink_recvmsg())). dump will pack as many smaller messages as
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc
  2025-06-18 23:13 ` [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc Jason Baron
@ 2025-06-19  6:13   ` Kuniyuki Iwashima
  2025-06-23 23:35     ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-06-19  6:13 UTC (permalink / raw)
  To: jbaron; +Cc: davem, edumazet, horms, kuba, kuniyu, netdev, pabeni

From: Jason Baron <jbaron@akamai.com>
Date: Wed, 18 Jun 2025 19:13:23 -0400
> For netlink sockets, when comparing allocated rmem memory with the
> rcvbuf limit, the comparison is done using signed values. This means
> that if rcvbuf is near INT_MAX, then sk->sk_rmem_alloc may become
> negative in the comparison with rcvbuf which will yield incorrect
> results.
> 
> This can be reproduced by using the program from SOCK_DIAG(7) with
> some slight modifications. First, setting sk->sk_rcvbuf to INT_MAX
> using SO_RCVBUFFORCE and then secondly running the "send_query()"
> in a loop while not calling "receive_responses()". In this case,
> the value of sk->sk_rmem_alloc will continuously wrap around
> and thus more memory is allocated than the sk->sk_rcvbuf limit.
> This will eventually fill all of memory leading to an out of memory
> condition with skbs filling up the slab.
> 
> Let's fix this in a similar manner to:
> commit 5a465a0da13e ("udp: Fix multiple wraparounds of sk->sk_rmem_alloc.")
> 
> As noted in that fix, if there are multiple threads writing to a
> netlink socket it's possible to slightly exceed rcvbuf value. But as
> noted this avoids an expensive 'atomic_add_return()' for the common
> case.

This was because UDP RX path is the fast path, but netlink isn't.
Also, it's common for UDP that multiple packets for the same socket
are processed concurrently, and 850cbaddb52d dropped lock_sock from
the path.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc
  2025-06-19  6:13   ` Kuniyuki Iwashima
@ 2025-06-23 23:35     ` Jakub Kicinski
  2025-06-24  7:55       ` Paolo Abeni
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2025-06-23 23:35 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: jbaron, davem, edumazet, horms, kuniyu, netdev, pabeni

On Wed, 18 Jun 2025 23:13:02 -0700 Kuniyuki Iwashima wrote:
> From: Jason Baron <jbaron@akamai.com>
> Date: Wed, 18 Jun 2025 19:13:23 -0400
> > For netlink sockets, when comparing allocated rmem memory with the
> > rcvbuf limit, the comparison is done using signed values. This means
> > that if rcvbuf is near INT_MAX, then sk->sk_rmem_alloc may become
> > negative in the comparison with rcvbuf which will yield incorrect
> > results.
> > 
> > This can be reproduced by using the program from SOCK_DIAG(7) with
> > some slight modifications. First, setting sk->sk_rcvbuf to INT_MAX
> > using SO_RCVBUFFORCE and then secondly running the "send_query()"
> > in a loop while not calling "receive_responses()". In this case,
> > the value of sk->sk_rmem_alloc will continuously wrap around
> > and thus more memory is allocated than the sk->sk_rcvbuf limit.
> > This will eventually fill all of memory leading to an out of memory
> > condition with skbs filling up the slab.
> > 
> > Let's fix this in a similar manner to:
> > commit 5a465a0da13e ("udp: Fix multiple wraparounds of sk->sk_rmem_alloc.")
> > 
> > As noted in that fix, if there are multiple threads writing to a
> > netlink socket it's possible to slightly exceed rcvbuf value. But as
> > noted this avoids an expensive 'atomic_add_return()' for the common
> > case.  
> 
> This was because UDP RX path is the fast path, but netlink isn't.
> Also, it's common for UDP that multiple packets for the same socket
> are processed concurrently, and 850cbaddb52d dropped lock_sock from
> the path.

To be clear -- are you saying we should fix this differently?
Or perhaps that the problem doesn't exist? The change doesn't
seem very intrusive..

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc
  2025-06-23 23:35     ` Jakub Kicinski
@ 2025-06-24  7:55       ` Paolo Abeni
  2025-06-24 13:57         ` Jason Baron
  2025-06-24 14:11         ` Jakub Kicinski
  0 siblings, 2 replies; 12+ messages in thread
From: Paolo Abeni @ 2025-06-24  7:55 UTC (permalink / raw)
  To: Jakub Kicinski, jbaron
  Cc: davem, edumazet, horms, kuniyu, netdev, Kuniyuki Iwashima

On 6/24/25 1:35 AM, Jakub Kicinski wrote:
> On Wed, 18 Jun 2025 23:13:02 -0700 Kuniyuki Iwashima wrote:
>> From: Jason Baron <jbaron@akamai.com>
>> Date: Wed, 18 Jun 2025 19:13:23 -0400
>>> For netlink sockets, when comparing allocated rmem memory with the
>>> rcvbuf limit, the comparison is done using signed values. This means
>>> that if rcvbuf is near INT_MAX, then sk->sk_rmem_alloc may become
>>> negative in the comparison with rcvbuf which will yield incorrect
>>> results.
>>>
>>> This can be reproduced by using the program from SOCK_DIAG(7) with
>>> some slight modifications. First, setting sk->sk_rcvbuf to INT_MAX
>>> using SO_RCVBUFFORCE and then secondly running the "send_query()"
>>> in a loop while not calling "receive_responses()". In this case,
>>> the value of sk->sk_rmem_alloc will continuously wrap around
>>> and thus more memory is allocated than the sk->sk_rcvbuf limit.
>>> This will eventually fill all of memory leading to an out of memory
>>> condition with skbs filling up the slab.
>>>
>>> Let's fix this in a similar manner to:
>>> commit 5a465a0da13e ("udp: Fix multiple wraparounds of sk->sk_rmem_alloc.")
>>>
>>> As noted in that fix, if there are multiple threads writing to a
>>> netlink socket it's possible to slightly exceed rcvbuf value. But as
>>> noted this avoids an expensive 'atomic_add_return()' for the common
>>> case.  
>>
>> This was because UDP RX path is the fast path, but netlink isn't.
>> Also, it's common for UDP that multiple packets for the same socket
>> are processed concurrently, and 850cbaddb52d dropped lock_sock from
>> the path.
> 
> To be clear -- are you saying we should fix this differently?
> Or perhaps that the problem doesn't exist? The change doesn't
> seem very intrusive..

AFAICS the race is possible even with netlink as netlink_unicast() runs
without the socket lock, too.

The point is that for UDP the scenario with multiple threads enqueuing a
packet into the same socket is a critical path, optimizing for
performances and allowing some memory accounting inaccuracy makes sense.

For netlink socket, that scenario looks a patological one and I think we
should prefer accuracy instead of optimization.

Thanks,

Paolo



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc
  2025-06-24  7:55       ` Paolo Abeni
@ 2025-06-24 13:57         ` Jason Baron
  2025-06-24 14:11         ` Jakub Kicinski
  1 sibling, 0 replies; 12+ messages in thread
From: Jason Baron @ 2025-06-24 13:57 UTC (permalink / raw)
  To: Paolo Abeni, Jakub Kicinski
  Cc: davem, edumazet, horms, kuniyu, netdev, Kuniyuki Iwashima



On 6/24/25 3:55 AM, Paolo Abeni wrote:
> !-------------------------------------------------------------------|
>    This Message Is From an External Sender
>    This message came from outside your organization.
> |-------------------------------------------------------------------!
> 
> On 6/24/25 1:35 AM, Jakub Kicinski wrote:
>> On Wed, 18 Jun 2025 23:13:02 -0700 Kuniyuki Iwashima wrote:
>>> From: Jason Baron <jbaron@akamai.com>
>>> Date: Wed, 18 Jun 2025 19:13:23 -0400
>>>> For netlink sockets, when comparing allocated rmem memory with the
>>>> rcvbuf limit, the comparison is done using signed values. This means
>>>> that if rcvbuf is near INT_MAX, then sk->sk_rmem_alloc may become
>>>> negative in the comparison with rcvbuf which will yield incorrect
>>>> results.
>>>>
>>>> This can be reproduced by using the program from SOCK_DIAG(7) with
>>>> some slight modifications. First, setting sk->sk_rcvbuf to INT_MAX
>>>> using SO_RCVBUFFORCE and then secondly running the "send_query()"
>>>> in a loop while not calling "receive_responses()". In this case,
>>>> the value of sk->sk_rmem_alloc will continuously wrap around
>>>> and thus more memory is allocated than the sk->sk_rcvbuf limit.
>>>> This will eventually fill all of memory leading to an out of memory
>>>> condition with skbs filling up the slab.
>>>>
>>>> Let's fix this in a similar manner to:
>>>> commit 5a465a0da13e ("udp: Fix multiple wraparounds of sk->sk_rmem_alloc.")
>>>>
>>>> As noted in that fix, if there are multiple threads writing to a
>>>> netlink socket it's possible to slightly exceed rcvbuf value. But as
>>>> noted this avoids an expensive 'atomic_add_return()' for the common
>>>> case.
>>>
>>> This was because UDP RX path is the fast path, but netlink isn't.
>>> Also, it's common for UDP that multiple packets for the same socket
>>> are processed concurrently, and 850cbaddb52d dropped lock_sock from
>>> the path.
>>
>> To be clear -- are you saying we should fix this differently?
>> Or perhaps that the problem doesn't exist? The change doesn't
>> seem very intrusive..
> 
> AFAICS the race is possible even with netlink as netlink_unicast() runs
> without the socket lock, too.
> 
> The point is that for UDP the scenario with multiple threads enqueuing a
> packet into the same socket is a critical path, optimizing for
> performances and allowing some memory accounting inaccuracy makes sense.
> 
> For netlink socket, that scenario looks a patological one and I think we
> should prefer accuracy instead of optimization.
> 
> Thanks,
> 
> Paolo
> 

Hi,

Since the current netlink code already allows memory allocation to 
slightly exceed rcvbuf even when there is no overflow, I didn't think we 
would want to change this especially for smaller buffers as it may cause 
a regression. I tried to preserve this behavior if you look at the 
__sock_rcvbuf_has_space() function in patch 1/3.

For larger buffers, I think accurate accounting could make sense, but I 
thought it made more sense to try and optimize this common case as 
Kuniyuki Iwashima did, consolidate code between multiple use-cases and 
not overly special case handling for large buffers.

Thanks,

-Jason



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc
  2025-06-24  7:55       ` Paolo Abeni
  2025-06-24 13:57         ` Jason Baron
@ 2025-06-24 14:11         ` Jakub Kicinski
  2025-06-24 17:08           ` Kuniyuki Iwashima
  1 sibling, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2025-06-24 14:11 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: jbaron, davem, edumazet, horms, kuniyu, netdev, Kuniyuki Iwashima

On Tue, 24 Jun 2025 09:55:15 +0200 Paolo Abeni wrote:
> > To be clear -- are you saying we should fix this differently?
> > Or perhaps that the problem doesn't exist? The change doesn't
> > seem very intrusive..  
> 
> AFAICS the race is possible even with netlink as netlink_unicast() runs
> without the socket lock, too.
> 
> The point is that for UDP the scenario with multiple threads enqueuing a
> packet into the same socket is a critical path, optimizing for
> performances and allowing some memory accounting inaccuracy makes sense.
> 
> For netlink socket, that scenario looks a patological one and I think we
> should prefer accuracy instead of optimization.

Could you ELI5 what you mean? Are you suggesting a lock around every
sk_rmem write for netlink sockets? 
If we think this is an attack vector the attacker can simply use a UDP
socket instead. Or do you think it'd lead to simpler code?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc
  2025-06-24 14:11         ` Jakub Kicinski
@ 2025-06-24 17:08           ` Kuniyuki Iwashima
  2025-06-24 22:03             ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-06-24 17:08 UTC (permalink / raw)
  To: kuba; +Cc: davem, edumazet, horms, jbaron, kuni1840, kuniyu, netdev, pabeni

Date: Tue, 24 Jun 2025 07:11:57 -0700
From: Jakub Kicinski <kuba@kernel.org>
> On Tue, 24 Jun 2025 09:55:15 +0200 Paolo Abeni wrote:
> > > To be clear -- are you saying we should fix this differently?
> > > Or perhaps that the problem doesn't exist? The change doesn't
> > > seem very intrusive..  
> > 
> > AFAICS the race is possible even with netlink as netlink_unicast() runs
> > without the socket lock, too.
> > 
> > The point is that for UDP the scenario with multiple threads enqueuing a
> > packet into the same socket is a critical path, optimizing for
> > performances and allowing some memory accounting inaccuracy makes sense.
> > 
> > For netlink socket, that scenario looks a patological one and I think we
> > should prefer accuracy instead of optimization.
> 
> Could you ELI5 what you mean? Are you suggesting a lock around every
> sk_rmem write for netlink sockets? 
> If we think this is an attack vector the attacker can simply use a UDP
> socket instead. Or do you think it'd lead to simpler code?

I was wondering if atomic_add_return() is expensive for netlink,
and if not, we could use it like below.  I'm also not sure we want
to keep the allow-at-least-one-skb rule for netlink though, which
comes from the first condition in __sock_queue_rcv_skb() for UDP
in the past, IIRC.

untested:

---8<---
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e8972a857e51..e1a9ae7ff521 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -387,7 +387,6 @@ static void netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
 	WARN_ON(skb->sk != NULL);
 	skb->sk = sk;
 	skb->destructor = netlink_skb_destructor;
-	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
 	sk_mem_charge(sk, skb->truesize);
 }
 
@@ -1212,41 +1211,45 @@ struct sk_buff *netlink_alloc_large_skb(unsigned int size, int broadcast)
 int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 		      long *timeo, struct sock *ssk)
 {
+	unsigned long rmem = atomic_add_return(skb->truesize, &sk->sk_rmem_alloc);
+	DECLARE_WAITQUEUE(wait, current);
 	struct netlink_sock *nlk;
 
 	nlk = nlk_sk(sk);
 
-	if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	     test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
-		DECLARE_WAITQUEUE(wait, current);
-		if (!*timeo) {
-			if (!ssk || netlink_is_kernel(ssk))
-				netlink_overrun(sk);
-			sock_put(sk);
-			kfree_skb(skb);
-			return -EAGAIN;
-		}
-
-		__set_current_state(TASK_INTERRUPTIBLE);
-		add_wait_queue(&nlk->wait, &wait);
+	if (rmem == skb->truesize ||
+	    (rmem < sk->sk_rcvbuf && !test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
+		netlink_skb_set_owner_r(skb, sk);
+		return 0;
+	}
 
-		if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-		     test_bit(NETLINK_S_CONGESTED, &nlk->state)) &&
-		    !sock_flag(sk, SOCK_DEAD))
-			*timeo = schedule_timeout(*timeo);
+	atomic_dec(skb->truesize, &sk->sk_rmem_alloc);
 
-		__set_current_state(TASK_RUNNING);
-		remove_wait_queue(&nlk->wait, &wait);
+	if (!*timeo) {
+		if (!ssk || netlink_is_kernel(ssk))
+			netlink_overrun(sk);
 		sock_put(sk);
+		kfree_skb(skb);
+		return -EAGAIN;
+	}
 
-		if (signal_pending(current)) {
-			kfree_skb(skb);
-			return sock_intr_errno(*timeo);
-		}
-		return 1;
+	__set_current_state(TASK_INTERRUPTIBLE);
+	add_wait_queue(&nlk->wait, &wait);
+
+	if ((atomic_read(&sk->sk_rmem_alloc) + skb->truesize > sk->sk_rcvbuf ||
+	     test_bit(NETLINK_S_CONGESTED, &nlk->state)) &&
+	    !sock_flag(sk, SOCK_DEAD))
+		*timeo = schedule_timeout(*timeo);
+
+	__set_current_state(TASK_RUNNING);
+	remove_wait_queue(&nlk->wait, &wait);
+	sock_put(sk);
+
+	if (signal_pending(current)) {
+		kfree_skb(skb);
+		return sock_intr_errno(*timeo);
 	}
-	netlink_skb_set_owner_r(skb, sk);
-	return 0;
+	return 1;
 }
 
 static int __netlink_sendskb(struct sock *sk, struct sk_buff *skb)
@@ -1307,6 +1310,7 @@ static int netlink_unicast_kernel(struct sock *sk, struct sk_buff *skb,
 	ret = -ECONNREFUSED;
 	if (nlk->netlink_rcv != NULL) {
 		ret = skb->len;
+		atomic_add(skb->truesize, &sk->sk_rmem_alloc);
 		netlink_skb_set_owner_r(skb, sk);
 		NETLINK_CB(skb).sk = ssk;
 		netlink_deliver_tap_kernel(sk, ssk, skb);
@@ -1382,14 +1386,18 @@ EXPORT_SYMBOL_GPL(netlink_strict_get_check);
 
 static int netlink_broadcast_deliver(struct sock *sk, struct sk_buff *skb)
 {
+	unsigned long rmem = atomic_add_return(skb->truesize, &sk->sk_rmem_alloc);
+	unsigned int rcvbuf = READ_ONCE(sk->sk_rcvbuf);
 	struct netlink_sock *nlk = nlk_sk(sk);
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-	    !test_bit(NETLINK_S_CONGESTED, &nlk->state)) {
+	if (rmem == skb->truesize ||
+	    (size <= rcvbuf && !test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
 		netlink_skb_set_owner_r(skb, sk);
 		__netlink_sendskb(sk, skb);
-		return atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1);
+		return size > (rcvbuf >> 1);
 	}
+
+	atomic_dec(skb->truesize, &sk->sk_rmem_alloc);
 	return -1;
 }
 
@@ -2249,6 +2257,7 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
 	struct module *module;
 	int err = -ENOBUFS;
 	int alloc_min_size;
+	unsigned int rmem;
 	int alloc_size;
 
 	if (!lock_taken)
@@ -2258,9 +2267,6 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
 		goto errout_skb;
 	}
 
-	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
-		goto errout_skb;
-
 	/* NLMSG_GOODSIZE is small to avoid high order allocations being
 	 * required, but it makes sense to _attempt_ a 32KiB allocation
 	 * to reduce number of system calls on dump operations, if user
@@ -2283,6 +2289,12 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
 	if (!skb)
 		goto errout_skb;
 
+	rmem = atomic_add_return(skb->truesize, &sk->sk_rmem_alloc);
+	if (rmem != skb->truesize && rmem >= sk->sk_rcvbuf) {
+		atomic_dec(skb->truesize, &sk->sk_rmem_alloc);
+		goto errout_skb;
+	}
+
 	/* Trim skb to allocated size. User is expected to provide buffer as
 	 * large as max(min_dump_alloc, 32KiB (max_recvmsg_len capped at
 	 * netlink_recvmsg())). dump will pack as many smaller messages as
---8<---

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc
  2025-06-24 17:08           ` Kuniyuki Iwashima
@ 2025-06-24 22:03             ` Jakub Kicinski
  2025-06-25 16:56               ` Kuniyuki Iwashima
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2025-06-24 22:03 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, edumazet, horms, jbaron, kuniyu, netdev, pabeni

On Tue, 24 Jun 2025 10:08:41 -0700 Kuniyuki Iwashima wrote:
> Date: Tue, 24 Jun 2025 07:11:57 -0700
> From: Jakub Kicinski <kuba@kernel.org>
> > On Tue, 24 Jun 2025 09:55:15 +0200 Paolo Abeni wrote:  
> > > > To be clear -- are you saying we should fix this differently?
> > > > Or perhaps that the problem doesn't exist? The change doesn't
> > > > seem very intrusive..    
> > > 
> > > AFAICS the race is possible even with netlink as netlink_unicast() runs
> > > without the socket lock, too.
> > > 
> > > The point is that for UDP the scenario with multiple threads enqueuing a
> > > packet into the same socket is a critical path, optimizing for
> > > performances and allowing some memory accounting inaccuracy makes sense.
> > > 
> > > For netlink socket, that scenario looks a patological one and I think we
> > > should prefer accuracy instead of optimization.  
> > 
> > Could you ELI5 what you mean? Are you suggesting a lock around every
> > sk_rmem write for netlink sockets? 
> > If we think this is an attack vector the attacker can simply use a UDP
> > socket instead. Or do you think it'd lead to simpler code?  
> 
> I was wondering if atomic_add_return() is expensive for netlink,
> and if not, we could use it like below. 

Ah, got it. That does look simpler. 

nit: Please don't hide the atomic_add_return() in local variable init,
as it need validation and error handling.

> I'm also not sure we want to keep the allow-at-least-one-skb rule for
> netlink though, which comes from the first condition in
> __sock_queue_rcv_skb() for UDP in the past, IIRC.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc
  2025-06-24 22:03             ` Jakub Kicinski
@ 2025-06-25 16:56               ` Kuniyuki Iwashima
  0 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-06-25 16:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Kuniyuki Iwashima, davem, edumazet, horms, jbaron, netdev, pabeni

On Tue, Jun 24, 2025 at 3:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 24 Jun 2025 10:08:41 -0700 Kuniyuki Iwashima wrote:
> > Date: Tue, 24 Jun 2025 07:11:57 -0700
> > From: Jakub Kicinski <kuba@kernel.org>
> > > On Tue, 24 Jun 2025 09:55:15 +0200 Paolo Abeni wrote:
> > > > > To be clear -- are you saying we should fix this differently?
> > > > > Or perhaps that the problem doesn't exist? The change doesn't
> > > > > seem very intrusive..
> > > >
> > > > AFAICS the race is possible even with netlink as netlink_unicast() runs
> > > > without the socket lock, too.
> > > >
> > > > The point is that for UDP the scenario with multiple threads enqueuing a
> > > > packet into the same socket is a critical path, optimizing for
> > > > performances and allowing some memory accounting inaccuracy makes sense.
> > > >
> > > > For netlink socket, that scenario looks a patological one and I think we
> > > > should prefer accuracy instead of optimization.
> > >
> > > Could you ELI5 what you mean? Are you suggesting a lock around every
> > > sk_rmem write for netlink sockets?
> > > If we think this is an attack vector the attacker can simply use a UDP
> > > socket instead. Or do you think it'd lead to simpler code?
> >
> > I was wondering if atomic_add_return() is expensive for netlink,
> > and if not, we could use it like below.
>
> Ah, got it. That does look simpler.
>
> nit: Please don't hide the atomic_add_return() in local variable init,
> as it need validation and error handling.

Makes sense.
Will post the official patch later.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-06-25 16:56 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-18 23:13 [PATCH net-next v2 0/3] Fix netlink rcvbuf wraparound Jason Baron
2025-06-18 23:13 ` [PATCH net-next v2 1/3] net: add sock_rcvbuf_has_space() helper Jason Baron
2025-06-18 23:13 ` [PATCH net-next v2 2/3] udp: use __sock_rcvbuf_has_space() helper Jason Baron
2025-06-18 23:13 ` [PATCH net-next v2 3/3] netlink: Fix wraparound of sk->sk_rmem_alloc Jason Baron
2025-06-19  6:13   ` Kuniyuki Iwashima
2025-06-23 23:35     ` Jakub Kicinski
2025-06-24  7:55       ` Paolo Abeni
2025-06-24 13:57         ` Jason Baron
2025-06-24 14:11         ` Jakub Kicinski
2025-06-24 17:08           ` Kuniyuki Iwashima
2025-06-24 22:03             ` Jakub Kicinski
2025-06-25 16:56               ` Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).