* [PATCH v1 net 0/3] udp: Fix two integer overflows when sk->sk_rcvbuf is close to INT_MAX.
@ 2025-03-23 23:09 Kuniyuki Iwashima
2025-03-23 23:09 ` [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc Kuniyuki Iwashima
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-23 23:09 UTC (permalink / raw)
To: Willem de Bruijn, David S. Miller, David Ahern, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
I got a report that UDP mem usage in /proc/net/sockstat did not
drop even after an application was terminated.
The issue could happen if sk->sk_rmem_alloc wraps around due
to a large sk->sk_rcvbuf, which was INT_MAX in our case.
Patch 2 fixes the issue, and patch 1 fixes yet another
overflow I found while investigating it.
Kuniyuki Iwashima (3):
udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
udp: Fix memory accounting leak.
selftest: net: Check wraparounds for sk->sk_rmem_alloc.
net/ipv4/udp.c | 18 ++-
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/so_rcvbuf.c | 178 ++++++++++++++++++++++++
4 files changed, 188 insertions(+), 11 deletions(-)
create mode 100644 tools/testing/selftests/net/so_rcvbuf.c
--
2.48.1
* [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
2025-03-23 23:09 [PATCH v1 net 0/3] udp: Fix two integer overflows when sk->sk_rcvbuf is close to INT_MAX Kuniyuki Iwashima
@ 2025-03-23 23:09 ` Kuniyuki Iwashima
2025-03-24 10:01 ` Eric Dumazet
2025-03-24 14:59 ` Willem de Bruijn
2025-03-23 23:09 ` [PATCH v1 net 2/3] udp: Fix memory accounting leak Kuniyuki Iwashima
2025-03-23 23:09 ` [PATCH v1 net 3/3] selftest: net: Check wraparounds for sk->sk_rmem_alloc Kuniyuki Iwashima
2 siblings, 2 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-23 23:09 UTC (permalink / raw)
To: Willem de Bruijn, David S. Miller, David Ahern, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
__udp_enqueue_schedule_skb() has the following condition:
if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
goto drop;
sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
or SO_RCVBUFFORCE.
If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
as sk->sk_rmem_alloc is also signed int.
Then, the size of the incoming skb is added to sk->sk_rmem_alloc
unconditionally.
This results in integer overflow (possibly multiple times) of
sk->sk_rmem_alloc and allows a single socket to queue skbs totalling
up to net.core.udp_mem[1].
For example, if we set a large value to udp_mem[1] and INT_MAX to
sk->sk_rcvbuf and flood packets to the socket, we can see multiple
overflows:
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15
^- PAGE_SHIFT
# ss -uam
State Recv-Q ...
UNCONN -1757018048 ... <-- flipping the sign repeatedly
skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
Previously, we had a boundary check for INT_MAX, which was removed by
commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
A complete fix would be to revert it and cap the right operand by
INT_MAX:
rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
goto uncharge_drop;
but we do not want to add the expensive atomic_add_return() back just
for the corner case.
So, let's perform the first check as unsigned int to detect the
integer overflow.
Note that we still allow a single wraparound, which can be observed
from userspace, but this is acceptable: it is unlikely that recv() is
not called for a long period, and the negative value soon flips back
to positive after a few recv() calls.
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 3 mem 524288 <-- (INT_MAX + 1) >> 12
# ss -uam
State Recv-Q ...
UNCONN -2147482816 ... <-- INT_MAX + 831 bytes
skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)
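For illustration (not part of the patch), a minimal userspace sketch of
the two comparisons, written with unsigned arithmetic so the wraparound
itself is well-defined C:

#include <stdio.h>
#include <limits.h>

int main(void)
{
	unsigned int rcvbuf = INT_MAX;		/* SO_RCVBUFFORCE set to INT_MAX */
	unsigned int rmem = INT_MAX - 100u;	/* sk_rmem_alloc just below rcvbuf */
	unsigned int size = 832;		/* truesize of the next skb */

	/* Old check, both operands read as signed int:
	 * (INT_MAX - 100) > INT_MAX is false, so the skb is charged
	 * and sk_rmem_alloc crosses INT_MAX.
	 */
	if ((int)rmem > (int)rcvbuf)
		puts("signed check drops");

	rmem += size;	/* now INT_MAX + 732, negative when read as int */

	/* New check, unsigned: the wrapped value compares greater than
	 * rcvbuf, so the next packet is dropped and at most one
	 * wraparound is possible.
	 */
	if (rmem > rcvbuf)
		puts("unsigned check drops the next packet");

	return 0;
}

Only the second message is printed, matching the behaviour above.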
Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/ipv4/udp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a9bb9ce5438e..a1e60aab29b5 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
*/
rmem = atomic_read(&sk->sk_rmem_alloc);
rcvbuf = READ_ONCE(sk->sk_rcvbuf);
- if (rmem > rcvbuf)
+ if ((unsigned int)rmem > rcvbuf)
goto drop;
/* Under mem pressure, it might be helpful to help udp_recvmsg()
--
2.48.1
* [PATCH v1 net 2/3] udp: Fix memory accounting leak.
2025-03-23 23:09 [PATCH v1 net 0/3] udp: Fix two integer overflows when sk->sk_rcvbuf is close to INT_MAX Kuniyuki Iwashima
2025-03-23 23:09 ` [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc Kuniyuki Iwashima
@ 2025-03-23 23:09 ` Kuniyuki Iwashima
2025-03-23 23:09 ` [PATCH v1 net 3/3] selftest: net: Check wraparounds for sk->sk_rmem_alloc Kuniyuki Iwashima
2 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-23 23:09 UTC (permalink / raw)
To: Willem de Bruijn, David S. Miller, David Ahern, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev,
Matt Dowling
Matt Dowling reported a weird UDP memory usage issue.
Under normal operation, the UDP memory usage reported in /proc/net/sockstat
remains close to zero. However, it occasionally spiked to 524,288 pages
and never dropped. Moreover, the value doubled when the application was
terminated. Finally, it caused intermittent packet drops.
We can reproduce the issue with the script below [0]:
1. /proc/net/sockstat reports 0 pages
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 1 mem 0
2. Run the script till the report reaches 524,288
# python3 test.py & sleep 5
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 3 mem 524288 <-- (INT_MAX + 1) >> PAGE_SHIFT
3. Kill the socket and confirm the number never drops
# pkill python3 && sleep 5
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 1 mem 524288
4. (necessary since v6.0) Trigger proto_memory_pcpu_drain()
# python3 test.py & sleep 1 && pkill python3
5. The number doubles
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 1 mem 1048577
The application set INT_MAX to SO_RCVBUF, which triggered an integer
overflow in udp_rmem_release().
When a socket is close()d, udp_destruct_common() purges its receive
queue and sums up skb->truesize in the queue. This total is calculated
and stored in a local unsigned integer variable.
The total size is then passed to udp_rmem_release() to adjust memory
accounting. However, because the function takes a signed integer
argument, the total size can wrap around, causing an overflow.
Then, the released amount is calculated as follows:
1) Add size to sk->sk_forward_alloc.
2) Round down sk->sk_forward_alloc to the nearest lower multiple of
PAGE_SIZE and assign it to amount.
3) Subtract amount from sk->sk_forward_alloc.
4) Pass amount >> PAGE_SHIFT to __sk_mem_reduce_allocated().
When the issue occurred, the total in udp_destruct_common() was 2147484480
(INT_MAX + 833), which was cast to -2147482816 in udp_rmem_release().
At 1), sk->sk_forward_alloc changes from 3264 to -2147479552, and at 2),
-2147479552 is assigned to amount. 3) then reverts the wraparound, so we
don't see a warning in inet_sock_destruct(). However, udp_memory_allocated
ends up doubling at 4), because a negative page count is passed to
__sk_mem_reduce_allocated().
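For illustration (not part of the patch), the same arithmetic in a
standalone userspace sketch; the numbers match the report above and
PAGE_SIZE is assumed to be 4096:

#include <stdio.h>

#define PAGE_SIZE	4096
#define PAGE_SHIFT	12

int main(void)
{
	/* Sum of skb->truesize at close(): INT_MAX + 833 in the report. */
	unsigned int size = 2147484480u;
	int forward_alloc = 3264;	/* sk->sk_forward_alloc before close() */
	int partial = 0;		/* full reclaim, as in udp_destruct_common() */
	int amt;
	/* udp_rmem_release() received the size as a signed int,
	 * i.e. -2147482816 on common compilers.
	 */
	int ssize = (int)size;

	forward_alloc += ssize;				/* 1) -> -2147479552 */
	amt = (forward_alloc - partial) & ~(PAGE_SIZE - 1); /* 2) -> -2147479552 */
	forward_alloc -= amt;				/* 3) -> 0, no warning */

	/* 4) a negative page count reaches __sk_mem_reduce_allocated(),
	 * so udp_memory_allocated grows by ~524287 pages instead of
	 * shrinking.
	 */
	printf("pages passed at 4): %d\n", amt >> PAGE_SHIFT);	/* -524287 */
	return 0;
}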
Since commit 3cd3399dd7a8 ("net: implement per-cpu reserves for
memory_allocated"), memory usage no longer doubles immediately after
a socket is close()d because __sk_mem_reduce_allocated() caches the
amount in udp_memory_per_cpu_fw_alloc. However, the next time a UDP
socket receives a packet, the subtraction takes effect, causing UDP
memory usage to double.
This issue makes further memory allocation fail once the socket's
sk->sk_rmem_alloc exceeds net.ipv4.udp_rmem_min, resulting in packet
drops.
To prevent this issue, let's use unsigned int for the calculation and
call sk_forward_alloc_add() only once for the small delta.
Note that first_packet_length() also potentially has the same problem.
[0]:
from socket import *
SO_RCVBUFFORCE = 33
INT_MAX = (2 ** 31) - 1
s = socket(AF_INET, SOCK_DGRAM)
s.bind(('', 0))
s.setsockopt(SOL_SOCKET, SO_RCVBUFFORCE, INT_MAX)
c = socket(AF_INET, SOCK_DGRAM)
c.connect(s.getsockname())
data = b'a' * 100
while True:
    c.send(data)
Fixes: f970bd9e3a06 ("udp: implement memory accounting helpers")
Reported-by: Matt Dowling <madowlin@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/ipv4/udp.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a1e60aab29b5..ca6a5eb0ddac 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1626,12 +1626,12 @@ static bool udp_skb_has_head_state(struct sk_buff *skb)
}
/* fully reclaim rmem/fwd memory allocated for skb */
-static void udp_rmem_release(struct sock *sk, int size, int partial,
- bool rx_queue_lock_held)
+static void udp_rmem_release(struct sock *sk, unsigned int size,
+ int partial, bool rx_queue_lock_held)
{
struct udp_sock *up = udp_sk(sk);
struct sk_buff_head *sk_queue;
- int amt;
+ unsigned int amt;
if (likely(partial)) {
up->forward_deficit += size;
@@ -1651,10 +1651,8 @@ static void udp_rmem_release(struct sock *sk, int size, int partial,
if (!rx_queue_lock_held)
spin_lock(&sk_queue->lock);
-
- sk_forward_alloc_add(sk, size);
- amt = (sk->sk_forward_alloc - partial) & ~(PAGE_SIZE - 1);
- sk_forward_alloc_add(sk, -amt);
+ amt = (size + sk->sk_forward_alloc - partial) & ~(PAGE_SIZE - 1);
+ sk_forward_alloc_add(sk, size - amt);
if (amt)
__sk_mem_reduce_allocated(sk, amt >> PAGE_SHIFT);
@@ -1836,7 +1834,7 @@ EXPORT_SYMBOL_GPL(skb_consume_udp);
static struct sk_buff *__first_packet_length(struct sock *sk,
struct sk_buff_head *rcvq,
- int *total)
+ unsigned int *total)
{
struct sk_buff *skb;
@@ -1869,8 +1867,8 @@ static int first_packet_length(struct sock *sk)
{
struct sk_buff_head *rcvq = &udp_sk(sk)->reader_queue;
struct sk_buff_head *sk_queue = &sk->sk_receive_queue;
+ unsigned int total = 0;
struct sk_buff *skb;
- int total = 0;
int res;
spin_lock_bh(&rcvq->lock);
--
2.48.1
* [PATCH v1 net 3/3] selftest: net: Check wraparounds for sk->sk_rmem_alloc.
2025-03-23 23:09 [PATCH v1 net 0/3] udp: Fix two integer overflows when sk->sk_rcvbuf is close to INT_MAX Kuniyuki Iwashima
2025-03-23 23:09 ` [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc Kuniyuki Iwashima
2025-03-23 23:09 ` [PATCH v1 net 2/3] udp: Fix memory accounting leak Kuniyuki Iwashima
@ 2025-03-23 23:09 ` Kuniyuki Iwashima
2025-03-24 15:58 ` Willem de Bruijn
2 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-23 23:09 UTC (permalink / raw)
To: Willem de Bruijn, David S. Miller, David Ahern, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
The test creates client and server sockets and sets the server's
receive buffer to INT_MAX via SO_RCVBUFFORCE.
Then, the client floods packets to the server until the UDP memory
usage reaches (INT_MAX + 1) >> PAGE_SHIFT.
Finally, both sockets are close()d, and the last assert makes sure
that the memory usage drops to 0.
If needed, we can extend the test later for other protocols.
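For reference, the target page count in the loop is simply (INT_MAX + 1)
divided by the page size; a quick standalone check (assuming 4 KiB pages,
as in the logs below):

#include <stdio.h>
#include <limits.h>
#include <unistd.h>

int main(void)
{
	long long page_size = sysconf(_SC_PAGESIZE);	/* 4096 in the logs below */
	long long max_pages = ((long long)INT_MAX + 1) / page_size;

	printf("expected max pages: %lld\n", max_pages);	/* 524288 with 4 KiB pages */
	return 0;
}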
Without patch 1:
# RUN so_rcvbuf.udp_ipv4.rmem_max ...
# so_rcvbuf.c:160:rmem_max:Expected pages (524800) <= *variant->max_pages (524288)
# rmem_max: Test terminated by assertion
# FAIL so_rcvbuf.udp_ipv4.rmem_max
not ok 1 so_rcvbuf.udp_ipv4.rmem_max
Without patch 2:
# RUN so_rcvbuf.udp_ipv4.rmem_max ...
# so_rcvbuf.c:167:rmem_max:max_pages: 524288
# so_rcvbuf.c:175:rmem_max:Expected get_prot_pages(_metadata, variant) (524288) == 0 (0)
# rmem_max: Test terminated by assertion
# FAIL so_rcvbuf.udp_ipv4.rmem_max
not ok 1 so_rcvbuf.udp_ipv4.rmem_max
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/so_rcvbuf.c | 178 ++++++++++++++++++++++++
3 files changed, 180 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/net/so_rcvbuf.c
diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 28a715a8ef2b..befbdfb26581 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -41,6 +41,7 @@ sk_so_peek_off
socket
so_incoming_cpu
so_netns_cookie
+so_rcvbuf
so_txtime
stress_reuseport_listen
tap
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 8f32b4f01aee..d04428eaa819 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -83,7 +83,7 @@ TEST_GEN_PROGS += sk_bind_sendto_listen
TEST_GEN_PROGS += sk_connect_zero_addr
TEST_GEN_PROGS += sk_so_peek_off
TEST_PROGS += test_ingress_egress_chaining.sh
-TEST_GEN_PROGS += so_incoming_cpu
+TEST_GEN_PROGS += so_incoming_cpu so_rcvbuf
TEST_PROGS += sctp_vrf.sh
TEST_GEN_FILES += sctp_hello
TEST_GEN_FILES += ip_local_port_range
diff --git a/tools/testing/selftests/net/so_rcvbuf.c b/tools/testing/selftests/net/so_rcvbuf.c
new file mode 100644
index 000000000000..6e20bafce32e
--- /dev/null
+++ b/tools/testing/selftests/net/so_rcvbuf.c
@@ -0,0 +1,178 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#include <limits.h>
+#include <netinet/in.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+#include "../kselftest_harness.h"
+
+static int udp_max_pages;
+
+static int udp_parse_pages(struct __test_metadata *_metadata,
+ char *line, int *pages)
+{
+ int ret, unused;
+
+ if (strncmp(line, "UDP:", 4))
+ return -1;
+
+ ret = sscanf(line + 4, " inuse %d mem %d", &unused, pages);
+ ASSERT_EQ(2, ret);
+
+ return 0;
+}
+
+FIXTURE(so_rcvbuf)
+{
+ union {
+ struct sockaddr addr;
+ struct sockaddr_in addr4;
+ struct sockaddr_in6 addr6;
+ };
+ socklen_t addrlen;
+ int server;
+ int client;
+};
+
+FIXTURE_VARIANT(so_rcvbuf)
+{
+ int family;
+ int type;
+ int protocol;
+ int *max_pages;
+ int (*parse_pages)(struct __test_metadata *_metadata,
+ char *line, int *pages);
+};
+
+FIXTURE_VARIANT_ADD(so_rcvbuf, udp_ipv4)
+{
+ .family = AF_INET,
+ .type = SOCK_DGRAM,
+ .protocol = 0,
+ .max_pages = &udp_max_pages,
+ .parse_pages = udp_parse_pages,
+};
+
+FIXTURE_VARIANT_ADD(so_rcvbuf, udp_ipv6)
+{
+ .family = AF_INET6,
+ .type = SOCK_DGRAM,
+ .protocol = 0,
+ .max_pages = &udp_max_pages,
+ .parse_pages = udp_parse_pages,
+};
+
+static int get_page_shift(void)
+{
+ int page_size = getpagesize();
+ int page_shift = 0;
+
+ while (page_size > 1) {
+ page_size >>= 1;
+ page_shift++;
+ }
+
+ return page_shift;
+}
+
+FIXTURE_SETUP(so_rcvbuf)
+{
+ self->addr.sa_family = variant->family;
+
+ if (variant->family == AF_INET)
+ self->addrlen = sizeof(struct sockaddr_in);
+ else
+ self->addrlen = sizeof(struct sockaddr_in6);
+
+ udp_max_pages = (INT_MAX + 1L) >> get_page_shift();
+}
+
+FIXTURE_TEARDOWN(so_rcvbuf)
+{
+}
+
+static void create_socketpair(struct __test_metadata *_metadata,
+ FIXTURE_DATA(so_rcvbuf) *self,
+ const FIXTURE_VARIANT(so_rcvbuf) *variant)
+{
+ int ret;
+
+ self->server = socket(variant->family, variant->type, variant->protocol);
+ ASSERT_NE(self->server, -1);
+
+ self->client = socket(variant->family, variant->type, variant->protocol);
+ ASSERT_NE(self->client, -1);
+
+ ret = bind(self->server, &self->addr, self->addrlen);
+ ASSERT_EQ(ret, 0);
+
+ ret = getsockname(self->server, &self->addr, &self->addrlen);
+ ASSERT_EQ(ret, 0);
+
+ ret = connect(self->client, &self->addr, self->addrlen);
+ ASSERT_EQ(ret, 0);
+}
+
+static int get_prot_pages(struct __test_metadata *_metadata,
+ const FIXTURE_VARIANT(so_rcvbuf) *variant)
+{
+ char *line = NULL;
+ size_t unused;
+ int pages = 0;
+ FILE *f;
+
+ f = fopen("/proc/net/sockstat", "r");
+ ASSERT_NE(NULL, f);
+
+ while (getline(&line, &unused, f) != -1)
+ if (!variant->parse_pages(_metadata, line, &pages))
+ break;
+
+ free(line);
+ fclose(f);
+
+ return pages;
+}
+
+TEST_F(so_rcvbuf, rmem_max)
+{
+ char buf[16] = {};
+ int ret, i;
+
+ create_socketpair(_metadata, self, variant);
+
+ ret = setsockopt(self->server, SOL_SOCKET, SO_RCVBUFFORCE,
+ &(int){INT_MAX}, sizeof(int));
+ ASSERT_EQ(ret, 0);
+
+ ASSERT_EQ(get_prot_pages(_metadata, variant), 0);
+
+ for (i = 1; ; i++) {
+ ret = send(self->client, buf, sizeof(buf), 0);
+ ASSERT_EQ(ret, sizeof(buf));
+
+ if (i % 10000 == 0) {
+ int pages = get_prot_pages(_metadata, variant);
+
+ /* sk_rmem_alloc wrapped around too much ? */
+ ASSERT_LE(pages, *variant->max_pages);
+
+ if (pages == *variant->max_pages)
+ break;
+ }
+ }
+
+ TH_LOG("max_pages: %d", get_prot_pages(_metadata, variant));
+
+ close(self->client);
+ close(self->server);
+
+ /* Give RCU a chance to call udp_destruct_common() */
+ sleep(5);
+
+ ASSERT_EQ(get_prot_pages(_metadata, variant), 0);
+}
+
+TEST_HARNESS_MAIN
--
2.48.1
* Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
2025-03-23 23:09 ` [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc Kuniyuki Iwashima
@ 2025-03-24 10:01 ` Eric Dumazet
2025-03-24 18:00 ` Kuniyuki Iwashima
2025-03-24 14:59 ` Willem de Bruijn
1 sibling, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2025-03-24 10:01 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Willem de Bruijn, David S. Miller, David Ahern, Jakub Kicinski,
Paolo Abeni, Simon Horman, Kuniyuki Iwashima, netdev
On Mon, Mar 24, 2025 at 12:11 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> __udp_enqueue_schedule_skb() has the following condition:
>
> if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> goto drop;
>
> sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> or SO_RCVBUFFORCE.
>
> If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> as sk->sk_rmem_alloc is also signed int.
>
> Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> unconditionally.
>
> This results in integer overflow (possibly multiple times) on
> sk->sk_rmem_alloc and allows a single socket to have skb up to
> net.core.udp_mem[1].
>
> For example, if we set a large value to udp_mem[1] and INT_MAX to
> sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> overflows:
>
> # cat /proc/net/sockstat | grep UDP:
> UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15
> ^- PAGE_SHIFT
> # ss -uam
> State Recv-Q ...
> UNCONN -1757018048 ... <-- flipping the sign repeatedly
> skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
>
> Previously, we had a boundary check for INT_MAX, which was removed by
> commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
>
> A complete fix would be to revert it and cap the right operand by
> INT_MAX:
>
> rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> goto uncharge_drop;
>
> but we do not want to add the expensive atomic_add_return() back just
> for the corner case.
>
> So, let's perform the first check as unsigned int to detect the
> integer overflow.
>
> Note that we still allow a single wraparound, which can be observed
> from userspace, but it's acceptable considering it's unlikely that
> no recv() is called for a long period, and the negative value will
> soon flip back to positive after a few recv() calls.
>
> # cat /proc/net/sockstat | grep UDP:
> UDP: inuse 3 mem 524288 <-- (INT_MAX + 1) >> 12
>
> # ss -uam
> State Recv-Q ...
> UNCONN -2147482816 ... <-- INT_MAX + 831 bytes
> skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)
>
> Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
> net/ipv4/udp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index a9bb9ce5438e..a1e60aab29b5 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> */
> rmem = atomic_read(&sk->sk_rmem_alloc);
> rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> - if (rmem > rcvbuf)
> + if ((unsigned int)rmem > rcvbuf)
SGTM, but maybe make rmem and rcvbuf 'unsigned int ' to avoid casts ?
BTW piling 2GB worth of skbs in a single UDP receive queue means a
latency spike when
__skb_queue_purge(&sk->sk_receive_queue) is called, say from
inet_sock_destruct(),
which is a problem on its own.
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index db606f7e4163809d8220be1c1a4adb5662fc914e..575baac391e8af911fc1eff3f2d8e64bb9aa4c70
100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1725,9 +1725,9 @@ static int udp_rmem_schedule(struct sock *sk, int size)
int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
{
struct sk_buff_head *list = &sk->sk_receive_queue;
- int rmem, err = -ENOMEM;
+ unsigned int rmem, rcvbuf;
+ int size, err = -ENOMEM;
spinlock_t *busy = NULL;
- int size, rcvbuf;
/* Immediately drop when the receive queue is full.
* Always allow at least one packet.
* Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
2025-03-23 23:09 ` [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc Kuniyuki Iwashima
2025-03-24 10:01 ` Eric Dumazet
@ 2025-03-24 14:59 ` Willem de Bruijn
2025-03-24 18:10 ` Kuniyuki Iwashima
1 sibling, 1 reply; 15+ messages in thread
From: Willem de Bruijn @ 2025-03-24 14:59 UTC (permalink / raw)
To: Kuniyuki Iwashima, Willem de Bruijn, David S. Miller, David Ahern,
Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
Kuniyuki Iwashima wrote:
> __udp_enqueue_schedule_skb() has the following condition:
>
> if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> goto drop;
>
> sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> or SO_RCVBUFFORCE.
>
> If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> as sk->sk_rmem_alloc is also signed int.
>
> Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> unconditionally.
>
> This results in integer overflow (possibly multiple times) on
> sk->sk_rmem_alloc and allows a single socket to have skb up to
> net.core.udp_mem[1].
>
> For example, if we set a large value to udp_mem[1] and INT_MAX to
> sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> overflows:
>
> # cat /proc/net/sockstat | grep UDP:
> UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15
> ^- PAGE_SHIFT
> # ss -uam
> State Recv-Q ...
> UNCONN -1757018048 ... <-- flipping the sign repeatedly
> skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
>
> Previously, we had a boundary check for INT_MAX, which was removed by
> commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
>
> A complete fix would be to revert it and cap the right operand by
> INT_MAX:
>
> rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> goto uncharge_drop;
>
> but we do not want to add the expensive atomic_add_return() back just
> for the corner case.
>
> So, let's perform the first check as unsigned int to detect the
> integer overflow.
>
> Note that we still allow a single wraparound, which can be observed
> from userspace, but it's acceptable considering it's unlikely that
> no recv() is called for a long period, and the negative value will
> soon flip back to positive after a few recv() calls.
Can we do better than this?
Is this because of the "Always allow at least one packet" below, and
due to testing the value of the counter without skb->truesize added?
/* Immediately drop when the receive queue is full.
* Always allow at least one packet.
*/
rmem = atomic_read(&sk->sk_rmem_alloc);
rcvbuf = READ_ONCE(sk->sk_rcvbuf);
if (rmem > rcvbuf)
goto drop;
>
> # cat /proc/net/sockstat | grep UDP:
> UDP: inuse 3 mem 524288 <-- (INT_MAX + 1) >> 12
>
> # ss -uam
> State Recv-Q ...
> UNCONN -2147482816 ... <-- INT_MAX + 831 bytes
> skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)
>
> Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
> net/ipv4/udp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index a9bb9ce5438e..a1e60aab29b5 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> */
> rmem = atomic_read(&sk->sk_rmem_alloc);
> rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> - if (rmem > rcvbuf)
> + if ((unsigned int)rmem > rcvbuf)
> goto drop;
>
> /* Under mem pressure, it might be helpful to help udp_recvmsg()
> --
> 2.48.1
>
* Re: [PATCH v1 net 3/3] selftest: net: Check wraparounds for sk->sk_rmem_alloc.
2025-03-23 23:09 ` [PATCH v1 net 3/3] selftest: net: Check wraparounds for sk->sk_rmem_alloc Kuniyuki Iwashima
@ 2025-03-24 15:58 ` Willem de Bruijn
2025-03-24 18:25 ` Kuniyuki Iwashima
0 siblings, 1 reply; 15+ messages in thread
From: Willem de Bruijn @ 2025-03-24 15:58 UTC (permalink / raw)
To: Kuniyuki Iwashima, Willem de Bruijn, David S. Miller, David Ahern,
Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
Kuniyuki Iwashima wrote:
> The test creates client and server sockets and sets INT_MAX to the
> server's SO_RCVBUFFORCE.
>
> Then, the client floods packets to the server until the UDP memory
> usage reaches (INT_MAX + 1) >> PAGE_SHIFT.
>
> Finally, both sockets are close()d, and the last assert makes sure
> that the memory usage drops to 0.
>
> If needed, we can extend the test later for other protocols.
>
> Without patch 1:
>
> # RUN so_rcvbuf.udp_ipv4.rmem_max ...
> # so_rcvbuf.c:160:rmem_max:Expected pages (524800) <= *variant->max_pages (524288)
> # rmem_max: Test terminated by assertion
> # FAIL so_rcvbuf.udp_ipv4.rmem_max
> not ok 1 so_rcvbuf.udp_ipv4.rmem_max
>
> Without patch 2:
>
> # RUN so_rcvbuf.udp_ipv4.rmem_max ...
> # so_rcvbuf.c:167:rmem_max:max_pages: 524288
> # so_rcvbuf.c:175:rmem_max:Expected get_prot_pages(_metadata, variant) (524288) == 0 (0)
> # rmem_max: Test terminated by assertion
> # FAIL so_rcvbuf.udp_ipv4.rmem_max
> not ok 1 so_rcvbuf.udp_ipv4.rmem_max
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
> tools/testing/selftests/net/.gitignore | 1 +
> tools/testing/selftests/net/Makefile | 2 +-
> tools/testing/selftests/net/so_rcvbuf.c | 178 ++++++++++++++++++++++++
> 3 files changed, 180 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/selftests/net/so_rcvbuf.c
>
> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
> index 28a715a8ef2b..befbdfb26581 100644
> --- a/tools/testing/selftests/net/.gitignore
> +++ b/tools/testing/selftests/net/.gitignore
> @@ -41,6 +41,7 @@ sk_so_peek_off
> socket
> so_incoming_cpu
> so_netns_cookie
> +so_rcvbuf
> so_txtime
> stress_reuseport_listen
> tap
> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
> index 8f32b4f01aee..d04428eaa819 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -83,7 +83,7 @@ TEST_GEN_PROGS += sk_bind_sendto_listen
> TEST_GEN_PROGS += sk_connect_zero_addr
> TEST_GEN_PROGS += sk_so_peek_off
> TEST_PROGS += test_ingress_egress_chaining.sh
> -TEST_GEN_PROGS += so_incoming_cpu
> +TEST_GEN_PROGS += so_incoming_cpu so_rcvbuf
> TEST_PROGS += sctp_vrf.sh
> TEST_GEN_FILES += sctp_hello
> TEST_GEN_FILES += ip_local_port_range
> diff --git a/tools/testing/selftests/net/so_rcvbuf.c b/tools/testing/selftests/net/so_rcvbuf.c
> new file mode 100644
> index 000000000000..6e20bafce32e
> --- /dev/null
> +++ b/tools/testing/selftests/net/so_rcvbuf.c
> @@ -0,0 +1,178 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright Amazon.com Inc. or its affiliates. */
> +
> +#include <limits.h>
> +#include <netinet/in.h>
> +#include <sys/socket.h>
> +#include <unistd.h>
> +
> +#include "../kselftest_harness.h"
> +
> +static int udp_max_pages;
> +
> +static int udp_parse_pages(struct __test_metadata *_metadata,
> + char *line, int *pages)
> +{
> + int ret, unused;
> +
> + if (strncmp(line, "UDP:", 4))
> + return -1;
> +
> + ret = sscanf(line + 4, " inuse %d mem %d", &unused, pages);
> + ASSERT_EQ(2, ret);
> +
> + return 0;
> +}
> +
> +FIXTURE(so_rcvbuf)
> +{
> + union {
> + struct sockaddr addr;
> + struct sockaddr_in addr4;
> + struct sockaddr_in6 addr6;
> + };
> + socklen_t addrlen;
> + int server;
> + int client;
> +};
> +
> +FIXTURE_VARIANT(so_rcvbuf)
> +{
> + int family;
> + int type;
> + int protocol;
> + int *max_pages;
> + int (*parse_pages)(struct __test_metadata *_metadata,
> + char *line, int *pages);
> +};
> +
> +FIXTURE_VARIANT_ADD(so_rcvbuf, udp_ipv4)
> +{
> + .family = AF_INET,
> + .type = SOCK_DGRAM,
> + .protocol = 0,
> + .max_pages = &udp_max_pages,
> + .parse_pages = udp_parse_pages,
> +};
> +
> +FIXTURE_VARIANT_ADD(so_rcvbuf, udp_ipv6)
> +{
> + .family = AF_INET6,
> + .type = SOCK_DGRAM,
> + .protocol = 0,
> + .max_pages = &udp_max_pages,
> + .parse_pages = udp_parse_pages,
> +};
> +
> +static int get_page_shift(void)
> +{
> + int page_size = getpagesize();
> + int page_shift = 0;
> +
> + while (page_size > 1) {
> + page_size >>= 1;
> + page_shift++;
> + }
> +
> + return page_shift;
> +}
> +
> +FIXTURE_SETUP(so_rcvbuf)
> +{
> + self->addr.sa_family = variant->family;
> +
> + if (variant->family == AF_INET)
> + self->addrlen = sizeof(struct sockaddr_in);
> + else
> + self->addrlen = sizeof(struct sockaddr_in6);
> +
> + udp_max_pages = (INT_MAX + 1L) >> get_page_shift();
> +}
> +
> +FIXTURE_TEARDOWN(so_rcvbuf)
> +{
> +}
> +
> +static void create_socketpair(struct __test_metadata *_metadata,
> + FIXTURE_DATA(so_rcvbuf) *self,
> + const FIXTURE_VARIANT(so_rcvbuf) *variant)
> +{
> + int ret;
> +
> + self->server = socket(variant->family, variant->type, variant->protocol);
> + ASSERT_NE(self->server, -1);
> +
> + self->client = socket(variant->family, variant->type, variant->protocol);
> + ASSERT_NE(self->client, -1);
> +
> + ret = bind(self->server, &self->addr, self->addrlen);
> + ASSERT_EQ(ret, 0);
> +
> + ret = getsockname(self->server, &self->addr, &self->addrlen);
> + ASSERT_EQ(ret, 0);
> +
> + ret = connect(self->client, &self->addr, self->addrlen);
> + ASSERT_EQ(ret, 0);
> +}
> +
> +static int get_prot_pages(struct __test_metadata *_metadata,
> + const FIXTURE_VARIANT(so_rcvbuf) *variant)
> +{
> + char *line = NULL;
> + size_t unused;
> + int pages = 0;
> + FILE *f;
> +
> + f = fopen("/proc/net/sockstat", "r");
> + ASSERT_NE(NULL, f);
> +
> + while (getline(&line, &unused, f) != -1)
> + if (!variant->parse_pages(_metadata, line, &pages))
> + break;
> +
> + free(line);
> + fclose(f);
> +
> + return pages;
> +}
> +
> +TEST_F(so_rcvbuf, rmem_max)
> +{
> + char buf[16] = {};
> + int ret, i;
> +
> + create_socketpair(_metadata, self, variant);
> +
> + ret = setsockopt(self->server, SOL_SOCKET, SO_RCVBUFFORCE,
> + &(int){INT_MAX}, sizeof(int));
> + ASSERT_EQ(ret, 0);
> +
> + ASSERT_EQ(get_prot_pages(_metadata, variant), 0);
> +
> + for (i = 1; ; i++) {
> + ret = send(self->client, buf, sizeof(buf), 0);
> + ASSERT_EQ(ret, sizeof(buf));
> +
> + if (i % 10000 == 0) {
> + int pages = get_prot_pages(_metadata, variant);
> +
> + /* sk_rmem_alloc wrapped around too much ? */
> + ASSERT_LE(pages, *variant->max_pages);
> +
> + if (pages == *variant->max_pages)
> + break;
Does correctness depend here on max_pages being a multiple of 10K?
> + }
> + }
> +
> + TH_LOG("max_pages: %d", get_prot_pages(_metadata, variant));
> +
> + close(self->client);
> + close(self->server);
> +
> + /* Give RCU a chance to call udp_destruct_common() */
> + sleep(5);
> +
> + ASSERT_EQ(get_prot_pages(_metadata, variant), 0);
> +}
> +
> +TEST_HARNESS_MAIN
> --
> 2.48.1
>
* Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
2025-03-24 10:01 ` Eric Dumazet
@ 2025-03-24 18:00 ` Kuniyuki Iwashima
0 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-24 18:00 UTC (permalink / raw)
To: edumazet
Cc: davem, dsahern, horms, kuba, kuni1840, kuniyu, netdev, pabeni,
willemdebruijn.kernel
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 24 Mar 2025 11:01:15 +0100
> On Mon, Mar 24, 2025 at 12:11 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > __udp_enqueue_schedule_skb() has the following condition:
> >
> > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > goto drop;
> >
> > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > or SO_RCVBUFFORCE.
> >
> > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > as sk->sk_rmem_alloc is also signed int.
> >
> > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > unconditionally.
> >
> > This results in integer overflow (possibly multiple times) on
> > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > net.core.udp_mem[1].
> >
> > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > overflows:
> >
> > # cat /proc/net/sockstat | grep UDP:
> > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15
> > ^- PAGE_SHIFT
> > # ss -uam
> > State Recv-Q ...
> > UNCONN -1757018048 ... <-- flipping the sign repeatedly
> > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> >
> > Previously, we had a boundary check for INT_MAX, which was removed by
> > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> >
> > A complete fix would be to revert it and cap the right operand by
> > INT_MAX:
> >
> > rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > goto uncharge_drop;
> >
> > but we do not want to add the expensive atomic_add_return() back just
> > for the corner case.
> >
> > So, let's perform the first check as unsigned int to detect the
> > integer overflow.
> >
> > Note that we still allow a single wraparound, which can be observed
> > from userspace, but it's acceptable considering it's unlikely that
> > no recv() is called for a long period, and the negative value will
> > soon flip back to positive after a few recv() calls.
> >
> > # cat /proc/net/sockstat | grep UDP:
> > UDP: inuse 3 mem 524288 <-- (INT_MAX + 1) >> 12
> >
> > # ss -uam
> > State Recv-Q ...
> > UNCONN -2147482816 ... <-- INT_MAX + 831 bytes
> > skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)
> >
> > Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc")
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> > net/ipv4/udp.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index a9bb9ce5438e..a1e60aab29b5 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > */
> > rmem = atomic_read(&sk->sk_rmem_alloc);
> > rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> > - if (rmem > rcvbuf)
> > + if ((unsigned int)rmem > rcvbuf)
>
> SGTM, but maybe make rmem and rcvbuf 'unsigned int ' to avoid casts ?
That's cleaner. I'll add a small comment above the comparison so
that the boundary check is not lost if these two are turned back
into int in the future.
>
> BTW piling 2GB worth of skbs in a single UDP receive queue means a
> latency spike when
> __skb_queue_purge(&sk->sk_receive_queue) is called, say from
> inet_sock_destruct(),
> which is a problem on its own.
Yes, we need to improve our application a lot :)
Thanks!
>
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index db606f7e4163809d8220be1c1a4adb5662fc914e..575baac391e8af911fc1eff3f2d8e64bb9aa4c70
> 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1725,9 +1725,9 @@ static int udp_rmem_schedule(struct sock *sk, int size)
> int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> {
> struct sk_buff_head *list = &sk->sk_receive_queue;
> - int rmem, err = -ENOMEM;
> + unsigned int rmem, rcvbuf;
> + int size, err = -ENOMEM;
> spinlock_t *busy = NULL;
> - int size, rcvbuf;
>
> /* Immediately drop when the receive queue is full.
> * Always allow at least one packet.
>
* Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
2025-03-24 14:59 ` Willem de Bruijn
@ 2025-03-24 18:10 ` Kuniyuki Iwashima
2025-03-24 19:44 ` Willem de Bruijn
0 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-24 18:10 UTC (permalink / raw)
To: willemdebruijn.kernel
Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Mon, 24 Mar 2025 10:59:49 -0400
> Kuniyuki Iwashima wrote:
> > __udp_enqueue_schedule_skb() has the following condition:
> >
> > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > goto drop;
> >
> > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > or SO_RCVBUFFORCE.
> >
> > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > as sk->sk_rmem_alloc is also signed int.
> >
> > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > unconditionally.
> >
> > This results in integer overflow (possibly multiple times) on
> > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > net.core.udp_mem[1].
> >
> > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > overflows:
> >
> > # cat /proc/net/sockstat | grep UDP:
> > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15
> > ^- PAGE_SHIFT
> > # ss -uam
> > State Recv-Q ...
> > UNCONN -1757018048 ... <-- flipping the sign repeatedly
> > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> >
> > Previously, we had a boundary check for INT_MAX, which was removed by
> > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> >
> > A complete fix would be to revert it and cap the right operand by
> > INT_MAX:
> >
> > rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > goto uncharge_drop;
> >
> > but we do not want to add the expensive atomic_add_return() back just
> > for the corner case.
> >
> > So, let's perform the first check as unsigned int to detect the
> > integer overflow.
> >
> > Note that we still allow a single wraparound, which can be observed
> > from userspace, but it's acceptable considering it's unlikely that
> > no recv() is called for a long period, and the negative value will
> > soon flip back to positive after a few recv() calls.
>
> Can we do better than this?
Another approach I had in mind was to restore the original validation
under the recvq lock but without atomic ops like
1. add another u32 as union of sk_rmem_alloc (only for UDP)
2. access it with READ_ONCE() or under the recvq lock
3. perform the validation under the lock
But it requires more changes around the error queue handling and
the general socket implementation, so it would be too invasive for
net.git, though maybe worth a try for net-next?
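Roughly, the idea might look like the following standalone sketch
(userspace stand-ins; the struct, field and function names are
hypothetical, not actual kernel code):

#include <stdint.h>
#include <stdbool.h>
#include <pthread.h>

/* Hypothetical layout: a u32 view of the receive memory counter that
 * UDP updates only under the receive queue lock, so the bound can be
 * checked before charging, without atomic_add_return().
 */
struct udp_rx_acct {
	pthread_mutex_t	lock;		/* stands in for the recvq spinlock */
	uint32_t	rmem;		/* u32 view of sk_rmem_alloc */
	uint32_t	rcvbuf;
};

bool udp_try_charge(struct udp_rx_acct *q, uint32_t size)
{
	bool ok;

	pthread_mutex_lock(&q->lock);
	/* Validate and charge under the same lock: the counter can run
	 * past rcvbuf by at most one skb, so it stays far from the u32
	 * limit even with rcvbuf == INT_MAX.
	 */
	ok = q->rmem <= q->rcvbuf;
	if (ok)
		q->rmem += size;
	pthread_mutex_unlock(&q->lock);

	return ok;
}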
> Is this because of the "Always allow at least one packet" below, and
> due to testing the value of the counter without skb->truesize added?
Yes, that's the reason, although we never receive a single packet
larger than INT_MAX.
>
> /* Immediately drop when the receive queue is full.
> * Always allow at least one packet.
> */
> rmem = atomic_read(&sk->sk_rmem_alloc);
> rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> if (rmem > rcvbuf)
> goto drop;
* Re: [PATCH v1 net 3/3] selftest: net: Check wraparounds for sk->sk_rmem_alloc.
2025-03-24 15:58 ` Willem de Bruijn
@ 2025-03-24 18:25 ` Kuniyuki Iwashima
2025-03-24 19:31 ` Willem de Bruijn
0 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-24 18:25 UTC (permalink / raw)
To: willemdebruijn.kernel
Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Mon, 24 Mar 2025 11:58:06 -0400
> > +TEST_F(so_rcvbuf, rmem_max)
> > +{
> > + char buf[16] = {};
> > + int ret, i;
> > +
> > + create_socketpair(_metadata, self, variant);
> > +
> > + ret = setsockopt(self->server, SOL_SOCKET, SO_RCVBUFFORCE,
> > + &(int){INT_MAX}, sizeof(int));
> > + ASSERT_EQ(ret, 0);
> > +
> > + ASSERT_EQ(get_prot_pages(_metadata, variant), 0);
> > +
> > + for (i = 1; ; i++) {
> > + ret = send(self->client, buf, sizeof(buf), 0);
> > + ASSERT_EQ(ret, sizeof(buf));
> > +
> > + if (i % 10000 == 0) {
> > + int pages = get_prot_pages(_metadata, variant);
> > +
> > + /* sk_rmem_alloc wrapped around too much ? */
> > + ASSERT_LE(pages, *variant->max_pages);
> > +
> > + if (pages == *variant->max_pages)
> > + break;
>
> Does correctness depend here on max_pages being a multiple of 10K?
10K may be too conservative, but we at least need to ensure that the
skbs accumulated between two checks exceed one PAGE_SIZE so that the
ASSERT_LE() can fail; otherwise we can't detect the multiple
wraparounds even without patch 1.
The later sleep for call_rcu() dominated the runtime compared to this
loop on my machine.
Thanks!
* Re: [PATCH v1 net 3/3] selftest: net: Check wraparounds for sk->sk_rmem_alloc.
2025-03-24 18:25 ` Kuniyuki Iwashima
@ 2025-03-24 19:31 ` Willem de Bruijn
0 siblings, 0 replies; 15+ messages in thread
From: Willem de Bruijn @ 2025-03-24 19:31 UTC (permalink / raw)
To: Kuniyuki Iwashima, willemdebruijn.kernel
Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev,
pabeni
Kuniyuki Iwashima wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Mon, 24 Mar 2025 11:58:06 -0400
> > > +TEST_F(so_rcvbuf, rmem_max)
> > > +{
> > > + char buf[16] = {};
> > > + int ret, i;
> > > +
> > > + create_socketpair(_metadata, self, variant);
> > > +
> > > + ret = setsockopt(self->server, SOL_SOCKET, SO_RCVBUFFORCE,
> > > + &(int){INT_MAX}, sizeof(int));
> > > + ASSERT_EQ(ret, 0);
> > > +
> > > + ASSERT_EQ(get_prot_pages(_metadata, variant), 0);
> > > +
> > > + for (i = 1; ; i++) {
> > > + ret = send(self->client, buf, sizeof(buf), 0);
> > > + ASSERT_EQ(ret, sizeof(buf));
> > > +
> > > + if (i % 10000 == 0) {
> > > + int pages = get_prot_pages(_metadata, variant);
> > > +
> > > + /* sk_rmem_alloc wrapped around too much ? */
> > > + ASSERT_LE(pages, *variant->max_pages);
> > > +
> > > + if (pages == *variant->max_pages)
> > > + break;
> >
> > Does correctness depend here on max_pages being a multiple of 10K?
>
> 10K may be too conservative, but at least we need to ensure
> that the size of accumulated skbs exceeds 1 PAGE_SIZE to
> fail on the ASSERT_LE(), otherwise we can't detect the multiple
> wraparounds even without patch 1.
Thanks. It took me some time to understand. Without overflow,
the pages counter will saturate at max_pages as the queue fills up.
> The later sleep for call_rcu() was dominant than this loop on
> my machine.
Ack.
* Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
2025-03-24 18:10 ` Kuniyuki Iwashima
@ 2025-03-24 19:44 ` Willem de Bruijn
2025-03-24 20:46 ` Kuniyuki Iwashima
0 siblings, 1 reply; 15+ messages in thread
From: Willem de Bruijn @ 2025-03-24 19:44 UTC (permalink / raw)
To: Kuniyuki Iwashima, willemdebruijn.kernel
Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev,
pabeni
Kuniyuki Iwashima wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Mon, 24 Mar 2025 10:59:49 -0400
> > Kuniyuki Iwashima wrote:
> > > __udp_enqueue_schedule_skb() has the following condition:
> > >
> > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > > goto drop;
> > >
> > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > > or SO_RCVBUFFORCE.
> > >
> > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > > as sk->sk_rmem_alloc is also signed int.
> > >
> > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > > unconditionally.
> > >
> > > This results in integer overflow (possibly multiple times) on
> > > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > > net.core.udp_mem[1].
> > >
> > > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > > overflows:
> > >
> > > # cat /proc/net/sockstat | grep UDP:
> > > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15
> > > ^- PAGE_SHIFT
> > > # ss -uam
> > > State Recv-Q ...
> > > UNCONN -1757018048 ... <-- flipping the sign repeatedly
> > > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> > >
> > > Previously, we had a boundary check for INT_MAX, which was removed by
> > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> > >
> > > A complete fix would be to revert it and cap the right operand by
> > > INT_MAX:
> > >
> > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > > goto uncharge_drop;
> > >
> > > but we do not want to add the expensive atomic_add_return() back just
> > > for the corner case.
> > >
> > > So, let's perform the first check as unsigned int to detect the
> > > integer overflow.
> > >
> > > Note that we still allow a single wraparound, which can be observed
> > > from userspace, but it's acceptable considering it's unlikely that
> > > no recv() is called for a long period, and the negative value will
> > > soon flip back to positive after a few recv() calls.
> >
> > Can we do better than this?
>
> Another approach I had in mind was to restore the original validation
> under the recvq lock but without atomic ops like
>
> 1. add another u32 as union of sk_rmem_alloc (only for UDP)
> 2. access it with READ_ONCE() or under the recvq lock
> 3. perform the validation under the lock
>
> But it requires more changes around the error queue handling and
> the general socket impl, so will be too invasive for net.git but
> maybe worth a try for net-next ?
Definitely not net material. Adding more complexity here
would also need some convincing benchmark data probably.
>
> > Is this because of the "Always allow at least one packet" below, and
> > due to testing the value of the counter without skb->truesize added?
>
> Yes, that's the reason although we don't receive a single >INT_MAX
> packet.
I was surprised that we don't take the current skb size into
account when doing this calculation.
Turns out that this code used to do that.
commit 363dc73acacb ("udp: be less conservative with sock rmem
accounting") made this change:
- if (rmem && (rmem + size > sk->sk_rcvbuf))
+ if (rmem > sk->sk_rcvbuf)
goto drop;
The special consideration to allow one packet is to avoid starvation
with small rcvbuf, judging also from this review comment:
https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/
That clearly doesn't apply when rcvbuf is near INT_MAX.
Can we separate the tiny budget case and hard drop including the
skb->truesize for normal buffer sizes?
>
> >
> > /* Immediately drop when the receive queue is full.
> > * Always allow at least one packet.
> > */
> > rmem = atomic_read(&sk->sk_rmem_alloc);
> > rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> > if (rmem > rcvbuf)
> > goto drop;
* Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
2025-03-24 19:44 ` Willem de Bruijn
@ 2025-03-24 20:46 ` Kuniyuki Iwashima
2025-03-25 14:18 ` Willem de Bruijn
0 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-24 20:46 UTC (permalink / raw)
To: willemdebruijn.kernel
Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Mon, 24 Mar 2025 15:44:40 -0400
> Kuniyuki Iwashima wrote:
> > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Date: Mon, 24 Mar 2025 10:59:49 -0400
> > > Kuniyuki Iwashima wrote:
> > > > __udp_enqueue_schedule_skb() has the following condition:
> > > >
> > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > > > goto drop;
> > > >
> > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > > > or SO_RCVBUFFORCE.
> > > >
> > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > > > as sk->sk_rmem_alloc is also signed int.
> > > >
> > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > > > unconditionally.
> > > >
> > > > This results in integer overflow (possibly multiple times) on
> > > > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > > > net.core.udp_mem[1].
> > > >
> > > > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > > > overflows:
> > > >
> > > > # cat /proc/net/sockstat | grep UDP:
> > > > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15
> > > > ^- PAGE_SHIFT
> > > > # ss -uam
> > > > State Recv-Q ...
> > > > UNCONN -1757018048 ... <-- flipping the sign repeatedly
> > > > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> > > >
> > > > Previously, we had a boundary check for INT_MAX, which was removed by
> > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> > > >
> > > > A complete fix would be to revert it and cap the right operand by
> > > > INT_MAX:
> > > >
> > > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > > > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > > > goto uncharge_drop;
> > > >
> > > > but we do not want to add the expensive atomic_add_return() back just
> > > > for the corner case.
> > > >
> > > > So, let's perform the first check as unsigned int to detect the
> > > > integer overflow.
> > > >
> > > > Note that we still allow a single wraparound, which can be observed
> > > > from userspace, but it's acceptable considering it's unlikely that
> > > > no recv() is called for a long period, and the negative value will
> > > > soon flip back to positive after a few recv() calls.
> > >
> > > Can we do better than this?
> >
> > Another approach I had in mind was to restore the original validation
> > under the recvq lock but without atomic ops like
> >
> > 1. add another u32 as union of sk_rmem_alloc (only for UDP)
> > 2. access it with READ_ONCE() or under the recvq lock
> > 3. perform the validation under the lock
> >
> > But it requires more changes around the error queue handling and
> > the general socket impl, so will be too invasive for net.git but
> > maybe worth a try for net-next ?
>
> Definitely not net material. Adding more complexity here
> would also need some convincing benchmark data probably.
>
> >
> > > Is this because of the "Always allow at least one packet" below, and
> > > due to testing the value of the counter without skb->truesize added?
> >
> > Yes, that's the reason although we don't receive a single >INT_MAX
> > packet.
>
> I was surprised that we don't take the current skb size into
> account when doing this calculation.
>
> Turns out that this code used to do that.
>
> commit 363dc73acacb ("udp: be less conservative with sock rmem
> accounting") made this change:
>
> - if (rmem && (rmem + size > sk->sk_rcvbuf))
> + if (rmem > sk->sk_rcvbuf)
> goto drop;
>
> The special consideration to allow one packet is to avoid starvation
> with small rcvbuf, judging also from this review comment:
>
> https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/
Interesting, thanks for the info !
Now rmem is allowed to exceed rcvbuf by the total size of the skbs
arriving in parallel on every CPU, and a user may notice rmem > rcvbuf
via ss, but I guess it's allowed because fast recovery is expected.
>
> That clearly doesn't apply when rcvbuf is near INT_MAX.
> Can we separate the tiny budget case and hard drop including the
> skb->truesize for normal buffer sizes?
Maybe like this ?
	if (rcvbuf < UDP_MIN_RCVBUF) {
		if (rmem > rcvbuf)
			goto drop;
	} else {
		if (rmem + size > rcvbuf)
			goto drop;
	}
SOCK_MIN_RCVBUF has been 2K + skb since 2013, but the regression was
reported after that, in 2016, so UDP_MIN_RCVBUF would need to be larger?
But I wonder if adding new branches in the fast path is worth it for
this corner case; that's why I chose to integrate the cast into the
existing branch, allowing a small overflow, which is observable only
when no thread calls recv() and more than INT_MAX bytes of skbs are
queued.
* Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
2025-03-24 20:46 ` Kuniyuki Iwashima
@ 2025-03-25 14:18 ` Willem de Bruijn
2025-03-25 17:53 ` Kuniyuki Iwashima
0 siblings, 1 reply; 15+ messages in thread
From: Willem de Bruijn @ 2025-03-25 14:18 UTC (permalink / raw)
To: Kuniyuki Iwashima, willemdebruijn.kernel
Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev,
pabeni
Kuniyuki Iwashima wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Mon, 24 Mar 2025 15:44:40 -0400
> > Kuniyuki Iwashima wrote:
> > > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > > Date: Mon, 24 Mar 2025 10:59:49 -0400
> > > > Kuniyuki Iwashima wrote:
> > > > > __udp_enqueue_schedule_skb() has the following condition:
> > > > >
> > > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > > > > goto drop;
> > > > >
> > > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > > > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > > > > or SO_RCVBUFFORCE.
> > > > >
> > > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > > > > as sk->sk_rmem_alloc is also signed int.
> > > > >
> > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > > > > unconditionally.
> > > > >
> > > > > This results in integer overflow (possibly multiple times) on
> > > > > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > > > > net.core.udp_mem[1].
> > > > >
> > > > > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > > > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > > > > overflows:
> > > > >
> > > > > # cat /proc/net/sockstat | grep UDP:
> > > > > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15
> > > > > ^- PAGE_SHIFT
> > > > > # ss -uam
> > > > > State Recv-Q ...
> > > > > UNCONN -1757018048 ... <-- flipping the sign repeatedly
> > > > > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> > > > >
> > > > > Previously, we had a boundary check for INT_MAX, which was removed by
> > > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> > > > >
> > > > > A complete fix would be to revert it and cap the right operand by
> > > > > INT_MAX:
> > > > >
> > > > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > > > > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > > > > goto uncharge_drop;
> > > > >
> > > > > but we do not want to add the expensive atomic_add_return() back just
> > > > > for the corner case.
> > > > >
> > > > > So, let's perform the first check as unsigned int to detect the
> > > > > integer overflow.
> > > > >
> > > > > Note that we still allow a single wraparound, which can be observed
> > > > > from userspace, but it's acceptable considering it's unlikely that
> > > > > no recv() is called for a long period, and the negative value will
> > > > > soon flip back to positive after a few recv() calls.
> > > >
> > > > Can we do better than this?
> > >
> > > Another approach I had in mind was to restore the original validation
> > > under the recvq lock but without atomic ops like
> > >
> > > 1. add another u32 as union of sk_rmem_alloc (only for UDP)
> > > 2. access it with READ_ONCE() or under the recvq lock
> > > 3. perform the validation under the lock
> > >
> > > But it requires more changes around the error queue handling and
> > > the general socket impl, so it will be too invasive for net.git but
> > > maybe worth a try for net-next ?
> >
> > Definitely not net material. Adding more complexity here
> > would also need some convincing benchmark data probably.
> >
> > >
> > > > Is this because of the "Always allow at least one packet" below, and
> > > > due to testing the value of the counter without skb->truesize added?
> > >
> > > Yes, that's the reason, although we never receive a single packet
> > > larger than INT_MAX.
> >
> > I was surprised that we don't take the current skb size into
> > account when doing this calculation.
> >
> > Turns out that this code used to do that.
> >
> > commit 363dc73acacb ("udp: be less conservative with sock rmem
> > accounting") made this change:
> >
> > - if (rmem && (rmem + size > sk->sk_rcvbuf))
> > + if (rmem > sk->sk_rcvbuf)
> > goto drop;
> >
> > The special consideration to allow one packet is to avoid starvation
> > with small rcvbuf, judging also from this review comment:
> >
> > https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/
>
> Interesting, thanks for the info !
>
> Now rmem is allowed to exceed rcvbuf by the size of the incoming skb
> on each CPU, and a user may notice that rmem > rcvbuf via ss,
> but I guess it's allowed because fast recovery is expected.
>
>
> >
> > That clearly doesn't apply when rcvbuf is near INT_MAX.
> > Can we separate the tiny budget case and hard drop including the
> > skb->truesize for normal buffer sizes?
>
> Maybe like this ?
>
> if (rcvbuf < UDP_MIN_RCVBUF) {
>         if (rmem > rcvbuf)
>                 goto drop;
> } else {
>         if (rmem + size > rcvbuf)
>                 goto drop;
> }
>
> SOCK_MIN_RCVBUF has been 2K + skb since 2013, but the regression was
> reported after that in 2016, so UDP_MIN_RCVBUF would need to be larger ?
Since the only issue is the overflow, could use a higher bound like
INT_MAX >> 1.
> But I wonder if adding new branches in the fast path is worthwhile for
> the corner case, and that's why I chose integrating the cast into
> the existing branch, allowing a small overflow, which is observable
> only when no thread calls recv() and more than INT_MAX bytes of skbs
> are queued.
Okay. Though it can probably be structured so that the likely path does
not even see this?
if (rmem + size > rcvbuf) {
        if (rcvbuf > INT_MAX >> 1)
                goto drop;
        if (rmem > rcvbuf)
                goto drop;
}
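
A standalone userspace sketch of the structure suggested above
(illustrative only, not a patch hunk; the 64 KB truesize is an
assumption), showing that even with rcvbuf set all the way to INT_MAX
the strict branch drops before the counter can reach INT_MAX, so it
cannot wrap:

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Mirror of the check above: strict drop for huge buffers, the
 * "allow one extra packet" behaviour only for ordinary ones.
 */
static bool would_drop(unsigned int rmem, unsigned int size,
                       unsigned int rcvbuf)
{
        if (rmem + size > rcvbuf) {
                if (rcvbuf > INT_MAX >> 1)
                        return true;
                if (rmem > rcvbuf)
                        return true;
        }
        return false;
}

int main(void)
{
        const unsigned int rcvbuf = INT_MAX;    /* worst case */
        const unsigned int size = 64 * 1024;    /* assumed truesize */
        unsigned int rmem = 0;
        unsigned long packets = 0;

        while (!would_drop(rmem, size, rcvbuf)) {
                rmem += size;
                packets++;
        }

        printf("drop after %lu packets, rmem=%u (still below INT_MAX)\n",
               packets, rmem);
        return 0;
}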
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.
2025-03-25 14:18 ` Willem de Bruijn
@ 2025-03-25 17:53 ` Kuniyuki Iwashima
0 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-25 17:53 UTC (permalink / raw)
To: willemdebruijn.kernel
Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Tue, 25 Mar 2025 10:18:30 -0400
> Kuniyuki Iwashima wrote:
> > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Date: Mon, 24 Mar 2025 15:44:40 -0400
> > > Kuniyuki Iwashima wrote:
> > > > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > > > Date: Mon, 24 Mar 2025 10:59:49 -0400
> > > > > Kuniyuki Iwashima wrote:
> > > > > > __udp_enqueue_schedule_skb() has the following condition:
> > > > > >
> > > > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > > > > > goto drop;
> > > > > >
> > > > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > > > > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > > > > > or SO_RCVBUFFORCE.
> > > > > >
> > > > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > > > > > as sk->sk_rmem_alloc is also signed int.
> > > > > >
> > > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > > > > > unconditionally.
> > > > > >
> > > > > > This results in integer overflow (possibly multiple times) on
> > > > > > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > > > > > net.core.udp_mem[1].
> > > > > >
> > > > > > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > > > > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > > > > > overflows:
> > > > > >
> > > > > > # cat /proc/net/sockstat | grep UDP:
> > > > > > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15
> > > > > > ^- PAGE_SHIFT
> > > > > > # ss -uam
> > > > > > State Recv-Q ...
> > > > > > UNCONN -1757018048 ... <-- flipping the sign repeatedly
> > > > > > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> > > > > >
> > > > > > Previously, we had a boundary check for INT_MAX, which was removed by
> > > > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> > > > > >
> > > > > > A complete fix would be to revert it and cap the right operand by
> > > > > > INT_MAX:
> > > > > >
> > > > > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > > > > > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > > > > > goto uncharge_drop;
> > > > > >
> > > > > > but we do not want to add the expensive atomic_add_return() back just
> > > > > > for the corner case.
> > > > > >
> > > > > > So, let's perform the first check as unsigned int to detect the
> > > > > > integer overflow.
> > > > > >
> > > > > > Note that we still allow a single wraparound, which can be observed
> > > > > > from userspace, but it's acceptable considering it's unlikely that
> > > > > > no recv() is called for a long period, and the negative value will
> > > > > > soon flip back to positive after a few recv() calls.
> > > > >
> > > > > Can we do better than this?
> > > >
> > > > Another approach I had in mind was to restore the original validation
> > > > under the recvq lock but without atomic ops like
> > > >
> > > > 1. add another u32 as union of sk_rmem_alloc (only for UDP)
> > > > 2. access it with READ_ONCE() or under the recvq lock
> > > > 3. perform the validation under the lock
> > > >
> > > > But it requires more changes around the error queue handling and
> > > > the general socket impl, so it will be too invasive for net.git but
> > > > maybe worth a try for net-next ?
> > >
> > > Definitely not net material. Adding more complexity here
> > > would also need some convincing benchmark data probably.
> > >
> > > >
> > > > > Is this because of the "Always allow at least one packet" below, and
> > > > > due to testing the value of the counter without skb->truesize added?
> > > >
> > > > Yes, that's the reason, although we never receive a single packet
> > > > larger than INT_MAX.
> > >
> > > I was surprised that we don't take the current skb size into
> > > account when doing this calculation.
> > >
> > > Turns out that this code used to do that.
> > >
> > > commit 363dc73acacb ("udp: be less conservative with sock rmem
> > > accounting") made this change:
> > >
> > > - if (rmem && (rmem + size > sk->sk_rcvbuf))
> > > + if (rmem > sk->sk_rcvbuf)
> > > goto drop;
> > >
> > > The special consideration to allow one packet is to avoid starvation
> > > with small rcvbuf, judging also from this review comment:
> > >
> > > https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/
> >
> > Interesting, thanks for the info !
> >
> > Now rmem is allowed to exceed rcvbuf by the size of the incoming skb
> > on each CPU, and a user may notice that rmem > rcvbuf via ss,
> > but I guess it's allowed because fast recovery is expected.
> >
> >
> > >
> > > That clearly doesn't apply when rcvbuf is near INT_MAX.
> > > Can we separate the tiny budget case and hard drop including the
> > > skb->truesize for normal buffer sizes?
> >
> > Maybe like this ?
> >
> > if (rcvbuf < UDP_MIN_RCVBUF) {
> >         if (rmem > rcvbuf)
> >                 goto drop;
> > } else {
> >         if (rmem + size > rcvbuf)
> >                 goto drop;
> > }
> >
> > SOCK_MIN_RCVBUF has been 2K + skb since 2013, but the regression was
> > reported after that in 2016, so UDP_MIN_RCVBUF would need to be larger ?
>
> Since the only issue is the overflow, could use a higher bound like
> INT_MAX >> 1.
>
> > But I wonder if adding new branches in the fast path is worthwhile for
> > the corner case, and that's why I chose integrating the cast into
> > the existing branch, allowing a small overflow, which is observable
> > only when no thread calls recv() and more than INT_MAX bytes of skbs
> > are queued.
>
> Okay. Though it can probably be structured so that the likely path does
> not even see this?
Ah exactly, will use this and update commit message in v2.
Thanks!
>
> if (rmem + size > rcvbuf) {
>         if (rcvbuf > INT_MAX >> 1)
>                 goto drop;
>         if (rmem > rcvbuf)
>                 goto drop;
> }
>
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2025-03-25 17:54 UTC | newest]
Thread overview: 15+ messages
-- links below jump to the message on this page --
2025-03-23 23:09 [PATCH v1 net 0/3] udp: Fix two integer overflows when sk->sk_rcvbuf is close to INT_MAX Kuniyuki Iwashima
2025-03-23 23:09 ` [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc Kuniyuki Iwashima
2025-03-24 10:01 ` Eric Dumazet
2025-03-24 18:00 ` Kuniyuki Iwashima
2025-03-24 14:59 ` Willem de Bruijn
2025-03-24 18:10 ` Kuniyuki Iwashima
2025-03-24 19:44 ` Willem de Bruijn
2025-03-24 20:46 ` Kuniyuki Iwashima
2025-03-25 14:18 ` Willem de Bruijn
2025-03-25 17:53 ` Kuniyuki Iwashima
2025-03-23 23:09 ` [PATCH v1 net 2/3] udp: Fix memory accounting leak Kuniyuki Iwashima
2025-03-23 23:09 ` [PATCH v1 net 3/3] selftest: net: Check wraparounds for sk->sk_rmem_alloc Kuniyuki Iwashima
2025-03-24 15:58 ` Willem de Bruijn
2025-03-24 18:25 ` Kuniyuki Iwashima
2025-03-24 19:31 ` Willem de Bruijn