netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long
@ 2024-11-07 21:03 Mina Almasry
  2024-11-07 21:03 ` [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation Mina Almasry
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Mina Almasry @ 2024-11-07 21:03 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, Mina Almasry, linux-doc, linux-kernel
  Cc: Paolo Abeni, Simon Horman, Jonathan Corbet, Yi Lai,
	Stanislav Fomichev

Exit early if we're freeing more than 1024 frags, to prevent
looping too long.

Also minor code cleanups:
- Flip checks to reduce indentation.
- Use sizeof(*tokens) everywhere for consistentcy.

Cc: Yi Lai <yi1.lai@linux.intel.com>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Mina Almasry <almasrymina@google.com>

---

v2:
- Retain token check to prevent allocation of too much memory.
- Exit early instead of pre-checking in a loop so that we don't penalize
  well behaved applications (sdf)

---
 net/core/sock.c | 42 ++++++++++++++++++++++++------------------
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 039be95c40cf..da50df485090 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1052,32 +1052,34 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 
 #ifdef CONFIG_PAGE_POOL
 
-/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in
- * 1 syscall. The limit exists to limit the amount of memory the kernel
- * allocates to copy these tokens.
+/* This is the number of tokens and frags that the user can SO_DEVMEM_DONTNEED
+ * in 1 syscall. The limit exists to limit the amount of memory the kernel
+ * allocates to copy these tokens, and to prevent looping over the frags for
+ * too long.
  */
 #define MAX_DONTNEED_TOKENS 128
+#define MAX_DONTNEED_FRAGS 1024
 
 static noinline_for_stack int
 sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
 {
 	unsigned int num_tokens, i, j, k, netmem_num = 0;
 	struct dmabuf_token *tokens;
+	int ret = 0, num_frags = 0;
 	netmem_ref netmems[16];
-	int ret = 0;
 
 	if (!sk_is_tcp(sk))
 		return -EBADF;
 
-	if (optlen % sizeof(struct dmabuf_token) ||
+	if (optlen % sizeof(*tokens) ||
 	    optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
 		return -EINVAL;
 
-	tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
+	num_tokens = optlen / sizeof(*tokens);
+	tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
 	if (!tokens)
 		return -ENOMEM;
 
-	num_tokens = optlen / sizeof(struct dmabuf_token);
 	if (copy_from_sockptr(tokens, optval, optlen)) {
 		kvfree(tokens);
 		return -EFAULT;
@@ -1086,24 +1088,28 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
 	xa_lock_bh(&sk->sk_user_frags);
 	for (i = 0; i < num_tokens; i++) {
 		for (j = 0; j < tokens[i].token_count; j++) {
+			if (++num_frags > MAX_DONTNEED_FRAGS)
+				goto frag_limit_reached;
+
 			netmem_ref netmem = (__force netmem_ref)__xa_erase(
 				&sk->sk_user_frags, tokens[i].token_start + j);
 
-			if (netmem &&
-			    !WARN_ON_ONCE(!netmem_is_net_iov(netmem))) {
-				netmems[netmem_num++] = netmem;
-				if (netmem_num == ARRAY_SIZE(netmems)) {
-					xa_unlock_bh(&sk->sk_user_frags);
-					for (k = 0; k < netmem_num; k++)
-						WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
-					netmem_num = 0;
-					xa_lock_bh(&sk->sk_user_frags);
-				}
-				ret++;
+			if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+				continue;
+
+			netmems[netmem_num++] = netmem;
+			if (netmem_num == ARRAY_SIZE(netmems)) {
+				xa_unlock_bh(&sk->sk_user_frags);
+				for (k = 0; k < netmem_num; k++)
+					WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
+				netmem_num = 0;
+				xa_lock_bh(&sk->sk_user_frags);
 			}
+			ret++;
 		}
 	}
 
+frag_limit_reached:
 	xa_unlock_bh(&sk->sk_user_frags);
 	for (k = 0; k < netmem_num; k++)
 		WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
-- 
2.47.0.277.g8800431eea-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation
  2024-11-07 21:03 [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long Mina Almasry
@ 2024-11-07 21:03 ` Mina Almasry
  2024-11-08  1:30   ` Stanislav Fomichev
  2024-11-08  1:28 ` [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long Stanislav Fomichev
  2024-11-12  2:40 ` patchwork-bot+netdevbpf
  2 siblings, 1 reply; 12+ messages in thread
From: Mina Almasry @ 2024-11-07 21:03 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, Mina Almasry, linux-doc, linux-kernel
  Cc: Paolo Abeni, Simon Horman, Jonathan Corbet, Yi Lai,
	Stanislav Fomichev

Document new behavior when the number of frags passed is too big.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 Documentation/networking/devmem.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
index a55bf21f671c..d95363645331 100644
--- a/Documentation/networking/devmem.rst
+++ b/Documentation/networking/devmem.rst
@@ -225,6 +225,15 @@ The user must ensure the tokens are returned to the kernel in a timely manner.
 Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
 and will lead to packet drops.
 
+The user must pass no more than 128 tokens, with no more than 1024 total frags
+among the token->token_count across all the tokens. If the user provides more
+than 1024 frags, the kernel will free up to 1024 frags and return early.
+
+The kernel returns the number of actual frags freed. The number of frags freed
+can be less than the tokens provided by the user in case of:
+
+(a) an internal kernel leak bug.
+(b) the user passed more than 1024 frags.
 
 Implementation & Caveats
 ========================
-- 
2.47.0.277.g8800431eea-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long
  2024-11-07 21:03 [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long Mina Almasry
  2024-11-07 21:03 ` [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation Mina Almasry
@ 2024-11-08  1:28 ` Stanislav Fomichev
  2024-11-08  1:33   ` Mina Almasry
  2024-11-12  2:40 ` patchwork-bot+netdevbpf
  2 siblings, 1 reply; 12+ messages in thread
From: Stanislav Fomichev @ 2024-11-08  1:28 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, linux-doc, linux-kernel, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Yi Lai, Stanislav Fomichev

On 11/07, Mina Almasry wrote:
> Exit early if we're freeing more than 1024 frags, to prevent
> looping too long.
> 
> Also minor code cleanups:
> - Flip checks to reduce indentation.
> - Use sizeof(*tokens) everywhere for consistentcy.
> 
> Cc: Yi Lai <yi1.lai@linux.intel.com>
> Cc: Stanislav Fomichev <sdf@fomichev.me>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> ---
> 
> v2:
> - Retain token check to prevent allocation of too much memory.
> - Exit early instead of pre-checking in a loop so that we don't penalize
>   well behaved applications (sdf)
> 
> ---
>  net/core/sock.c | 42 ++++++++++++++++++++++++------------------
>  1 file changed, 24 insertions(+), 18 deletions(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 039be95c40cf..da50df485090 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1052,32 +1052,34 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
>  
>  #ifdef CONFIG_PAGE_POOL
>  
> -/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in
> - * 1 syscall. The limit exists to limit the amount of memory the kernel
> - * allocates to copy these tokens.
> +/* This is the number of tokens and frags that the user can SO_DEVMEM_DONTNEED
> + * in 1 syscall. The limit exists to limit the amount of memory the kernel
> + * allocates to copy these tokens, and to prevent looping over the frags for
> + * too long.
>   */
>  #define MAX_DONTNEED_TOKENS 128
> +#define MAX_DONTNEED_FRAGS 1024
>  
>  static noinline_for_stack int
>  sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
>  {
>  	unsigned int num_tokens, i, j, k, netmem_num = 0;
>  	struct dmabuf_token *tokens;
> +	int ret = 0, num_frags = 0;
>  	netmem_ref netmems[16];
> -	int ret = 0;
>  
>  	if (!sk_is_tcp(sk))
>  		return -EBADF;
>  
> -	if (optlen % sizeof(struct dmabuf_token) ||
> +	if (optlen % sizeof(*tokens) ||
>  	    optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
>  		return -EINVAL;
>  

[..]

> -	tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);

Oh, so we currently allocate optlen*8? This is a sneaky fix :-p

> +	num_tokens = optlen / sizeof(*tokens);
> +	tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
>  	if (!tokens)
>  		return -ENOMEM;
>  
> -	num_tokens = optlen / sizeof(struct dmabuf_token);
>  	if (copy_from_sockptr(tokens, optval, optlen)) {
>  		kvfree(tokens);
>  		return -EFAULT;
> @@ -1086,24 +1088,28 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
>  	xa_lock_bh(&sk->sk_user_frags);
>  	for (i = 0; i < num_tokens; i++) {
>  		for (j = 0; j < tokens[i].token_count; j++) {

[..]

> +			if (++num_frags > MAX_DONTNEED_FRAGS)
> +				goto frag_limit_reached;
> +

nit: maybe reuse existing ret (and rename it to num_frags) instead of
introducing new num_frags? Both variables now seem to track the same
number.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation
  2024-11-07 21:03 ` [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation Mina Almasry
@ 2024-11-08  1:30   ` Stanislav Fomichev
  2024-11-08  1:40     ` Mina Almasry
  0 siblings, 1 reply; 12+ messages in thread
From: Stanislav Fomichev @ 2024-11-08  1:30 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, linux-doc, linux-kernel, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Yi Lai, Stanislav Fomichev

On 11/07, Mina Almasry wrote:
> Document new behavior when the number of frags passed is too big.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> ---
>  Documentation/networking/devmem.rst | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
> index a55bf21f671c..d95363645331 100644
> --- a/Documentation/networking/devmem.rst
> +++ b/Documentation/networking/devmem.rst
> @@ -225,6 +225,15 @@ The user must ensure the tokens are returned to the kernel in a timely manner.
>  Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
>  and will lead to packet drops.
>  
> +The user must pass no more than 128 tokens, with no more than 1024 total frags
> +among the token->token_count across all the tokens. If the user provides more
> +than 1024 frags, the kernel will free up to 1024 frags and return early.
> +
> +The kernel returns the number of actual frags freed. The number of frags freed
> +can be less than the tokens provided by the user in case of:
> +

[..]

> +(a) an internal kernel leak bug.

If you're gonna respin, might be worth mentioning that the dmesg
will contain a warning in case of a leak?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long
  2024-11-08  1:28 ` [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long Stanislav Fomichev
@ 2024-11-08  1:33   ` Mina Almasry
  2024-11-08  2:58     ` Stanislav Fomichev
  0 siblings, 1 reply; 12+ messages in thread
From: Mina Almasry @ 2024-11-08  1:33 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, linux-doc, linux-kernel, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Yi Lai, Stanislav Fomichev

On Thu, Nov 7, 2024 at 5:28 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 11/07, Mina Almasry wrote:
> > Exit early if we're freeing more than 1024 frags, to prevent
> > looping too long.
> >
> > Also minor code cleanups:
> > - Flip checks to reduce indentation.
> > - Use sizeof(*tokens) everywhere for consistentcy.
> >
> > Cc: Yi Lai <yi1.lai@linux.intel.com>
> > Cc: Stanislav Fomichev <sdf@fomichev.me>
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > ---
> >
> > v2:
> > - Retain token check to prevent allocation of too much memory.
> > - Exit early instead of pre-checking in a loop so that we don't penalize
> >   well behaved applications (sdf)
> >
> > ---
> >  net/core/sock.c | 42 ++++++++++++++++++++++++------------------
> >  1 file changed, 24 insertions(+), 18 deletions(-)
> >
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 039be95c40cf..da50df485090 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1052,32 +1052,34 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
> >
> >  #ifdef CONFIG_PAGE_POOL
> >
> > -/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in
> > - * 1 syscall. The limit exists to limit the amount of memory the kernel
> > - * allocates to copy these tokens.
> > +/* This is the number of tokens and frags that the user can SO_DEVMEM_DONTNEED
> > + * in 1 syscall. The limit exists to limit the amount of memory the kernel
> > + * allocates to copy these tokens, and to prevent looping over the frags for
> > + * too long.
> >   */
> >  #define MAX_DONTNEED_TOKENS 128
> > +#define MAX_DONTNEED_FRAGS 1024
> >
> >  static noinline_for_stack int
> >  sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> >  {
> >       unsigned int num_tokens, i, j, k, netmem_num = 0;
> >       struct dmabuf_token *tokens;
> > +     int ret = 0, num_frags = 0;
> >       netmem_ref netmems[16];
> > -     int ret = 0;
> >
> >       if (!sk_is_tcp(sk))
> >               return -EBADF;
> >
> > -     if (optlen % sizeof(struct dmabuf_token) ||
> > +     if (optlen % sizeof(*tokens) ||
> >           optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
> >               return -EINVAL;
> >
>
> [..]
>
> > -     tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
>
> Oh, so we currently allocate optlen*8? This is a sneaky fix :-p
>
> > +     num_tokens = optlen / sizeof(*tokens);
> > +     tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
> >       if (!tokens)
> >               return -ENOMEM;
> >
> > -     num_tokens = optlen / sizeof(struct dmabuf_token);
> >       if (copy_from_sockptr(tokens, optval, optlen)) {
> >               kvfree(tokens);
> >               return -EFAULT;
> > @@ -1086,24 +1088,28 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> >       xa_lock_bh(&sk->sk_user_frags);
> >       for (i = 0; i < num_tokens; i++) {
> >               for (j = 0; j < tokens[i].token_count; j++) {
>
> [..]
>
> > +                     if (++num_frags > MAX_DONTNEED_FRAGS)
> > +                             goto frag_limit_reached;
> > +
>
> nit: maybe reuse existing ret (and rename it to num_frags) instead of
> introducing new num_frags? Both variables now seem to track the same
> number.

I almost sent it this way, but I think that would be wrong.

num_frags is all the frags inspected.
ret is all the frags freed.

The difference is subtle but critical. We want to exit when we've
inspected 1024 frags, not when we've freed 1024 frags, because the
user may make us loop forever if they ask us to free 10000000 frags of
which none exist.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation
  2024-11-08  1:30   ` Stanislav Fomichev
@ 2024-11-08  1:40     ` Mina Almasry
  2024-11-08  3:01       ` Stanislav Fomichev
  0 siblings, 1 reply; 12+ messages in thread
From: Mina Almasry @ 2024-11-08  1:40 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, linux-doc, linux-kernel, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Yi Lai, Stanislav Fomichev

On Thu, Nov 7, 2024 at 5:30 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 11/07, Mina Almasry wrote:
> > Document new behavior when the number of frags passed is too big.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > ---
> >  Documentation/networking/devmem.rst | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
> > index a55bf21f671c..d95363645331 100644
> > --- a/Documentation/networking/devmem.rst
> > +++ b/Documentation/networking/devmem.rst
> > @@ -225,6 +225,15 @@ The user must ensure the tokens are returned to the kernel in a timely manner.
> >  Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
> >  and will lead to packet drops.
> >
> > +The user must pass no more than 128 tokens, with no more than 1024 total frags
> > +among the token->token_count across all the tokens. If the user provides more
> > +than 1024 frags, the kernel will free up to 1024 frags and return early.
> > +
> > +The kernel returns the number of actual frags freed. The number of frags freed
> > +can be less than the tokens provided by the user in case of:
> > +
>
> [..]
>
> > +(a) an internal kernel leak bug.
>
> If you're gonna respin, might be worth mentioning that the dmesg
> will contain a warning in case of a leak?

We will not actually warn in the likely cases of leak.

We warn when we find an entry in the xarray that is not a net_iov, or
if napi_pp_put_page fails on that net_iov. Both are very unlikely to
happen honestly.

The likely 'leaks' are when we don't find the frag_id in the xarray.
We do not warn on that because the user can intentionally trigger the
warning with invalid input. If the user is actually giving valid input
and the warn still happens, likely a kernel bug like I mentioned in
another thread, but we still don't warn.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long
  2024-11-08  1:33   ` Mina Almasry
@ 2024-11-08  2:58     ` Stanislav Fomichev
  0 siblings, 0 replies; 12+ messages in thread
From: Stanislav Fomichev @ 2024-11-08  2:58 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, linux-doc, linux-kernel, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Yi Lai, Stanislav Fomichev

On 11/07, Mina Almasry wrote:
> On Thu, Nov 7, 2024 at 5:28 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 11/07, Mina Almasry wrote:
> > > Exit early if we're freeing more than 1024 frags, to prevent
> > > looping too long.
> > >
> > > Also minor code cleanups:
> > > - Flip checks to reduce indentation.
> > > - Use sizeof(*tokens) everywhere for consistentcy.
> > >
> > > Cc: Yi Lai <yi1.lai@linux.intel.com>
> > > Cc: Stanislav Fomichev <sdf@fomichev.me>
> > > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > >
> > > ---
> > >
> > > v2:
> > > - Retain token check to prevent allocation of too much memory.
> > > - Exit early instead of pre-checking in a loop so that we don't penalize
> > >   well behaved applications (sdf)
> > >
> > > ---
> > >  net/core/sock.c | 42 ++++++++++++++++++++++++------------------
> > >  1 file changed, 24 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index 039be95c40cf..da50df485090 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -1052,32 +1052,34 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
> > >
> > >  #ifdef CONFIG_PAGE_POOL
> > >
> > > -/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in
> > > - * 1 syscall. The limit exists to limit the amount of memory the kernel
> > > - * allocates to copy these tokens.
> > > +/* This is the number of tokens and frags that the user can SO_DEVMEM_DONTNEED
> > > + * in 1 syscall. The limit exists to limit the amount of memory the kernel
> > > + * allocates to copy these tokens, and to prevent looping over the frags for
> > > + * too long.
> > >   */
> > >  #define MAX_DONTNEED_TOKENS 128
> > > +#define MAX_DONTNEED_FRAGS 1024
> > >
> > >  static noinline_for_stack int
> > >  sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > >  {
> > >       unsigned int num_tokens, i, j, k, netmem_num = 0;
> > >       struct dmabuf_token *tokens;
> > > +     int ret = 0, num_frags = 0;
> > >       netmem_ref netmems[16];
> > > -     int ret = 0;
> > >
> > >       if (!sk_is_tcp(sk))
> > >               return -EBADF;
> > >
> > > -     if (optlen % sizeof(struct dmabuf_token) ||
> > > +     if (optlen % sizeof(*tokens) ||
> > >           optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
> > >               return -EINVAL;
> > >
> >
> > [..]
> >
> > > -     tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
> >
> > Oh, so we currently allocate optlen*8? This is a sneaky fix :-p
> >
> > > +     num_tokens = optlen / sizeof(*tokens);
> > > +     tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
> > >       if (!tokens)
> > >               return -ENOMEM;
> > >
> > > -     num_tokens = optlen / sizeof(struct dmabuf_token);
> > >       if (copy_from_sockptr(tokens, optval, optlen)) {
> > >               kvfree(tokens);
> > >               return -EFAULT;
> > > @@ -1086,24 +1088,28 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > >       xa_lock_bh(&sk->sk_user_frags);
> > >       for (i = 0; i < num_tokens; i++) {
> > >               for (j = 0; j < tokens[i].token_count; j++) {
> >
> > [..]
> >
> > > +                     if (++num_frags > MAX_DONTNEED_FRAGS)
> > > +                             goto frag_limit_reached;
> > > +
> >
> > nit: maybe reuse existing ret (and rename it to num_frags) instead of
> > introducing new num_frags? Both variables now seem to track the same
> > number.
> 
> I almost sent it this way, but I think that would be wrong.
> 
> num_frags is all the frags inspected.
> ret is all the frags freed.
> 
> The difference is subtle but critical. We want to exit when we've
> inspected 1024 frags, not when we've freed 1024 frags, because the
> user may make us loop forever if they ask us to free 10000000 frags of
> which none exist.

I see. Maybe can mitigate the damage with the following:

for (i = 0; i < min(num_tokens, MAX_DONTNEED_FRAGS); i++)
	for (j = 0; j < min(tokens[i].token_count, MAX_DONTNEED_FRAGS); j++)

In this case, worst case, we loop 1024*1024 on the invalid input :-D
But up to you, separate num_frag works as well (but, as you've seen with
my initial reply, it's not super straightforward).

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation
  2024-11-08  1:40     ` Mina Almasry
@ 2024-11-08  3:01       ` Stanislav Fomichev
  2024-11-08 16:30         ` Mina Almasry
  0 siblings, 1 reply; 12+ messages in thread
From: Stanislav Fomichev @ 2024-11-08  3:01 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, linux-doc, linux-kernel, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Yi Lai, Stanislav Fomichev

On 11/07, Mina Almasry wrote:
> On Thu, Nov 7, 2024 at 5:30 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 11/07, Mina Almasry wrote:
> > > Document new behavior when the number of frags passed is too big.
> > >
> > > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > ---
> > >  Documentation/networking/devmem.rst | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
> > > index a55bf21f671c..d95363645331 100644
> > > --- a/Documentation/networking/devmem.rst
> > > +++ b/Documentation/networking/devmem.rst
> > > @@ -225,6 +225,15 @@ The user must ensure the tokens are returned to the kernel in a timely manner.
> > >  Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
> > >  and will lead to packet drops.
> > >
> > > +The user must pass no more than 128 tokens, with no more than 1024 total frags
> > > +among the token->token_count across all the tokens. If the user provides more
> > > +than 1024 frags, the kernel will free up to 1024 frags and return early.
> > > +
> > > +The kernel returns the number of actual frags freed. The number of frags freed
> > > +can be less than the tokens provided by the user in case of:
> > > +
> >
> > [..]
> >
> > > +(a) an internal kernel leak bug.
> >
> > If you're gonna respin, might be worth mentioning that the dmesg
> > will contain a warning in case of a leak?
> 
> We will not actually warn in the likely cases of leak.
> 
> We warn when we find an entry in the xarray that is not a net_iov, or
> if napi_pp_put_page fails on that net_iov. Both are very unlikely to
> happen honestly.
> 
> The likely 'leaks' are when we don't find the frag_id in the xarray.
> We do not warn on that because the user can intentionally trigger the
> warning with invalid input. If the user is actually giving valid input
> and the warn still happens, likely a kernel bug like I mentioned in
> another thread, but we still don't warn.

In this case, maybe don't mention the leaks at all? If it's not
actionable, not sure how it helps?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation
  2024-11-08  3:01       ` Stanislav Fomichev
@ 2024-11-08 16:30         ` Mina Almasry
  2024-11-08 18:07           ` Stanislav Fomichev
  0 siblings, 1 reply; 12+ messages in thread
From: Mina Almasry @ 2024-11-08 16:30 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, linux-doc, linux-kernel, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Yi Lai, Stanislav Fomichev

On Thu, Nov 7, 2024 at 7:01 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 11/07, Mina Almasry wrote:
> > On Thu, Nov 7, 2024 at 5:30 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 11/07, Mina Almasry wrote:
> > > > Document new behavior when the number of frags passed is too big.
> > > >
> > > > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > ---
> > > >  Documentation/networking/devmem.rst | 9 +++++++++
> > > >  1 file changed, 9 insertions(+)
> > > >
> > > > diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
> > > > index a55bf21f671c..d95363645331 100644
> > > > --- a/Documentation/networking/devmem.rst
> > > > +++ b/Documentation/networking/devmem.rst
> > > > @@ -225,6 +225,15 @@ The user must ensure the tokens are returned to the kernel in a timely manner.
> > > >  Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
> > > >  and will lead to packet drops.
> > > >
> > > > +The user must pass no more than 128 tokens, with no more than 1024 total frags
> > > > +among the token->token_count across all the tokens. If the user provides more
> > > > +than 1024 frags, the kernel will free up to 1024 frags and return early.
> > > > +
> > > > +The kernel returns the number of actual frags freed. The number of frags freed
> > > > +can be less than the tokens provided by the user in case of:
> > > > +
> > >
> > > [..]
> > >
> > > > +(a) an internal kernel leak bug.
> > >
> > > If you're gonna respin, might be worth mentioning that the dmesg
> > > will contain a warning in case of a leak?
> >
> > We will not actually warn in the likely cases of leak.
> >
> > We warn when we find an entry in the xarray that is not a net_iov, or
> > if napi_pp_put_page fails on that net_iov. Both are very unlikely to
> > happen honestly.
> >
> > The likely 'leaks' are when we don't find the frag_id in the xarray.
> > We do not warn on that because the user can intentionally trigger the
> > warning with invalid input. If the user is actually giving valid input
> > and the warn still happens, likely a kernel bug like I mentioned in
> > another thread, but we still don't warn.
>
> In this case, maybe don't mention the leaks at all? If it's not
> actionable, not sure how it helps?

It's good to explain what the return code of the setsockopt means, and
when it would be less than the number of passed in tokens.

Also it's not really 'not actionable'. I expect serious users of
devmem tcp to log such leaks in metrics and try to root cause the
userspace or kernel bug causing them if they happen.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation
  2024-11-08 16:30         ` Mina Almasry
@ 2024-11-08 18:07           ` Stanislav Fomichev
  2024-11-08 18:45             ` Mina Almasry
  0 siblings, 1 reply; 12+ messages in thread
From: Stanislav Fomichev @ 2024-11-08 18:07 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, linux-doc, linux-kernel, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Yi Lai, Stanislav Fomichev

On 11/08, Mina Almasry wrote:
> On Thu, Nov 7, 2024 at 7:01 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 11/07, Mina Almasry wrote:
> > > On Thu, Nov 7, 2024 at 5:30 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > > >
> > > > On 11/07, Mina Almasry wrote:
> > > > > Document new behavior when the number of frags passed is too big.
> > > > >
> > > > > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > > ---
> > > > >  Documentation/networking/devmem.rst | 9 +++++++++
> > > > >  1 file changed, 9 insertions(+)
> > > > >
> > > > > diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
> > > > > index a55bf21f671c..d95363645331 100644
> > > > > --- a/Documentation/networking/devmem.rst
> > > > > +++ b/Documentation/networking/devmem.rst
> > > > > @@ -225,6 +225,15 @@ The user must ensure the tokens are returned to the kernel in a timely manner.
> > > > >  Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
> > > > >  and will lead to packet drops.
> > > > >
> > > > > +The user must pass no more than 128 tokens, with no more than 1024 total frags
> > > > > +among the token->token_count across all the tokens. If the user provides more
> > > > > +than 1024 frags, the kernel will free up to 1024 frags and return early.
> > > > > +
> > > > > +The kernel returns the number of actual frags freed. The number of frags freed
> > > > > +can be less than the tokens provided by the user in case of:
> > > > > +
> > > >
> > > > [..]
> > > >
> > > > > +(a) an internal kernel leak bug.
> > > >
> > > > If you're gonna respin, might be worth mentioning that the dmesg
> > > > will contain a warning in case of a leak?
> > >
> > > We will not actually warn in the likely cases of leak.
> > >
> > > We warn when we find an entry in the xarray that is not a net_iov, or
> > > if napi_pp_put_page fails on that net_iov. Both are very unlikely to
> > > happen honestly.
> > >
> > > The likely 'leaks' are when we don't find the frag_id in the xarray.
> > > We do not warn on that because the user can intentionally trigger the
> > > warning with invalid input. If the user is actually giving valid input
> > > and the warn still happens, likely a kernel bug like I mentioned in
> > > another thread, but we still don't warn.
> >
> > In this case, maybe don't mention the leaks at all? If it's not
> > actionable, not sure how it helps?
> 
> It's good to explain what the return code of the setsockopt means, and
> when it would be less than the number of passed in tokens.
> 
> Also it's not really 'not actionable'. I expect serious users of
> devmem tcp to log such leaks in metrics and try to root cause the
> userspace or kernel bug causing them if they happen.

Right now it reads like both (a) and (b) have a similar probability. Maybe
even (a) is more probable because you mention it first? In theory, any syscall
can have a bug in it where it returns something bogus, so maybe at least
downplay the 'leak' part a bit? "In the extremely rare cases, kernel
might free less frags than requested .... "

Imagine a situation where the user inadvertently tries to free the same token
twice or something and gets the unexpected return value. Why? Might be
the kernel leak, right?

From the POW of the kernel, the most probable cases where we return
less tokens are:
1. user gave us more than 1024
2. user gave us incorrect tokens
...
99. kernel is full of bugs and we lost the frag

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation
  2024-11-08 18:07           ` Stanislav Fomichev
@ 2024-11-08 18:45             ` Mina Almasry
  0 siblings, 0 replies; 12+ messages in thread
From: Mina Almasry @ 2024-11-08 18:45 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, Jakub Kicinski, Eric Dumazet, Willem de Bruijn,
	David S. Miller, linux-doc, linux-kernel, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Yi Lai, Stanislav Fomichev

On Fri, Nov 8, 2024 at 10:07 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 11/08, Mina Almasry wrote:
> > On Thu, Nov 7, 2024 at 7:01 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 11/07, Mina Almasry wrote:
> > > > On Thu, Nov 7, 2024 at 5:30 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > > > >
> > > > > On 11/07, Mina Almasry wrote:
> > > > > > Document new behavior when the number of frags passed is too big.
> > > > > >
> > > > > > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > > > ---
> > > > > >  Documentation/networking/devmem.rst | 9 +++++++++
> > > > > >  1 file changed, 9 insertions(+)
> > > > > >
> > > > > > diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
> > > > > > index a55bf21f671c..d95363645331 100644
> > > > > > --- a/Documentation/networking/devmem.rst
> > > > > > +++ b/Documentation/networking/devmem.rst
> > > > > > @@ -225,6 +225,15 @@ The user must ensure the tokens are returned to the kernel in a timely manner.
> > > > > >  Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
> > > > > >  and will lead to packet drops.
> > > > > >
> > > > > > +The user must pass no more than 128 tokens, with no more than 1024 total frags
> > > > > > +among the token->token_count across all the tokens. If the user provides more
> > > > > > +than 1024 frags, the kernel will free up to 1024 frags and return early.
> > > > > > +
> > > > > > +The kernel returns the number of actual frags freed. The number of frags freed
> > > > > > +can be less than the tokens provided by the user in case of:
> > > > > > +
> > > > >
> > > > > [..]
> > > > >
> > > > > > +(a) an internal kernel leak bug.
> > > > >
> > > > > If you're gonna respin, might be worth mentioning that the dmesg
> > > > > will contain a warning in case of a leak?
> > > >
> > > > We will not actually warn in the likely cases of leak.
> > > >
> > > > We warn when we find an entry in the xarray that is not a net_iov, or
> > > > if napi_pp_put_page fails on that net_iov. Both are very unlikely to
> > > > happen honestly.
> > > >
> > > > The likely 'leaks' are when we don't find the frag_id in the xarray.
> > > > We do not warn on that because the user can intentionally trigger the
> > > > warning with invalid input. If the user is actually giving valid input
> > > > and the warn still happens, likely a kernel bug like I mentioned in
> > > > another thread, but we still don't warn.
> > >
> > > In this case, maybe don't mention the leaks at all? If it's not
> > > actionable, not sure how it helps?
> >
> > It's good to explain what the return code of the setsockopt means, and
> > when it would be less than the number of passed in tokens.
> >
> > Also it's not really 'not actionable'. I expect serious users of
> > devmem tcp to log such leaks in metrics and try to root cause the
> > userspace or kernel bug causing them if they happen.
>
> Right now it reads like both (a) and (b) have a similar probability. Maybe
> even (a) is more probable because you mention it first? In theory, any syscall
> can have a bug in it where it returns something bogus, so maybe at least
> downplay the 'leak' part a bit? "In the extremely rare cases, kernel
> might free less frags than requested .... "
>
> Imagine a situation where the user inadvertently tries to free the same token
> twice or something and gets the unexpected return value. Why? Might be
> the kernel leak, right?
>
> From the POW of the kernel, the most probable cases where we return
> less tokens are:
> 1. user gave us more than 1024
> 2. user gave us incorrect tokens
> ...
> 99. kernel is full of bugs and we lost the frag

The current wording doesn't make any comment about probability. More
information is better than less. I don't see a strong reason to omit
information. I think the docs are better now and will be improved
further in the future. Lets not bike shed too much on docs wording.
It's painfully obvious invalid input is more likely not subtle kernel
bugs IMO without calling out.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long
  2024-11-07 21:03 [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long Mina Almasry
  2024-11-07 21:03 ` [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation Mina Almasry
  2024-11-08  1:28 ` [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long Stanislav Fomichev
@ 2024-11-12  2:40 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 12+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-11-12  2:40 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, kuba, edumazet, willemb, davem, linux-doc, linux-kernel,
	pabeni, horms, corbet, yi1.lai, sdf

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu,  7 Nov 2024 21:03:30 +0000 you wrote:
> Exit early if we're freeing more than 1024 frags, to prevent
> looping too long.
> 
> Also minor code cleanups:
> - Flip checks to reduce indentation.
> - Use sizeof(*tokens) everywhere for consistentcy.
> 
> [...]

Here is the summary with links:
  - [net,v2,1/2] net: fix SO_DEVMEM_DONTNEED looping too long
    https://git.kernel.org/netdev/net/c/f2685c00c322
  - [net,v2,2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation
    https://git.kernel.org/netdev/net/c/102d1404c385

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-11-12  2:40 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-07 21:03 [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long Mina Almasry
2024-11-07 21:03 ` [PATCH net v2 2/2] net: clarify SO_DEVMEM_DONTNEED behavior in documentation Mina Almasry
2024-11-08  1:30   ` Stanislav Fomichev
2024-11-08  1:40     ` Mina Almasry
2024-11-08  3:01       ` Stanislav Fomichev
2024-11-08 16:30         ` Mina Almasry
2024-11-08 18:07           ` Stanislav Fomichev
2024-11-08 18:45             ` Mina Almasry
2024-11-08  1:28 ` [PATCH net v2 1/2] net: fix SO_DEVMEM_DONTNEED looping too long Stanislav Fomichev
2024-11-08  1:33   ` Mina Almasry
2024-11-08  2:58     ` Stanislav Fomichev
2024-11-12  2:40 ` patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).