* [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
@ 2023-04-18 18:08 Kuniyuki Iwashima
2023-04-18 18:33 ` Eric Dumazet
0 siblings, 1 reply; 14+ messages in thread
From: Kuniyuki Iwashima @ 2023-04-18 18:08 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Willem de Bruijn, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev,
syzbot
syzkaller reported [0] memory leaks of a UDP socket and ZEROCOPY
skbs. We can reproduce the problem with the following sequence:
sk = socket(AF_INET, SOCK_DGRAM, 0)
sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
sk.close()
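A runnable version of this reproducer might look like the sketch below (Linux-only; the socket-option and flag values are assumed numerically for the generic Linux ABI, since older Python socket modules may not export these constants):

```python
import socket

# Fall back to generic Linux ABI values when the constants are missing.
SO_TIMESTAMPING = getattr(socket, "SO_TIMESTAMPING", 37)
SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1
SO_ZEROCOPY = getattr(socket, "SO_ZEROCOPY", 60)
MSG_ZEROCOPY = getattr(socket, "MSG_ZEROCOPY", 0x4000000)

def reproduce():
    """Run the leak-triggering sequence from the changelog.

    Returns True if every step succeeded (i.e. the kernel accepted the
    zerocopy send), False if any step was rejected (e.g. pre-4.14 kernel
    without SO_ZEROCOPY, or a non-Linux host).
    """
    try:
        sk = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, 0)
    except OSError:
        return False
    try:
        sk.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING,
                      SOF_TIMESTAMPING_TX_SOFTWARE)
        sk.setsockopt(socket.SOL_SOCKET, SO_ZEROCOPY, 1)
        sk.sendto(b"", MSG_ZEROCOPY, ("127.0.0.1", 53))
    except OSError:
        return False
    finally:
        sk.close()
    return True
```

On an affected kernel, each call of this sequence leaks one sk and its zerocopy skb until the fix below is applied.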
sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
skb is sent, __skb_tstamp_tx() clones it and puts the clone into
the socket's error queue with the TX timestamp.
When the original skb is received locally, skb_copy_ubufs() calls
skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
This additional count is decremented while freeing the skb, but struct
ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
not called.
The last refcnt is not released unless we retrieve the TX timestamped
skb via recvmsg(). When we close() a socket holding such a skb, we
never call sock_put() and leak the count, the skb, and the sk.
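The refcount flow described above can be modeled with a small Python sketch (an illustrative model, not kernel code; the class and method names are invented stand-ins for sk_refcnt and struct ubuf_info_msgzc):

```python
class Sock:
    def __init__(self):
        self.refcnt = 1          # sk_alloc()
        self.error_queue = []

    def hold(self):
        self.refcnt += 1         # sock_hold()

    def put(self):
        self.refcnt -= 1         # sock_put()

class Ubuf:
    """Stands in for struct ubuf_info_msgzc."""
    def __init__(self, sk):
        self.sk = sk
        self.refcnt = 1          # msg_zerocopy_alloc()
        sk.hold()                # the ubuf pins the socket

    def get(self):
        self.refcnt += 1

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            # __msg_zerocopy_callback() -> sock_put()
            self.sk.put()

sk = Sock()
ubuf = Ubuf(sk)                          # sendmsg(MSG_ZEROCOPY)
sk.error_queue.append("tstamp clone")    # __skb_tstamp_tx()
ubuf.get()                               # pskb_expand_head() on local receive
ubuf.put()                               # original skb freed
# Without draining the error queue, the last ubuf refcnt survives,
# so the socket is still pinned:
assert ubuf.refcnt == 1 and sk.refcnt == 2
```

Draining the error queue (recvmsg() before close, or the purge added by this patch) drops the last ubuf refcnt and releases the socket.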
To avoid this problem, we must (i) call skb_queue_purge() after
flagging SOCK_DEAD during close() and (ii) make sure that a TX tstamp
skb is not queued once SOCK_DEAD is flagged. UDP lacks (i) and (ii),
and TCP lacks (ii).
Without (ii), a skb queued in a qdisc or device could be put into
the error queue after skb_queue_purge().
sendmsg() /* return immediately, but packets
* are queued in a qdisc or device
*/
close()
skb_queue_purge()
__skb_tstamp_tx()
__skb_complete_tx_timestamp()
sock_queue_err_skb()
skb_queue_tail()
Also, we need to check SOCK_DEAD under sk->sk_error_queue.lock
in sock_queue_err_skb() to avoid this race.
if (!sock_flag(sk, SOCK_DEAD))
sock_set_flag(sk, SOCK_DEAD)
skb_queue_purge()
skb_queue_tail()
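The locking discipline the v2 patch relies on can be sketched in Python, with threads standing in for the two CPUs in the diagrams above (an illustrative model only; the names are invented, and -22 stands in for -EINVAL):

```python
import threading

class ErrQueue:
    """Models sk_error_queue plus the SOCK_DEAD flag."""
    def __init__(self):
        self.lock = threading.Lock()   # sk_error_queue.lock
        self.skbs = []
        self.dead = False              # SOCK_DEAD

    def queue_err_skb(self, skb):
        # The DEAD check and the enqueue happen under one lock hold,
        # mirroring spin_lock_irqsave() + __skb_queue_tail() in the patch.
        with self.lock:
            if self.dead:
                return -22             # -EINVAL: close() already purged
            self.skbs.append(skb)
        return 0

    def close(self):
        self.dead = True               # sock_set_flag(sk, SOCK_DEAD)
        with self.lock:                # skb_queue_purge() takes the same lock
            self.skbs.clear()

q = ErrQueue()
assert q.queue_err_skb("tstamp clone") == 0
q.close()
assert q.queue_err_skb("late tstamp clone") == -22
assert q.skbs == []                    # nothing can be queued after the purge
```

Because both paths serialize on the same lock, an enqueue either completes before the purge (and is freed by it) or observes SOCK_DEAD and bails, closing the window shown in the second diagram.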
[0]:
BUG: memory leak
unreferenced object 0xffff88800c6d2d00 (size 1152):
comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00 ................
02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace:
[<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024
[<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083
[<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline]
[<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245
[<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515
[<00000000b9b11231>] sock_create net/socket.c:1566 [inline]
[<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline]
[<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline]
[<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636
[<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline]
[<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline]
[<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647
[<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
[<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
BUG: memory leak
unreferenced object 0xffff888017633a00 (size 240):
comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff .........-m.....
backtrace:
[<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497
[<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline]
[<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596
[<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline]
[<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370
[<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037
[<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652
[<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253
[<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819
[<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline]
[<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734
[<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117
[<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline]
[<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline]
[<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125
[<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
[<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Fixes: b5947e5d1e71 ("udp: msg_zerocopy")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
v2:
* Move skb_queue_purge() after setting SOCK_DEAD in udp_destroy_sock()
* Check SOCK_DEAD in sock_queue_err_skb() with sk_error_queue.lock
* Add Fixes tag for TCP
v1: https://lore.kernel.org/netdev/20230417171155.22916-1-kuniyu@amazon.com/
---
net/core/skbuff.c | 15 ++++++++++++---
net/ipv4/udp.c | 5 +++++
2 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4c0879798eb8..287b834df9c8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4979,6 +4979,8 @@ static void skb_set_err_queue(struct sk_buff *skb)
*/
int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
{
+ unsigned long flags;
+
if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
(unsigned int)READ_ONCE(sk->sk_rcvbuf))
return -ENOMEM;
@@ -4992,9 +4994,16 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
/* before exiting rcu section, make sure dst is refcounted */
skb_dst_force(skb);
- skb_queue_tail(&sk->sk_error_queue, skb);
- if (!sock_flag(sk, SOCK_DEAD))
- sk_error_report(sk);
+ spin_lock_irqsave(&sk->sk_error_queue.lock, flags);
+ if (sock_flag(sk, SOCK_DEAD)) {
+ spin_unlock_irqrestore(&sk->sk_error_queue.lock, flags);
+ return -EINVAL;
+ }
+ __skb_queue_tail(&sk->sk_error_queue, skb);
+ spin_unlock_irqrestore(&sk->sk_error_queue.lock, flags);
+
+ sk_error_report(sk);
+
return 0;
}
EXPORT_SYMBOL(sock_queue_err_skb);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c605d171eb2d..7060a5cda711 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2674,6 +2674,11 @@ void udp_destroy_sock(struct sock *sk)
if (up->encap_enabled)
static_branch_dec(&udp_encap_needed_key);
}
+
+ /* A zerocopy skb has a refcnt of sk and may be
+ * put into sk_error_queue with TX timestamp
+ */
+ skb_queue_purge(&sk->sk_error_queue);
}
/*
--
2.30.2
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-18 18:08 [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp Kuniyuki Iwashima
@ 2023-04-18 18:33 ` Eric Dumazet
2023-04-18 18:44 ` Kuniyuki Iwashima
0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2023-04-18 18:33 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Willem de Bruijn,
Kuniyuki Iwashima, netdev, syzbot
On Tue, Apr 18, 2023 at 8:09 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
> skbs. We can reproduce the problem with these sequences:
>
> sk = socket(AF_INET, SOCK_DGRAM, 0)
> sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
> sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
> sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
> sk.close()
>
> sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
> skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
> ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
> skb is sent, __skb_tstamp_tx() clones it and puts the clone into
> the socket's error queue with the TX timestamp.
>
> When the original skb is received locally, skb_copy_ubufs() calls
> skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
> This additional count is decremented while freeing the skb, but struct
> ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
> not called.
>
> The last refcnt is not released unless we retrieve the TX timestamped
> skb by recvmsg(). When we close() the socket holding such skb, we
> never call sock_put() and leak the count, skb, and sk.
>
> To avoid this problem, we must (i) call skb_queue_purge() after
> flagging SOCK_DEAD during close() and (ii) make sure that TX tstamp
> skb is not queued when SOCK_DEAD is flagged. UDP lacks (i) and (ii),
> and TCP lacks (ii).
>
> Without (ii), a skb queued in a qdisc or device could be put into
> the error queue after skb_queue_purge().
>
> sendmsg() /* return immediately, but packets
> * are queued in a qdisc or device
> */
> close()
> skb_queue_purge()
> __skb_tstamp_tx()
> __skb_complete_tx_timestamp()
> sock_queue_err_skb()
> skb_queue_tail()
>
> Also, we need to check SOCK_DEAD under sk->sk_error_queue.lock
> in sock_queue_err_skb() to avoid this race.
>
> if (!sock_flag(sk, SOCK_DEAD))
> sock_set_flag(sk, SOCK_DEAD)
> skb_queue_purge()
>
> skb_queue_tail()
>
> [0]:
> Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
> Fixes: b5947e5d1e71 ("udp: msg_zerocopy")
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
> v2:
> * Move skb_queue_purge() after setting SOCK_DEAD in udp_destroy_sock()
> * Check SOCK_DEAD in sock_queue_err_skb() with sk_error_queue.lock
> * Add Fixes tag for TCP
>
> v1: https://lore.kernel.org/netdev/20230417171155.22916-1-kuniyu@amazon.com/
> ---
> net/core/skbuff.c | 15 ++++++++++++---
> net/ipv4/udp.c | 5 +++++
> 2 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 4c0879798eb8..287b834df9c8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4979,6 +4979,8 @@ static void skb_set_err_queue(struct sk_buff *skb)
> */
> int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> {
> + unsigned long flags;
> +
> if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
> (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> return -ENOMEM;
> @@ -4992,9 +4994,16 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> /* before exiting rcu section, make sure dst is refcounted */
> skb_dst_force(skb);
>
> - skb_queue_tail(&sk->sk_error_queue, skb);
> - if (!sock_flag(sk, SOCK_DEAD))
> - sk_error_report(sk);
> + spin_lock_irqsave(&sk->sk_error_queue.lock, flags);
> + if (sock_flag(sk, SOCK_DEAD)) {
SOCK_DEAD is set without holding sk_error_queue.lock, so I wonder why you
want to add a confusing construct.
Just bail early?
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ef81452759be3fd251faaf76d89cfd002ee79256..fda05cb44f95821e98f8c5c05fba840a9d276abb
100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4983,6 +4983,9 @@ int sock_queue_err_skb(struct sock *sk, struct
sk_buff *skb)
(unsigned int)READ_ONCE(sk->sk_rcvbuf))
return -ENOMEM;
+ if (sock_flag(sk, SOCK_DEAD))
+ return -EINVAL;
+
skb_orphan(skb);
skb->sk = sk;
skb->destructor = sock_rmem_free;
@@ -4993,8 +4996,7 @@ int sock_queue_err_skb(struct sock *sk, struct
sk_buff *skb)
skb_dst_force(skb);
skb_queue_tail(&sk->sk_error_queue, skb);
- if (!sock_flag(sk, SOCK_DEAD))
- sk_error_report(sk);
+ sk_error_report(sk);
return 0;
}
EXPORT_SYMBOL(sock_queue_err_skb);
> + spin_unlock_irqrestore(&sk->sk_error_queue.lock, flags);
> + return -EINVAL;
> + }
> + __skb_queue_tail(&sk->sk_error_queue, skb);
> + spin_unlock_irqrestore(&sk->sk_error_queue.lock, flags);
> +
> + sk_error_report(sk);
> +
> return 0;
> }
> EXPORT_SYMBOL(sock_queue_err_skb);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index c605d171eb2d..7060a5cda711 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2674,6 +2674,11 @@ void udp_destroy_sock(struct sock *sk)
> if (up->encap_enabled)
> static_branch_dec(&udp_encap_needed_key);
> }
> +
> + /* A zerocopy skb has a refcnt of sk and may be
> + * put into sk_error_queue with TX timestamp
> + */
> + skb_queue_purge(&sk->sk_error_queue);
> }
>
> /*
> --
> 2.30.2
>
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-18 18:33 ` Eric Dumazet
@ 2023-04-18 18:44 ` Kuniyuki Iwashima
2023-04-18 19:04 ` Eric Dumazet
0 siblings, 1 reply; 14+ messages in thread
From: Kuniyuki Iwashima @ 2023-04-18 18:44 UTC (permalink / raw)
To: edumazet
Cc: davem, kuba, kuni1840, kuniyu, netdev, pabeni, syzkaller, willemb
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 18 Apr 2023 20:33:44 +0200
> On Tue, Apr 18, 2023 at 8:09 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
> > skbs. We can reproduce the problem with these sequences:
> >
> > sk = socket(AF_INET, SOCK_DGRAM, 0)
> > sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
> > sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
> > sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
> > sk.close()
> >
> > sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
> > skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
> > ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
> > skb is sent, __skb_tstamp_tx() clones it and puts the clone into
> > the socket's error queue with the TX timestamp.
> >
> > When the original skb is received locally, skb_copy_ubufs() calls
> > skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
> > This additional count is decremented while freeing the skb, but struct
> > ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
> > not called.
> >
> > The last refcnt is not released unless we retrieve the TX timestamped
> > skb by recvmsg(). When we close() the socket holding such skb, we
> > never call sock_put() and leak the count, skb, and sk.
> >
> > To avoid this problem, we must (i) call skb_queue_purge() after
> > flagging SOCK_DEAD during close() and (ii) make sure that TX tstamp
> > skb is not queued when SOCK_DEAD is flagged. UDP lacks (i) and (ii),
> > and TCP lacks (ii).
> >
> > Without (ii), a skb queued in a qdisc or device could be put into
> > the error queue after skb_queue_purge().
> >
> > sendmsg() /* return immediately, but packets
> > * are queued in a qdisc or device
> > */
> > close()
> > skb_queue_purge()
> > __skb_tstamp_tx()
> > __skb_complete_tx_timestamp()
> > sock_queue_err_skb()
> > skb_queue_tail()
> >
> > Also, we need to check SOCK_DEAD under sk->sk_error_queue.lock
> > in sock_queue_err_skb() to avoid this race.
> >
> > if (!sock_flag(sk, SOCK_DEAD))
> > sock_set_flag(sk, SOCK_DEAD)
> > skb_queue_purge()
> >
> > skb_queue_tail()
> >
> > [0]:
>
> > Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
> > Fixes: b5947e5d1e71 ("udp: msg_zerocopy")
> > Reported-by: syzbot <syzkaller@googlegroups.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> > v2:
> > * Move skb_queue_purge() after setting SOCK_DEAD in udp_destroy_sock()
> > * Check SOCK_DEAD in sock_queue_err_skb() with sk_error_queue.lock
> > * Add Fixes tag for TCP
> >
> > v1: https://lore.kernel.org/netdev/20230417171155.22916-1-kuniyu@amazon.com/
> > ---
> > net/core/skbuff.c | 15 ++++++++++++---
> > net/ipv4/udp.c | 5 +++++
> > 2 files changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 4c0879798eb8..287b834df9c8 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -4979,6 +4979,8 @@ static void skb_set_err_queue(struct sk_buff *skb)
> > */
> > int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > {
> > + unsigned long flags;
> > +
> > if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
> > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > return -ENOMEM;
> > @@ -4992,9 +4994,16 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > /* before exiting rcu section, make sure dst is refcounted */
> > skb_dst_force(skb);
> >
> > - skb_queue_tail(&sk->sk_error_queue, skb);
> > - if (!sock_flag(sk, SOCK_DEAD))
> > - sk_error_report(sk);
> > + spin_lock_irqsave(&sk->sk_error_queue.lock, flags);
> > + if (sock_flag(sk, SOCK_DEAD)) {
>
> SOCK_DEAD is set without holding sk_error_queue.lock, so I wonder why you
> want to add a confusing construct.
>
> Just bail early ?
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index ef81452759be3fd251faaf76d89cfd002ee79256..fda05cb44f95821e98f8c5c05fba840a9d276abb
> 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4983,6 +4983,9 @@ int sock_queue_err_skb(struct sock *sk, struct
> sk_buff *skb)
> (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> return -ENOMEM;
>
> + if (sock_flag(sk, SOCK_DEAD))
> + return -EINVAL;
> +
Isn't it possible that this sequence happens
close()
sock_set_flag(sk, SOCK_DEAD);
skb_queue_purge(&sk->sk_error_queue)
before the skb_queue_tail() below? (the 2nd race mentioned in the changelog)
I thought we could guarantee the ordering by taking the same lock.
> skb_orphan(skb);
> skb->sk = sk;
> skb->destructor = sock_rmem_free;
> @@ -4993,8 +4996,7 @@ int sock_queue_err_skb(struct sock *sk, struct
> sk_buff *skb)
> skb_dst_force(skb);
>
> skb_queue_tail(&sk->sk_error_queue, skb);
> - if (!sock_flag(sk, SOCK_DEAD))
> - sk_error_report(sk);
> + sk_error_report(sk);
> return 0;
> }
> EXPORT_SYMBOL(sock_queue_err_skb);
>
>
> > + spin_unlock_irqrestore(&sk->sk_error_queue.lock, flags);
> > + return -EINVAL;
> > + }
> > + __skb_queue_tail(&sk->sk_error_queue, skb);
> > + spin_unlock_irqrestore(&sk->sk_error_queue.lock, flags);
> > +
> > + sk_error_report(sk);
> > +
> > return 0;
> > }
> > EXPORT_SYMBOL(sock_queue_err_skb);
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index c605d171eb2d..7060a5cda711 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -2674,6 +2674,11 @@ void udp_destroy_sock(struct sock *sk)
> > if (up->encap_enabled)
> > static_branch_dec(&udp_encap_needed_key);
> > }
> > +
> > + /* A zerocopy skb has a refcnt of sk and may be
> > + * put into sk_error_queue with TX timestamp
> > + */
> > + skb_queue_purge(&sk->sk_error_queue);
> > }
> >
> > /*
> > --
> > 2.30.2
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-18 18:44 ` Kuniyuki Iwashima
@ 2023-04-18 19:04 ` Eric Dumazet
2023-04-18 19:25 ` Kuniyuki Iwashima
0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2023-04-18 19:04 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: davem, kuba, kuni1840, netdev, pabeni, syzkaller, willemb
On Tue, Apr 18, 2023 at 8:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Tue, 18 Apr 2023 20:33:44 +0200
> > On Tue, Apr 18, 2023 at 8:09 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
> > > skbs. We can reproduce the problem with these sequences:
> > >
> > > sk = socket(AF_INET, SOCK_DGRAM, 0)
> > > sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
> > > sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
> > > sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
> > > sk.close()
> > >
> > > sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
> > > skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
> > > ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
> > > skb is sent, __skb_tstamp_tx() clones it and puts the clone into
> > > the socket's error queue with the TX timestamp.
> > >
> > > When the original skb is received locally, skb_copy_ubufs() calls
> > > skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
> > > This additional count is decremented while freeing the skb, but struct
> > > ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
> > > not called.
> > >
> > > The last refcnt is not released unless we retrieve the TX timestamped
> > > skb by recvmsg(). When we close() the socket holding such skb, we
> > > never call sock_put() and leak the count, skb, and sk.
> > >
> > > To avoid this problem, we must (i) call skb_queue_purge() after
> > > flagging SOCK_DEAD during close() and (ii) make sure that TX tstamp
> > > skb is not queued when SOCK_DEAD is flagged. UDP lacks (i) and (ii),
> > > and TCP lacks (ii).
> > >
> > > Without (ii), a skb queued in a qdisc or device could be put into
> > > the error queue after skb_queue_purge().
> > >
> > > sendmsg() /* return immediately, but packets
> > > * are queued in a qdisc or device
> > > */
> > > close()
> > > skb_queue_purge()
> > > __skb_tstamp_tx()
> > > __skb_complete_tx_timestamp()
> > > sock_queue_err_skb()
> > > skb_queue_tail()
> > >
> > > Also, we need to check SOCK_DEAD under sk->sk_error_queue.lock
> > > in sock_queue_err_skb() to avoid this race.
> > >
> > > if (!sock_flag(sk, SOCK_DEAD))
> > > sock_set_flag(sk, SOCK_DEAD)
> > > skb_queue_purge()
> > >
> > > skb_queue_tail()
> > >
> > > [0]:
> >
> > > Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
> > > Fixes: b5947e5d1e71 ("udp: msg_zerocopy")
> > > Reported-by: syzbot <syzkaller@googlegroups.com>
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > ---
> > > v2:
> > > * Move skb_queue_purge() after setting SOCK_DEAD in udp_destroy_sock()
> > > * Check SOCK_DEAD in sock_queue_err_skb() with sk_error_queue.lock
> > > * Add Fixes tag for TCP
> > >
> > > v1: https://lore.kernel.org/netdev/20230417171155.22916-1-kuniyu@amazon.com/
> > > ---
> > > net/core/skbuff.c | 15 ++++++++++++---
> > > net/ipv4/udp.c | 5 +++++
> > > 2 files changed, 17 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 4c0879798eb8..287b834df9c8 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -4979,6 +4979,8 @@ static void skb_set_err_queue(struct sk_buff *skb)
> > > */
> > > int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > {
> > > + unsigned long flags;
> > > +
> > > if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
> > > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > return -ENOMEM;
> > > @@ -4992,9 +4994,16 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > /* before exiting rcu section, make sure dst is refcounted */
> > > skb_dst_force(skb);
> > >
> > > - skb_queue_tail(&sk->sk_error_queue, skb);
> > > - if (!sock_flag(sk, SOCK_DEAD))
> > > - sk_error_report(sk);
> > > + spin_lock_irqsave(&sk->sk_error_queue.lock, flags);
> > > + if (sock_flag(sk, SOCK_DEAD)) {
> >
> > SOCK_DEAD is set without holding sk_error_queue.lock, so I wonder why you
> > want to add a confusing construct.
> >
> > Just bail early ?
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index ef81452759be3fd251faaf76d89cfd002ee79256..fda05cb44f95821e98f8c5c05fba840a9d276abb
> > 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -4983,6 +4983,9 @@ int sock_queue_err_skb(struct sock *sk, struct
> > sk_buff *skb)
> > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > return -ENOMEM;
> >
> > + if (sock_flag(sk, SOCK_DEAD))
> > + return -EINVAL;
> > +
>
> Isn't it possible that these sequences happen
>
> close()
> sock_set_flag(sk, SOCK_DEAD);
> skb_queue_purge(&sk->sk_error_queue)
>
> between the skb_queue_tail() below ? (2nd race mentioned in changelog)
>
> I thought we can guarantee the ordering by taking the same lock.
>
This is fragile.
We could very well rewrite skb_queue_purge() to not acquire the lock
in the common case.
I had the following in my tree for a while, to avoid many atomic and
irq masking operations...
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ef81452759be3fd251faaf76d89cfd002ee79256..f570e7c89248e2d9d9ab50c2e86ac7019b837484
100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3606,14 +3606,24 @@ EXPORT_SYMBOL(skb_dequeue_tail);
* skb_queue_purge - empty a list
* @list: list to empty
*
- * Delete all buffers on an &sk_buff list. Each buffer is removed from
- * the list and one reference dropped. This function takes the list
- * lock and is atomic with respect to other list locking functions.
+ * Delete all buffers on an &sk_buff list.
*/
void skb_queue_purge(struct sk_buff_head *list)
{
+ struct sk_buff_head priv;
+ unsigned long flags;
struct sk_buff *skb;
- while ((skb = skb_dequeue(list)) != NULL)
+
+ if (skb_queue_empty_lockless(list))
+ return;
+
+ __skb_queue_head_init(&priv);
+
+ spin_lock_irqsave(&list->lock, flags);
+ skb_queue_splice_init(list, &priv);
+ spin_unlock_irqrestore(&list->lock, flags);
+
+ while ((skb = __skb_dequeue(&priv)) != NULL)
kfree_skb(skb);
}
EXPORT_SYMBOL(skb_queue_purge);
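For illustration, the splice-based purge above can be modeled in Python (a sketch of the pattern, not the kernel code): take the list lock once, move the whole queue onto a private list, and free the entries with the lock dropped.

```python
import threading
from collections import deque

def queue_purge(lock, queue):
    """Drain `queue` with one lock round-trip; returns the number drained."""
    if not queue:                # skb_queue_empty_lockless() fast path
        return 0
    priv = deque()
    with lock:                   # spin_lock_irqsave(&list->lock, flags)
        priv.extend(queue)       # skb_queue_splice_init(list, &priv)
        queue.clear()
    # The kfree_skb() loop runs without the lock held.
    freed = 0
    while priv:
        priv.popleft()           # kfree_skb(skb)
        freed += 1
    return freed

lock = threading.Lock()
q = deque(["skb1", "skb2", "skb3"])
assert queue_purge(lock, q) == 3
assert not q
assert queue_purge(lock, q) == 0     # lockless early return, lock never taken
```

This is exactly why relying on skb_queue_purge() to take the list lock is fragile: with the fast path above, a purge of an already-empty queue never serializes with a concurrent enqueue at all.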
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-18 19:04 ` Eric Dumazet
@ 2023-04-18 19:25 ` Kuniyuki Iwashima
2023-04-19 9:34 ` Eric Dumazet
0 siblings, 1 reply; 14+ messages in thread
From: Kuniyuki Iwashima @ 2023-04-18 19:25 UTC (permalink / raw)
To: edumazet
Cc: davem, kuba, kuni1840, kuniyu, netdev, pabeni, syzkaller, willemb
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 18 Apr 2023 21:04:32 +0200
> On Tue, Apr 18, 2023 at 8:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Tue, 18 Apr 2023 20:33:44 +0200
> > > On Tue, Apr 18, 2023 at 8:09 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
> > > > skbs. We can reproduce the problem with these sequences:
> > > >
> > > > sk = socket(AF_INET, SOCK_DGRAM, 0)
> > > > sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
> > > > sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
> > > > sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
> > > > sk.close()
> > > >
> > > > sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
> > > > skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
> > > > ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
> > > > skb is sent, __skb_tstamp_tx() clones it and puts the clone into
> > > > the socket's error queue with the TX timestamp.
> > > >
> > > > When the original skb is received locally, skb_copy_ubufs() calls
> > > > skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
> > > > This additional count is decremented while freeing the skb, but struct
> > > > ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
> > > > not called.
> > > >
> > > > The last refcnt is not released unless we retrieve the TX timestamped
> > > > skb by recvmsg(). When we close() the socket holding such skb, we
> > > > never call sock_put() and leak the count, skb, and sk.
> > > >
> > > > To avoid this problem, we must (i) call skb_queue_purge() after
> > > > flagging SOCK_DEAD during close() and (ii) make sure that TX tstamp
> > > > skb is not queued when SOCK_DEAD is flagged. UDP lacks (i) and (ii),
> > > > and TCP lacks (ii).
> > > >
> > > > Without (ii), a skb queued in a qdisc or device could be put into
> > > > the error queue after skb_queue_purge().
> > > >
> > > > sendmsg() /* return immediately, but packets
> > > > * are queued in a qdisc or device
> > > > */
> > > > close()
> > > > skb_queue_purge()
> > > > __skb_tstamp_tx()
> > > > __skb_complete_tx_timestamp()
> > > > sock_queue_err_skb()
> > > > skb_queue_tail()
> > > >
> > > > Also, we need to check SOCK_DEAD under sk->sk_error_queue.lock
> > > > in sock_queue_err_skb() to avoid this race.
> > > >
> > > > if (!sock_flag(sk, SOCK_DEAD))
> > > > sock_set_flag(sk, SOCK_DEAD)
> > > > skb_queue_purge()
> > > >
> > > > skb_queue_tail()
> > > >
> > > > [0]:
> > >
> > > > Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
> > > > Fixes: b5947e5d1e71 ("udp: msg_zerocopy")
> > > > Reported-by: syzbot <syzkaller@googlegroups.com>
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > ---
> > > > v2:
> > > > * Move skb_queue_purge() after setting SOCK_DEAD in udp_destroy_sock()
> > > > * Check SOCK_DEAD in sock_queue_err_skb() with sk_error_queue.lock
> > > > * Add Fixes tag for TCP
> > > >
> > > > v1: https://lore.kernel.org/netdev/20230417171155.22916-1-kuniyu@amazon.com/
> > > > ---
> > > > net/core/skbuff.c | 15 ++++++++++++---
> > > > net/ipv4/udp.c | 5 +++++
> > > > 2 files changed, 17 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index 4c0879798eb8..287b834df9c8 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -4979,6 +4979,8 @@ static void skb_set_err_queue(struct sk_buff *skb)
> > > > */
> > > > int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > {
> > > > + unsigned long flags;
> > > > +
> > > > if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
> > > > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > return -ENOMEM;
> > > > @@ -4992,9 +4994,16 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > /* before exiting rcu section, make sure dst is refcounted */
> > > > skb_dst_force(skb);
> > > >
> > > > - skb_queue_tail(&sk->sk_error_queue, skb);
> > > > - if (!sock_flag(sk, SOCK_DEAD))
> > > > - sk_error_report(sk);
> > > > + spin_lock_irqsave(&sk->sk_error_queue.lock, flags);
> > > > + if (sock_flag(sk, SOCK_DEAD)) {
> > >
> > > SOCK_DEAD is set without holding sk_error_queue.lock, so I wonder why you
> > > want to add a confusing construct.
> > >
> > > Just bail early ?
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index ef81452759be3fd251faaf76d89cfd002ee79256..fda05cb44f95821e98f8c5c05fba840a9d276abb
> > > 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -4983,6 +4983,9 @@ int sock_queue_err_skb(struct sock *sk, struct
> > > sk_buff *skb)
> > > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > return -ENOMEM;
> > >
> > > + if (sock_flag(sk, SOCK_DEAD))
> > > + return -EINVAL;
> > > +
> >
> > Isn't it possible that these sequences happen
> >
> > close()
> > sock_set_flag(sk, SOCK_DEAD);
> > skb_queue_purge(&sk->sk_error_queue)
> >
> > between the skb_queue_tail() below ? (2nd race mentioned in changelog)
> >
> > I thought we can guarantee the ordering by taking the same lock.
> >
>
> This is fragile.
Yes, but I didn't have a better idea to avoid the race...
>
> We could very well rewrite skb_queue_purge() to not acquire the lock
> in the common case.
> I had the following in my tree for a while, to avoid many atomic and
> irq masking operations...
Cool, and it still works with my patch, no?
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index ef81452759be3fd251faaf76d89cfd002ee79256..f570e7c89248e2d9d9ab50c2e86ac7019b837484
> 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3606,14 +3606,24 @@ EXPORT_SYMBOL(skb_dequeue_tail);
> * skb_queue_purge - empty a list
> * @list: list to empty
> *
> - * Delete all buffers on an &sk_buff list. Each buffer is removed from
> - * the list and one reference dropped. This function takes the list
> - * lock and is atomic with respect to other list locking functions.
> + * Delete all buffers on an &sk_buff list.
> */
> void skb_queue_purge(struct sk_buff_head *list)
> {
> + struct sk_buff_head priv;
> + unsigned long flags;
> struct sk_buff *skb;
> - while ((skb = skb_dequeue(list)) != NULL)
> +
> + if (skb_queue_empty_lockless(list))
> + return;
> +
> + __skb_queue_head_init(&priv);
> +
> + spin_lock_irqsave(&list->lock, flags);
> + skb_queue_splice_init(list, &priv);
> + spin_unlock_irqrestore(&list->lock, flags);
> +
> + while ((skb = __skb_dequeue(&priv)) != NULL)
> kfree_skb(skb);
> }
> EXPORT_SYMBOL(skb_queue_purge);
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-18 19:25 ` Kuniyuki Iwashima
@ 2023-04-19 9:34 ` Eric Dumazet
2023-04-19 14:16 ` Willem de Bruijn
0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2023-04-19 9:34 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: davem, kuba, kuni1840, netdev, pabeni, syzkaller, willemb
On Tue, Apr 18, 2023 at 9:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Tue, 18 Apr 2023 21:04:32 +0200
> > On Tue, Apr 18, 2023 at 8:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Eric Dumazet <edumazet@google.com>
> > > Date: Tue, 18 Apr 2023 20:33:44 +0200
> > > > On Tue, Apr 18, 2023 at 8:09 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
> > > > > skbs. We can reproduce the problem with these sequences:
> > > > >
> > > > > sk = socket(AF_INET, SOCK_DGRAM, 0)
> > > > > sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
> > > > > sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
> > > > > sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
> > > > > sk.close()
> > > > >
> > > > > sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
> > > > > skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
> > > > > ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
> > > > > skb is sent, __skb_tstamp_tx() clones it and puts the clone into
> > > > > the socket's error queue with the TX timestamp.
> > > > >
> > > > > When the original skb is received locally, skb_copy_ubufs() calls
> > > > > skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
> > > > > This additional count is decremented while freeing the skb, but struct
> > > > > ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
> > > > > not called.
> > > > >
> > > > > The last refcnt is not released unless we retrieve the TX timestamped
> > > > > skb by recvmsg(). When we close() the socket holding such skb, we
> > > > > never call sock_put() and leak the count, skb, and sk.
> > > > >
> > > > > To avoid this problem, we must (i) call skb_queue_purge() after
> > > > > flagging SOCK_DEAD during close() and (ii) make sure that TX tstamp
> > > > > skb is not queued when SOCK_DEAD is flagged. UDP lacks (i) and (ii),
> > > > > and TCP lacks (ii).
> > > > >
> > > > > Without (ii), a skb queued in a qdisc or device could be put into
> > > > > the error queue after skb_queue_purge().
> > > > >
> > > > > sendmsg() /* return immediately, but packets
> > > > > * are queued in a qdisc or device
> > > > > */
> > > > > close()
> > > > > skb_queue_purge()
> > > > > __skb_tstamp_tx()
> > > > > __skb_complete_tx_timestamp()
> > > > > sock_queue_err_skb()
> > > > > skb_queue_tail()
> > > > >
> > > > > Also, we need to check SOCK_DEAD under sk->sk_error_queue.lock
> > > > > in sock_queue_err_skb() to avoid this race.
> > > > >
> > > > > if (!sock_flag(sk, SOCK_DEAD))
> > > > > sock_set_flag(sk, SOCK_DEAD)
> > > > > skb_queue_purge()
> > > > >
> > > > > skb_queue_tail()
> > > > >
> > > > > [0]:
> > > >
> > > > > Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
> > > > > Fixes: b5947e5d1e71 ("udp: msg_zerocopy")
> > > > > Reported-by: syzbot <syzkaller@googlegroups.com>
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > ---
> > > > > v2:
> > > > > * Move skb_queue_purge() after setting SOCK_DEAD in udp_destroy_sock()
> > > > > * Check SOCK_DEAD in sock_queue_err_skb() with sk_error_queue.lock
> > > > > * Add Fixes tag for TCP
> > > > >
> > > > > v1: https://lore.kernel.org/netdev/20230417171155.22916-1-kuniyu@amazon.com/
> > > > > ---
> > > > > net/core/skbuff.c | 15 ++++++++++++---
> > > > > net/ipv4/udp.c | 5 +++++
> > > > > 2 files changed, 17 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > index 4c0879798eb8..287b834df9c8 100644
> > > > > --- a/net/core/skbuff.c
> > > > > +++ b/net/core/skbuff.c
> > > > > @@ -4979,6 +4979,8 @@ static void skb_set_err_queue(struct sk_buff *skb)
> > > > > */
> > > > > int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > {
> > > > > + unsigned long flags;
> > > > > +
> > > > > if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
> > > > > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > > return -ENOMEM;
> > > > > @@ -4992,9 +4994,16 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > /* before exiting rcu section, make sure dst is refcounted */
> > > > > skb_dst_force(skb);
> > > > >
> > > > > - skb_queue_tail(&sk->sk_error_queue, skb);
> > > > > - if (!sock_flag(sk, SOCK_DEAD))
> > > > > - sk_error_report(sk);
> > > > > + spin_lock_irqsave(&sk->sk_error_queue.lock, flags);
> > > > > + if (sock_flag(sk, SOCK_DEAD)) {
> > > >
> > > > SOCK_DEAD is set without holding sk_error_queue.lock, so I wonder why you
> > > > want to add a confusing construct.
> > > >
> > > > Just bail early ?
> > > >
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index ef81452759be3fd251faaf76d89cfd002ee79256..fda05cb44f95821e98f8c5c05fba840a9d276abb
> > > > 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -4983,6 +4983,9 @@ int sock_queue_err_skb(struct sock *sk, struct
> > > > sk_buff *skb)
> > > > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > return -ENOMEM;
> > > >
> > > > + if (sock_flag(sk, SOCK_DEAD))
> > > > + return -EINVAL;
> > > > +
> > >
> > > Isn't it possible that these sequences happen
> > >
> > > close()
> > > sock_set_flag(sk, SOCK_DEAD);
> > > skb_queue_purge(&sk->sk_error_queue)
> > >
> > > between the skb_queue_tail() below ? (2nd race mentioned in changelog)
> > >
> > > I thought we can guarantee the ordering by taking the same lock.
> > >
> >
> > This is fragile.
>
> Yes, but I didn't have a better idea to avoid the race...
>
> >
> > We could very well rewrite skb_queue_purge() to not acquire the lock
> > in the common case.
> > I had the following in my tree for a while, to avoid many atomic and
> > irq masking operations...
>
> Cool, and it still works with my patch, no ?
>
Really, the only thing that ensures the race is not possible is the
typical sk_refcnt acquisition.

But I do not see why an skb stored in the error queue should keep a
refcnt on the socket. This seems like a chicken-and-egg problem, and
it has caused various issues in the past; see for instance [1].

We had better make sure the error queue is purged at socket dismantle
(after the refcnt reaches 0).

I suggest we generalize the fix that went in this commit:

dd4f10722aeb10f4f582948839f066bebe44e5fb net: fix socket refcounting
in skb_complete_wifi_ack()

Instead of adding yet another rule about a queue lock and a socket
flag, this would keep things tied to sk_refcnt.

Also, I do not think TCP has an issue; please take a look at

[1]
commit e0c8bccd40fc1c19e1d246c39bcf79e357e1ada3 net: stream: purge
sk_error_queue in sk_stream_kill_queues()

(A leak in TCP would be caught within a few hours by syzbot, really.)
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-19 9:34 ` Eric Dumazet
@ 2023-04-19 14:16 ` Willem de Bruijn
2023-04-19 17:07 ` Kuniyuki Iwashima
0 siblings, 1 reply; 14+ messages in thread
From: Willem de Bruijn @ 2023-04-19 14:16 UTC (permalink / raw)
To: Eric Dumazet, Kuniyuki Iwashima
Cc: davem, kuba, kuni1840, netdev, pabeni, syzkaller, willemb
Eric Dumazet wrote:
> [snip]
>
>
> Really the only thing that ensures a race is not possible is the
> typical sk_refcnt acquisition.
>
> But I do not see why an skb stored in error_queue should keep the
> refcnt on the socket.
> This seems like a chicken and egg problem, and caused various issues
> in the past,
> see for instance [1]
>
> We better make sure error queue is purged at socket dismantle (after
> refcnt reached 0)
The problem here is that with MSG_ZEROCOPY, the timestamp queued on
the error queue holds a reference on a ubuf, and that ubuf holds an
sk_ref. The timestamped packet may contain packet contents, so the
ubuf ref is not superfluous.

Come to think of it, we have always maintained that zerocopy packets
should not be looped to sockets where they can be queued indefinitely,
including packet sockets.

If we enforce that for these TX timestamps too, then that also
solves this issue.

A process that wants efficient MSG_ZEROCOPY will have to request
timestamping with SOF_TIMESTAMPING_OPT_TSONLY to avoid returning the
data along with the timestamp.
> I suggest we generalize fix that went in this patch :
>
> dd4f10722aeb10f4f582948839f066bebe44e5fb net: fix socket refcounting
> in skb_complete_wifi_ack()
>
> Instead of adding yet another rule about a queue lock, and a socket
> flag, this would keep things tied to sk_refcnt.
>
> Also I do not think TCP has an issue, please take a look at
>
> [1]
> commit e0c8bccd40fc1c19e1d246c39bcf79e357e1ada3 net: stream: purge
> sk_error_queue in sk_stream_kill_queues()
>
> (A leak in TCP would be caught in a few hours by syzbot really)
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-19 14:16 ` Willem de Bruijn
@ 2023-04-19 17:07 ` Kuniyuki Iwashima
2023-04-19 17:50 ` Willem de Bruijn
0 siblings, 1 reply; 14+ messages in thread
From: Kuniyuki Iwashima @ 2023-04-19 17:07 UTC (permalink / raw)
To: willemdebruijn.kernel
Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni,
syzkaller, willemb
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 19 Apr 2023 10:16:07 -0400
> Eric Dumazet wrote:
> > [snip]
> >
> > Really the only thing that ensures a race is not possible is the
> > typical sk_refcnt acquisition.
> >
> > But I do not see why an skb stored in error_queue should keep the
> > refcnt on the socket.
> > This seems like a chicken and egg problem, and caused various issues
> > in the past,
> > see for instance [1]
> >
> > We better make sure error queue is purged at socket dismantle (after
> > refcnt reached 0)
>
> The problem here is that the timestamp queued on the error queue
> holds a reference on a ubuf if MSG_ZEROCOPY and that ubuf holds an
> sk_ref.
>
> The timestamped packet may contain packet contents, so the ubuf
> ref is not superfluous.
>
> Come to think of it, we've always maintained that zerocopy packets
> should not be looped to sockets where they can be queued indefinitely,
> including packet sockets.
>
> If we enforce that for these tx timestamps too, then that also
> solves this issue.
>
> A process that wants efficient MSG_ZEROCOPY will have to request
> timestamping with SOF_TIMESTAMPING_OPT_TSONLY to avoid returning the
> data along with the timestamp.
Actually, my first attempt was similar to this: it silently avoided
skb_clone() if MSG_ZEROCOPY. But that kind of change could break users
who were already using tstamps and just added MSG_ZEROCOPY logic to
their app, so I placed skb_queue_purge() during close() instead.
---8<---
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index eb7d33b41e71..9318b438888e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5135,7 +5149,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
if (!skb_may_tx_timestamp(sk, tsonly))
return;
- if (tsonly) {
+ if (tsonly || skb_zcopy(orig_skb)) {
#ifdef CONFIG_INET
if ((sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
sk_is_tcp(sk)) {
---8<---
>
> > I suggest we generalize fix that went in this patch :
> >
> > dd4f10722aeb10f4f582948839f066bebe44e5fb net: fix socket refcounting
> > in skb_complete_wifi_ack()
> >
> > Instead of adding yet another rule about a queue lock, and a socket
> > flag, this would keep things tied to sk_refcnt.
> >
> > Also I do not think TCP has an issue, please take a look at
> >
> > [1]
> > commit e0c8bccd40fc1c19e1d246c39bcf79e357e1ada3 net: stream: purge
> > sk_error_queue in sk_stream_kill_queues()
> >
> > (A leak in TCP would be caught in a few hours by syzbot really)
I had tested TCP before posting v1 and found that this
skb_queue_purge() was there to prevent the same issue. But Willem's
point sounds reasonable: packets queued in a qdisc or device could be
put into the error queue after close() completes. That's why I have
mentioned TCP since v2.
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-19 17:07 ` Kuniyuki Iwashima
@ 2023-04-19 17:50 ` Willem de Bruijn
2023-04-19 22:43 ` Kuniyuki Iwashima
0 siblings, 1 reply; 14+ messages in thread
From: Willem de Bruijn @ 2023-04-19 17:50 UTC (permalink / raw)
To: Kuniyuki Iwashima, willemdebruijn.kernel
Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni,
syzkaller, willemb
Kuniyuki Iwashima wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Wed, 19 Apr 2023 10:16:07 -0400
> > Eric Dumazet wrote:
> > > [snip]
> > >
> > > Really the only thing that ensures a race is not possible is the
> > > typical sk_refcnt acquisition.
> > >
> > > But I do not see why an skb stored in error_queue should keep the
> > > refcnt on the socket.
> > > This seems like a chicken and egg problem, and caused various issues
> > > in the past,
> > > see for instance [1]
> > >
> > > We better make sure error queue is purged at socket dismantle (after
> > > refcnt reached 0)
> >
> > The problem here is that the timestamp queued on the error queue
> > holds a reference on a ubuf if MSG_ZEROCOPY and that ubuf holds an
> > sk_ref.
> >
> > The timestamped packet may contain packet contents, so the ubuf
> > ref is not superfluous.
> >
> > Come to think of it, we've always maintained that zerocopy packets
> > should not be looped to sockets where they can be queued indefinitely,
> > including packet sockets.
> >
> > If we enforce that for these tx timestamps too, then that also
> > solves this issue.
> >
> > A process that wants efficient MSG_ZEROCOPY will have to request
> > timestamping with SOF_TIMESTAMPING_OPT_TSONLY to avoid returning the
> > data along with the timestamp.
>
> Actually, my first attempt was similar to this that avoids skb_clone()
> silently if MSG_ZEROCOPY, but this kind of way could break users who
> were using tstamp and just added MSG_ZEROCOPY logic to their app, so
> I placed skb_queue_purge() during close().
>
> ---8<---
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index eb7d33b41e71..9318b438888e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5135,7 +5149,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> if (!skb_may_tx_timestamp(sk, tsonly))
> return;
>
> - if (tsonly) {
> + if (tsonly || skb_zcopy(orig_skb)) {
> #ifdef CONFIG_INET
> if ((sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
> sk_is_tcp(sk)) {
> ---8<---
Actually, the skb_clone in __skb_tstamp_tx should already release
the reference on the ubuf.

With the same mechanism that we rely on for packet sockets, e.g.,
in dev_queue_xmit_nit.

skb_clone calls skb_orphan_frags calls skb_copy_ubufs for zerocopy
skbs. Which creates a copy of the data and calls skb_zcopy_clear.

The skb that gets queued onto the error queue should not have a
reference on an ubuf: skb_zcopy(skb) should return NULL.
>
> >
> > > I suggest we generalize fix that went in this patch :
> > >
> > > dd4f10722aeb10f4f582948839f066bebe44e5fb net: fix socket refcounting
> > > in skb_complete_wifi_ack()
> > >
> > > Instead of adding yet another rule about a queue lock, and a socket
> > > flag, this would keep things tied to sk_refcnt.
> > >
> > > Also I do not think TCP has an issue, please take a look at
> > >
> > > [1]
> > > commit e0c8bccd40fc1c19e1d246c39bcf79e357e1ada3 net: stream: purge
> > > sk_error_queue in sk_stream_kill_queues()
> > >
> > > (A leak in TCP would be caught in a few hours by syzbot really)
>
> I have tested TCP before posting v1 and found this skb_queue_purge()
> was to prevent the same issue, but Willem's point sounds reasonable
> that packets queued in qdisc or device could be put in the error queue
> after close() completes. That's why I mentioned TCP since v2.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-19 17:50 ` Willem de Bruijn
@ 2023-04-19 22:43 ` Kuniyuki Iwashima
2023-04-20 2:20 ` Willem de Bruijn
0 siblings, 1 reply; 14+ messages in thread
From: Kuniyuki Iwashima @ 2023-04-19 22:43 UTC (permalink / raw)
To: willemdebruijn.kernel
Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni,
syzkaller, willemb
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 19 Apr 2023 13:50:47 -0400
> Kuniyuki Iwashima wrote:
> > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Date: Wed, 19 Apr 2023 10:16:07 -0400
> > > Eric Dumazet wrote:
> > > > On Tue, Apr 18, 2023 at 9:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > Date: Tue, 18 Apr 2023 21:04:32 +0200
> > > > > > On Tue, Apr 18, 2023 at 8:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > >
> > > > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > > > Date: Tue, 18 Apr 2023 20:33:44 +0200
> > > > > > > > On Tue, Apr 18, 2023 at 8:09 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > > > >
> > > > > > > > > syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
> > > > > > > > > skbs. We can reproduce the problem with these sequences:
> > > > > > > > >
> > > > > > > > > sk = socket(AF_INET, SOCK_DGRAM, 0)
> > > > > > > > > sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
> > > > > > > > > sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
> > > > > > > > > sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
> > > > > > > > > sk.close()
> > > > > > > > >
> > > > > > > > > sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
> > > > > > > > > skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
> > > > > > > > > ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
> > > > > > > > > skb is sent, __skb_tstamp_tx() clones it and puts the clone into
> > > > > > > > > the socket's error queue with the TX timestamp.
> > > > > > > > >
> > > > > > > > > When the original skb is received locally, skb_copy_ubufs() calls
> > > > > > > > > skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
> > > > > > > > > This additional count is decremented while freeing the skb, but struct
> > > > > > > > > ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
> > > > > > > > > not called.
> > > > > > > > >
> > > > > > > > > The last refcnt is not released unless we retrieve the TX timestamped
> > > > > > > > > skb by recvmsg(). When we close() the socket holding such skb, we
> > > > > > > > > never call sock_put() and leak the count, skb, and sk.
> > > > > > > > >
> > > > > > > > > To avoid this problem, we must (i) call skb_queue_purge() after
> > > > > > > > > flagging SOCK_DEAD during close() and (ii) make sure that TX tstamp
> > > > > > > > > skb is not queued when SOCK_DEAD is flagged. UDP lacks (i) and (ii),
> > > > > > > > > and TCP lacks (ii).
> > > > > > > > >
> > > > > > > > > Without (ii), a skb queued in a qdisc or device could be put into
> > > > > > > > > the error queue after skb_queue_purge().
> > > > > > > > >
> > > > > > > > > sendmsg() /* return immediately, but packets
> > > > > > > > > * are queued in a qdisc or device
> > > > > > > > > */
> > > > > > > > > close()
> > > > > > > > > skb_queue_purge()
> > > > > > > > > __skb_tstamp_tx()
> > > > > > > > > __skb_complete_tx_timestamp()
> > > > > > > > > sock_queue_err_skb()
> > > > > > > > > skb_queue_tail()
> > > > > > > > >
> > > > > > > > > Also, we need to check SOCK_DEAD under sk->sk_error_queue.lock
> > > > > > > > > in sock_queue_err_skb() to avoid this race.
> > > > > > > > >
> > > > > > > > > if (!sock_flag(sk, SOCK_DEAD))
> > > > > > > > > sock_set_flag(sk, SOCK_DEAD)
> > > > > > > > > skb_queue_purge()
> > > > > > > > >
> > > > > > > > > skb_queue_tail()
> > > > > > > > >
> > > > > > > > > [0]:
> > > > > > > >
> > > > > > > > > Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
> > > > > > > > > Fixes: b5947e5d1e71 ("udp: msg_zerocopy")
> > > > > > > > > Reported-by: syzbot <syzkaller@googlegroups.com>
> > > > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > > > > ---
> > > > > > > > > v2:
> > > > > > > > > * Move skb_queue_purge() after setting SOCK_DEAD in udp_destroy_sock()
> > > > > > > > > * Check SOCK_DEAD in sock_queue_err_skb() with sk_error_queue.lock
> > > > > > > > > * Add Fixes tag for TCP
> > > > > > > > >
> > > > > > > > > v1: https://lore.kernel.org/netdev/20230417171155.22916-1-kuniyu@amazon.com/
> > > > > > > > > ---
> > > > > > > > > net/core/skbuff.c | 15 ++++++++++++---
> > > > > > > > > net/ipv4/udp.c | 5 +++++
> > > > > > > > > 2 files changed, 17 insertions(+), 3 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > > > > > index 4c0879798eb8..287b834df9c8 100644
> > > > > > > > > --- a/net/core/skbuff.c
> > > > > > > > > +++ b/net/core/skbuff.c
> > > > > > > > > @@ -4979,6 +4979,8 @@ static void skb_set_err_queue(struct sk_buff *skb)
> > > > > > > > > */
> > > > > > > > > int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > > > > > {
> > > > > > > > > + unsigned long flags;
> > > > > > > > > +
> > > > > > > > > if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
> > > > > > > > > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > > > > > > return -ENOMEM;
> > > > > > > > > @@ -4992,9 +4994,16 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > > > > > /* before exiting rcu section, make sure dst is refcounted */
> > > > > > > > > skb_dst_force(skb);
> > > > > > > > >
> > > > > > > > > - skb_queue_tail(&sk->sk_error_queue, skb);
> > > > > > > > > - if (!sock_flag(sk, SOCK_DEAD))
> > > > > > > > > - sk_error_report(sk);
> > > > > > > > > + spin_lock_irqsave(&sk->sk_error_queue.lock, flags);
> > > > > > > > > + if (sock_flag(sk, SOCK_DEAD)) {
> > > > > > > >
> > > > > > > > SOCK_DEAD is set without holding sk_error_queue.lock, so I wonder why you
> > > > > > > > want to add a confusing construct.
> > > > > > > >
> > > > > > > > Just bail early ?
> > > > > > > >
> > > > > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > > > > index ef81452759be3fd251faaf76d89cfd002ee79256..fda05cb44f95821e98f8c5c05fba840a9d276abb
> > > > > > > > 100644
> > > > > > > > --- a/net/core/skbuff.c
> > > > > > > > +++ b/net/core/skbuff.c
> > > > > > > > @@ -4983,6 +4983,9 @@ int sock_queue_err_skb(struct sock *sk, struct
> > > > > > > > sk_buff *skb)
> > > > > > > > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > > > > > return -ENOMEM;
> > > > > > > >
> > > > > > > > + if (sock_flag(sk, SOCK_DEAD))
> > > > > > > > + return -EINVAL;
> > > > > > > > +
> > > > > > >
> > > > > > > Isn't it possible that these sequences happen
> > > > > > >
> > > > > > > close()
> > > > > > > sock_set_flag(sk, SOCK_DEAD);
> > > > > > > skb_queue_purge(&sk->sk_error_queue)
> > > > > > >
> > > > > > > between the skb_queue_tail() below ? (2nd race mentioned in changelog)
> > > > > > >
> > > > > > > I thought we can guarantee the ordering by taking the same lock.
> > > > > > >
> > > > > >
> > > > > > This is fragile.
> > > > >
> > > > > Yes, but I didn't have better idea to avoid the race...
> > > > >
> > > > > >
> > > > > > We could very well rewrite skb_queue_purge() to not acquire the lock
> > > > > > in the common case.
> > > > > > I had the following in my tree for a while, to avoid many atomic and
> > > > > > irq masking operations...
> > > > >
> > > > > Cool, and it still works with my patch, no ?
> > > > >
> > > >
> > > >
> > > > Really the only thing that ensures a race is not possible is the
> > > > typical sk_refcnt acquisition.
> > > >
> > > > But I do not see why an skb stored in error_queue should keep the
> > > > refcnt on the socket.
> > > > This seems like a chicken and egg problem, and caused various issues
> > > > in the past,
> > > > see for instance [1]
> > > >
> > > > We better make sure error queue is purged at socket dismantle (after
> > > > refcnt reached 0)
> > >
> > > The problem here is that the timestamp queued on the error queue
> > > holds a reference on a ubuf if MSG_ZEROCOPY and that ubuf holds an
> > > sk_ref.
> > >
> > > The timestamped packet may contain packet contents, so the ubuf
> > > ref is not superfluous.
> > >
> > > Come to think of it, we've always maintained that zerocopy packets
> > > should not be looped to sockets where they can be queued indefinitely,
> > > including packet sockets.
> > >
> > > If we enforce that for these tx timestamps too, then that also
> > > solves this issue.
> > >
> > > A process that wants efficient MSG_ZEROCOPY will have to request
> > > timestamping with SOF_TIMESTAMPING_OPT_TSONLY to avoid returning the
> > > data along with the timestamp.
> >
> > Actually, my first attempt was similar to this that avoids skb_clone()
> > silently if MSG_ZEROCOPY, but this kind of way could break users who
> > were using tstamp and just added MSG_ZEROCOPY logic to their app, so
> > I placed skb_queue_purge() during close().
> >
> > ---8<---
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index eb7d33b41e71..9318b438888e 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5135,7 +5149,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> > if (!skb_may_tx_timestamp(sk, tsonly))
> > return;
> >
> > - if (tsonly) {
> > + if (tsonly || skb_zcopy(orig_skb)) {
> > #ifdef CONFIG_INET
> > if ((sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
> > sk_is_tcp(sk)) {
> > ---8<---
>
> Actually, the skb_clone in __skb_tstamp_tx should already release
> the reference on the ubuf.
>
> With the same mechanism that we rely on for packet sockets, e.g.,
> in dev_queue_xmit_nit.
>
> skb_clone calls skb_orphan_frags calls skb_copy_ubufs for zerocopy
> skbs. Which creates a copy of the data and calls skb_zcopy_clear.
>
> The skb that gets queued onto the error queue should not have a
> reference on an ubuf: skb_zcopy(skb) should return NULL.
Exactly, so how about this ?
---8<---
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 768f9d04911f..0fa0b2ac7071 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5166,6 +5166,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
if (!skb)
return;
+ if (skb_zcopy(skb) && skb_copy_ubufs(skb, GFP_ATOMIC))
+ return;
+
if (tsonly) {
skb_shinfo(skb)->tx_flags |= skb_shinfo(orig_skb)->tx_flags &
SKBTX_ANY_TSTAMP;
---8<---
>
> >
> > >
> > > > I suggest we generalize fix that went in this patch :
> > > >
> > > > dd4f10722aeb10f4f582948839f066bebe44e5fb net: fix socket refcounting
> > > > in skb_complete_wifi_ack()
> > > >
> > > > Instead of adding yet another rule about a queue lock, and a socket
> > > > flag, this would keep things tied to sk_refcnt.
> > > >
> > > > Also I do not think TCP has an issue, please take a look at
> > > >
> > > > [1]
> > > > commit e0c8bccd40fc1c19e1d246c39bcf79e357e1ada3 net: stream: purge
> > > > sk_error_queue in sk_stream_kill_queues()
> > > >
> > > > (A leak in TCP would be caught in a few hours by syzbot really)
> >
> > I have tested TCP before posting v1 and found this skb_queue_purge()
> > was to prevent the same issue, but Willem's point sounds reasonable
> > that packets queued in qdisc or device could be put in the error queue
> > after close() completes. That's why I mentioned TCP since v2.
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-19 22:43 ` Kuniyuki Iwashima
@ 2023-04-20 2:20 ` Willem de Bruijn
2023-04-20 3:30 ` Kuniyuki Iwashima
0 siblings, 1 reply; 14+ messages in thread
From: Willem de Bruijn @ 2023-04-20 2:20 UTC (permalink / raw)
To: Kuniyuki Iwashima, willemdebruijn.kernel
Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni,
syzkaller, willemb
Kuniyuki Iwashima wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Wed, 19 Apr 2023 13:50:47 -0400
> > Kuniyuki Iwashima wrote:
> > > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > > Date: Wed, 19 Apr 2023 10:16:07 -0400
> > > > Eric Dumazet wrote:
> > > > > On Tue, Apr 18, 2023 at 9:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > >
> > > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > > Date: Tue, 18 Apr 2023 21:04:32 +0200
> > > > > > > On Tue, Apr 18, 2023 at 8:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > > >
> > > > > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > > > > Date: Tue, 18 Apr 2023 20:33:44 +0200
> > > > > > > > > On Tue, Apr 18, 2023 at 8:09 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > > > > >
> > > > > > > > > > syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
> > > > > > > > > > skbs. We can reproduce the problem with these sequences:
> > > > > > > > > >
> > > > > > > > > > sk = socket(AF_INET, SOCK_DGRAM, 0)
> > > > > > > > > > sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
> > > > > > > > > > sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
> > > > > > > > > > sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
> > > > > > > > > > sk.close()
> > > > > > > > > >
> > > > > > > > > > sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
> > > > > > > > > > skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
> > > > > > > > > > ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
> > > > > > > > > > skb is sent, __skb_tstamp_tx() clones it and puts the clone into
> > > > > > > > > > the socket's error queue with the TX timestamp.
> > > > > > > > > >
> > > > > > > > > > When the original skb is received locally, skb_copy_ubufs() calls
> > > > > > > > > > skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
> > > > > > > > > > This additional count is decremented while freeing the skb, but struct
> > > > > > > > > > ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
> > > > > > > > > > not called.
> > > > > > > > > >
> > > > > > > > > > The last refcnt is not released unless we retrieve the TX timestamped
> > > > > > > > > > skb by recvmsg(). When we close() the socket holding such skb, we
> > > > > > > > > > never call sock_put() and leak the count, skb, and sk.
> > > > > > > > > >
> > > > > > > > > > To avoid this problem, we must (i) call skb_queue_purge() after
> > > > > > > > > > flagging SOCK_DEAD during close() and (ii) make sure that TX tstamp
> > > > > > > > > > skb is not queued when SOCK_DEAD is flagged. UDP lacks (i) and (ii),
> > > > > > > > > > and TCP lacks (ii).
> > > > > > > > > >
> > > > > > > > > > Without (ii), a skb queued in a qdisc or device could be put into
> > > > > > > > > > the error queue after skb_queue_purge().
> > > > > > > > > >
> > > > > > > > > > sendmsg() /* return immediately, but packets
> > > > > > > > > > * are queued in a qdisc or device
> > > > > > > > > > */
> > > > > > > > > > close()
> > > > > > > > > > skb_queue_purge()
> > > > > > > > > > __skb_tstamp_tx()
> > > > > > > > > > __skb_complete_tx_timestamp()
> > > > > > > > > > sock_queue_err_skb()
> > > > > > > > > > skb_queue_tail()
> > > > > > > > > >
> > > > > > > > > > Also, we need to check SOCK_DEAD under sk->sk_error_queue.lock
> > > > > > > > > > in sock_queue_err_skb() to avoid this race.
> > > > > > > > > >
> > > > > > > > > > if (!sock_flag(sk, SOCK_DEAD))
> > > > > > > > > > sock_set_flag(sk, SOCK_DEAD)
> > > > > > > > > > skb_queue_purge()
> > > > > > > > > >
> > > > > > > > > > skb_queue_tail()
> > > > > > > > > >
> > > > > > > > > > [0]:
> > > > > > > > >
> > > > > > > > > > Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
> > > > > > > > > > Fixes: b5947e5d1e71 ("udp: msg_zerocopy")
> > > > > > > > > > Reported-by: syzbot <syzkaller@googlegroups.com>
> > > > > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > > > > > ---
> > > > > > > > > > v2:
> > > > > > > > > > * Move skb_queue_purge() after setting SOCK_DEAD in udp_destroy_sock()
> > > > > > > > > > * Check SOCK_DEAD in sock_queue_err_skb() with sk_error_queue.lock
> > > > > > > > > > * Add Fixes tag for TCP
> > > > > > > > > >
> > > > > > > > > > v1: https://lore.kernel.org/netdev/20230417171155.22916-1-kuniyu@amazon.com/
> > > > > > > > > > ---
> > > > > > > > > > net/core/skbuff.c | 15 ++++++++++++---
> > > > > > > > > > net/ipv4/udp.c | 5 +++++
> > > > > > > > > > 2 files changed, 17 insertions(+), 3 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > > > > > > index 4c0879798eb8..287b834df9c8 100644
> > > > > > > > > > --- a/net/core/skbuff.c
> > > > > > > > > > +++ b/net/core/skbuff.c
> > > > > > > > > > @@ -4979,6 +4979,8 @@ static void skb_set_err_queue(struct sk_buff *skb)
> > > > > > > > > > */
> > > > > > > > > > int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > > > > > > {
> > > > > > > > > > + unsigned long flags;
> > > > > > > > > > +
> > > > > > > > > > if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
> > > > > > > > > > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > > > > > > > return -ENOMEM;
> > > > > > > > > > @@ -4992,9 +4994,16 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > > > > > > /* before exiting rcu section, make sure dst is refcounted */
> > > > > > > > > > skb_dst_force(skb);
> > > > > > > > > >
> > > > > > > > > > - skb_queue_tail(&sk->sk_error_queue, skb);
> > > > > > > > > > - if (!sock_flag(sk, SOCK_DEAD))
> > > > > > > > > > - sk_error_report(sk);
> > > > > > > > > > + spin_lock_irqsave(&sk->sk_error_queue.lock, flags);
> > > > > > > > > > + if (sock_flag(sk, SOCK_DEAD)) {
> > > > > > > > >
> > > > > > > > > SOCK_DEAD is set without holding sk_error_queue.lock, so I wonder why you
> > > > > > > > > want to add a confusing construct.
> > > > > > > > >
> > > > > > > > > Just bail early ?
> > > > > > > > >
> > > > > > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > > > > > index ef81452759be3fd251faaf76d89cfd002ee79256..fda05cb44f95821e98f8c5c05fba840a9d276abb
> > > > > > > > > 100644
> > > > > > > > > --- a/net/core/skbuff.c
> > > > > > > > > +++ b/net/core/skbuff.c
> > > > > > > > > @@ -4983,6 +4983,9 @@ int sock_queue_err_skb(struct sock *sk, struct
> > > > > > > > > sk_buff *skb)
> > > > > > > > > (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > > > > > > return -ENOMEM;
> > > > > > > > >
> > > > > > > > > + if (sock_flag(sk, SOCK_DEAD))
> > > > > > > > > + return -EINVAL;
> > > > > > > > > +
> > > > > > > >
> > > > > > > > Isn't it possible that these sequences happen
> > > > > > > >
> > > > > > > > close()
> > > > > > > > sock_set_flag(sk, SOCK_DEAD);
> > > > > > > > skb_queue_purge(&sk->sk_error_queue)
> > > > > > > >
> > > > > > > > between the skb_queue_tail() below ? (2nd race mentioned in changelog)
> > > > > > > >
> > > > > > > > I thought we can guarantee the ordering by taking the same lock.
> > > > > > > >
> > > > > > >
> > > > > > > This is fragile.
> > > > > >
> > > > > > Yes, but I didn't have better idea to avoid the race...
> > > > > >
> > > > > > >
> > > > > > > We could very well rewrite skb_queue_purge() to not acquire the lock
> > > > > > > in the common case.
> > > > > > > I had the following in my tree for a while, to avoid many atomic and
> > > > > > > irq masking operations...
> > > > > >
> > > > > > Cool, and it still works with my patch, no ?
> > > > > >
> > > > >
> > > > >
> > > > > Really the only thing that ensures a race is not possible is the
> > > > > typical sk_refcnt acquisition.
> > > > >
> > > > > But I do not see why an skb stored in error_queue should keep the
> > > > > refcnt on the socket.
> > > > > This seems like a chicken and egg problem, and caused various issues
> > > > > in the past,
> > > > > see for instance [1]
> > > > >
> > > > > We better make sure error queue is purged at socket dismantle (after
> > > > > refcnt reached 0)
> > > >
> > > > The problem here is that the timestamp queued on the error queue
> > > > holds a reference on a ubuf if MSG_ZEROCOPY and that ubuf holds an
> > > > sk_ref.
> > > >
> > > > The timestamped packet may contain packet contents, so the ubuf
> > > > ref is not superfluous.
> > > >
> > > > Come to think of it, we've always maintained that zerocopy packets
> > > > should not be looped to sockets where they can be queued indefinitely,
> > > > including packet sockets.
> > > >
> > > > If we enforce that for these tx timestamps too, then that also
> > > > solves this issue.
> > > >
> > > > A process that wants efficient MSG_ZEROCOPY will have to request
> > > > timestamping with SOF_TIMESTAMPING_OPT_TSONLY to avoid returning the
> > > > data along with the timestamp.
> > >
> > > Actually, my first attempt was similar to this that avoids skb_clone()
> > > silently if MSG_ZEROCOPY, but this kind of way could break users who
> > > were using tstamp and just added MSG_ZEROCOPY logic to their app, so
> > > I placed skb_queue_purge() during close().
> > >
> > > ---8<---
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index eb7d33b41e71..9318b438888e 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5135,7 +5149,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> > > if (!skb_may_tx_timestamp(sk, tsonly))
> > > return;
> > >
> > > - if (tsonly) {
> > > + if (tsonly || skb_zcopy(orig_skb)) {
> > > #ifdef CONFIG_INET
> > > if ((sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
> > > sk_is_tcp(sk)) {
> > > ---8<---
> >
> > Actually, the skb_clone in __skb_tstamp_tx should already release
> > the reference on the ubuf.
> >
> > With the same mechanism that we rely on for packet sockets, e.g.,
> > in dev_queue_xmit_nit.
> >
> > skb_clone calls skb_orphan_frags calls skb_copy_ubufs for zerocopy
> > skbs. Which creates a copy of the data and calls skb_zcopy_clear.
> >
> > The skb that gets queued onto the error queue should not have a
> > reference on an ubuf: skb_zcopy(skb) should return NULL.
>
> Exactly, so how about this ?
>
> ---8<---
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 768f9d04911f..0fa0b2ac7071 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5166,6 +5166,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> if (!skb)
> return;
>
> + if (skb_zcopy(skb) && skb_copy_ubufs(skb, GFP_ATOMIC))
> + return;
> +
> if (tsonly) {
> skb_shinfo(skb)->tx_flags |= skb_shinfo(orig_skb)->tx_flags &
> SKBTX_ANY_TSTAMP;
> ---8<---
>
What I meant was that, given this, I don't understand how a packet
with ubuf references gets queued at all.

__skb_tstamp_tx does not queue orig_skb. It either allocates a new
skb or calls skb = skb_clone(orig_skb).

That existing call internally calls skb_orphan_frags and
skb_copy_ubufs.

So the extra test should not be needed. Indeed I would be surprised if
this triggers:
@@ -5164,6 +5164,8 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
if (!skb)
return;
+ WARN_ON_ONCE(skb_zcopy(skb));
+
if (tsonly) {
skb_shinfo(skb)->tx_flags |= skb_shinfo(orig_skb)->tx_flags &
SKBTX_ANY_TSTAMP;
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-20 2:20 ` Willem de Bruijn
@ 2023-04-20 3:30 ` Kuniyuki Iwashima
2023-04-20 13:14 ` Willem de Bruijn
0 siblings, 1 reply; 14+ messages in thread
From: Kuniyuki Iwashima @ 2023-04-20 3:30 UTC (permalink / raw)
To: willemdebruijn.kernel
Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni,
syzkaller, willemb
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 19 Apr 2023 22:20:07 -0400
> Kuniyuki Iwashima wrote:
> > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Date: Wed, 19 Apr 2023 13:50:47 -0400
> > > Kuniyuki Iwashima wrote:
> > > > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > > > Date: Wed, 19 Apr 2023 10:16:07 -0400
> > > > > Eric Dumazet wrote:
> > > > > > On Tue, Apr 18, 2023 at 9:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > >
> > > > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > > > Date: Tue, 18 Apr 2023 21:04:32 +0200
> > > > > > > > On Tue, Apr 18, 2023 at 8:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > > > >
> > > > > > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > > > > > Date: Tue, 18 Apr 2023 20:33:44 +0200
> > > > > > > > > > On Tue, Apr 18, 2023 at 8:09 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
> > > > > > > > > > > skbs. We can reproduce the problem with these sequences:
> > > > > > > > > > >
> > > > > > > > > > > sk = socket(AF_INET, SOCK_DGRAM, 0)
> > > > > > > > > > > sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
> > > > > > > > > > > sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
> > > > > > > > > > > sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
> > > > > > > > > > > sk.close()
> > > > > > > > > > >
> > > > > > > > > > > sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
> > > > > > > > > > > skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
> > > > > > > > > > > ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
> > > > > > > > > > > skb is sent, __skb_tstamp_tx() clones it and puts the clone into
> > > > > > > > > > > the socket's error queue with the TX timestamp.
> > > > > > > > > > >
> > > > > > > > > > > When the original skb is received locally, skb_copy_ubufs() calls
> > > > > > > > > > > skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
> > > > > > > > > > > This additional count is decremented while freeing the skb, but struct
> > > > > > > > > > > ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
> > > > > > > > > > > not called.
> > > > > > > > > > >
> > > > > > > > > > > The last refcnt is not released unless we retrieve the TX timestamped
> > > > > > > > > > > skb by recvmsg(). When we close() the socket holding such skb, we
> > > > > > > > > > > never call sock_put() and leak the count, skb, and sk.
> > > > > > > > > > >
> > > > > > > > > > > To avoid this problem, we must (i) call skb_queue_purge() after
> > > > > > > > > > > flagging SOCK_DEAD during close() and (ii) make sure that TX tstamp
> > > > > > > > > > > skb is not queued when SOCK_DEAD is flagged. UDP lacks (i) and (ii),
> > > > > > > > > > > and TCP lacks (ii).
> > > > > > > > > > >
> > > > > > > > > > > Without (ii), a skb queued in a qdisc or device could be put into
> > > > > > > > > > > the error queue after skb_queue_purge().
> > > > > > > > > > >
> > > > > > > > > > > sendmsg() /* return immediately, but packets
> > > > > > > > > > > * are queued in a qdisc or device
> > > > > > > > > > > */
> > > > > > > > > > > close()
> > > > > > > > > > > skb_queue_purge()
> > > > > > > > > > > __skb_tstamp_tx()
> > > > > > > > > > > __skb_complete_tx_timestamp()
> > > > > > > > > > > sock_queue_err_skb()
> > > > > > > > > > > skb_queue_tail()
> > > > > > > > > > >
> > > > > > > > > > > Also, we need to check SOCK_DEAD under sk->sk_error_queue.lock
> > > > > > > > > > > in sock_queue_err_skb() to avoid this race.
> > > > > > > > > > >
> > > > > > > > > > > if (!sock_flag(sk, SOCK_DEAD))
> > > > > > > > > > > sock_set_flag(sk, SOCK_DEAD)
> > > > > > > > > > > skb_queue_purge()
> > > > > > > > > > >
> > > > > > > > > > > skb_queue_tail()
> > > > > > > > > > >
> > > > > > > > > > > [0]:
> > > > > > > > > >
> > > > > > > > > > > Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
> > > > > > > > > > > Fixes: b5947e5d1e71 ("udp: msg_zerocopy")
> > > > > > > > > > > Reported-by: syzbot <syzkaller@googlegroups.com>
> > > > > > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > > > > > > ---
> > > > > > > > > > > v2:
> > > > > > > > > > > * Move skb_queue_purge() after setting SOCK_DEAD in udp_destroy_sock()
> > > > > > > > > > > * Check SOCK_DEAD in sock_queue_err_skb() with sk_error_queue.lock
> > > > > > > > > > > * Add Fixes tag for TCP
> > > > > > > > > > >
> > > > > > > > > > > v1: https://lore.kernel.org/netdev/20230417171155.22916-1-kuniyu@amazon.com/
> > > > > > > > > > > ---
> > > > > > > > > > > net/core/skbuff.c | 15 ++++++++++++---
> > > > > > > > > > > net/ipv4/udp.c | 5 +++++
> > > > > > > > > > > 2 files changed, 17 insertions(+), 3 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > > > > > > > index 4c0879798eb8..287b834df9c8 100644
> > > > > > > > > > > --- a/net/core/skbuff.c
> > > > > > > > > > > +++ b/net/core/skbuff.c
> > > > > > > > > > > @@ -4979,6 +4979,8 @@ static void skb_set_err_queue(struct sk_buff *skb)
> > > > > > > > > > >   */
> > > > > > > > > > >  int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > > > > > > >  {
> > > > > > > > > > > +	unsigned long flags;
> > > > > > > > > > > +
> > > > > > > > > > >  	if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
> > > > > > > > > > >  	    (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > > > > > > > >  		return -ENOMEM;
> > > > > > > > > > > @@ -4992,9 +4994,16 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > > > > > > >  	/* before exiting rcu section, make sure dst is refcounted */
> > > > > > > > > > >  	skb_dst_force(skb);
> > > > > > > > > > >
> > > > > > > > > > > -	skb_queue_tail(&sk->sk_error_queue, skb);
> > > > > > > > > > > -	if (!sock_flag(sk, SOCK_DEAD))
> > > > > > > > > > > -		sk_error_report(sk);
> > > > > > > > > > > +	spin_lock_irqsave(&sk->sk_error_queue.lock, flags);
> > > > > > > > > > > +	if (sock_flag(sk, SOCK_DEAD)) {
> > > > > > > > > >
> > > > > > > > > > SOCK_DEAD is set without holding sk_error_queue.lock, so I wonder why you
> > > > > > > > > > want to add a confusing construct.
> > > > > > > > > >
> > > > > > > > > > Just bail early ?
> > > > > > > > > >
> > > > > > > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > > > > > > index ef81452759be3fd251faaf76d89cfd002ee79256..fda05cb44f95821e98f8c5c05fba840a9d276abb 100644
> > > > > > > > > > --- a/net/core/skbuff.c
> > > > > > > > > > +++ b/net/core/skbuff.c
> > > > > > > > > > @@ -4983,6 +4983,9 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > > > > > >  	    (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > > > > > > >  		return -ENOMEM;
> > > > > > > > > >
> > > > > > > > > > +	if (sock_flag(sk, SOCK_DEAD))
> > > > > > > > > > +		return -EINVAL;
> > > > > > > > > > +
> > > > > > > > >
> > > > > > > > > Isn't it possible that these sequences happen
> > > > > > > > >
> > > > > > > > > close()
> > > > > > > > >   sock_set_flag(sk, SOCK_DEAD);
> > > > > > > > >   skb_queue_purge(&sk->sk_error_queue)
> > > > > > > > >
> > > > > > > > > between the skb_queue_tail() below ? (2nd race mentioned in changelog)
> > > > > > > > >
> > > > > > > > > I thought we can guarantee the ordering by taking the same lock.
> > > > > > > > >
> > > > > > > >
> > > > > > > > This is fragile.
> > > > > > >
> > > > > > > Yes, but I didn't have better idea to avoid the race...
> > > > > > >
> > > > > > > >
> > > > > > > > We could very well rewrite skb_queue_purge() to not acquire the lock
> > > > > > > > in the common case.
> > > > > > > > I had the following in my tree for a while, to avoid many atomic and
> > > > > > > > irq masking operations...
> > > > > > >
> > > > > > > Cool, and it still works with my patch, no ?
> > > > > > >
> > > > > >
> > > > > >
> > > > > > Really the only thing that ensures a race is not possible is the
> > > > > > typical sk_refcnt acquisition.
> > > > > >
> > > > > > But I do not see why an skb stored in error_queue should keep the
> > > > > > refcnt on the socket.
> > > > > > This seems like a chicken and egg problem, and caused various issues
> > > > > > in the past,
> > > > > > see for instance [1]
> > > > > >
> > > > > > We better make sure error queue is purged at socket dismantle (after
> > > > > > refcnt reached 0)
> > > > >
> > > > > The problem here is that the timestamp queued on the error queue
> > > > > holds a reference on a ubuf if MSG_ZEROCOPY and that ubuf holds an
> > > > > sk_ref.
> > > > >
> > > > > The timestamped packet may contain packet contents, so the ubuf
> > > > > ref is not superfluous.
> > > > >
> > > > > Come to think of it, we've always maintained that zerocopy packets
> > > > > should not be looped to sockets where they can be queued indefinitely,
> > > > > including packet sockets.
> > > > >
> > > > > If we enforce that for these tx timestamps too, then that also
> > > > > solves this issue.
> > > > >
> > > > > A process that wants efficient MSG_ZEROCOPY will have to request
> > > > > timestamping with SOF_TIMESTAMPING_OPT_TSONLY to avoid returning the
> > > > > data along with the timestamp.
> > > >
> > > > Actually, my first attempt was similar to this that avoids skb_clone()
> > > > silently if MSG_ZEROCOPY, but this kind of way could break users who
> > > > were using tstamp and just added MSG_ZEROCOPY logic to their app, so
> > > > I placed skb_queue_purge() during close().
> > > >
> > > > ---8<---
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index eb7d33b41e71..9318b438888e 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -5135,7 +5149,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> > > >  	if (!skb_may_tx_timestamp(sk, tsonly))
> > > >  		return;
> > > >
> > > > -	if (tsonly) {
> > > > +	if (tsonly || skb_zcopy(orig_skb)) {
> > > >  #ifdef CONFIG_INET
> > > >  		if ((sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
> > > >  		    sk_is_tcp(sk)) {
> > > > ---8<---
> > >
> > > Actually, the skb_clone in __skb_tstamp_tx should already release
> > > the reference on the ubuf.
> > >
> > > With the same mechanism that we rely on for packet sockets, e.g.,
> > > in dev_queue_xmit_nit.
> > >
> > > skb_clone calls skb_orphan_frags calls skb_copy_ubufs for zerocopy
> > > skbs. Which creates a copy of the data and calls skb_zcopy_clear.
> > >
> > > The skb that gets queued onto the error queue should not have a
> > > reference on an ubuf: skb_zcopy(skb) should return NULL.
> >
> > Exactly, so how about this ?
> >
> > ---8<---
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 768f9d04911f..0fa0b2ac7071 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5166,6 +5166,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> >  	if (!skb)
> >  		return;
> >
> > +	if (skb_zcopy(skb) && skb_copy_ubufs(skb, GFP_ATOMIC))
> > +		return;
> > +
> >  	if (tsonly) {
> >  		skb_shinfo(skb)->tx_flags |= skb_shinfo(orig_skb)->tx_flags &
> >  					     SKBTX_ANY_TSTAMP;
> > ---8<---
> >
>
> What I meant was that given this I don't understand how a packet
> with ubuf references gets queued at all.
>
> __skb_tstamp_tx does not queue orig_skb. It either allocates a new
> skb or calls skb = skb_clone(orig_skb).
>
> That existing call internally calls skb_orphan_frags and
> skb_copy_ubufs.
No, skb_orphan_frags() does not call skb_copy_ubufs() here because
msg_zerocopy_alloc() sets SKBFL_DONT_ORPHAN for orig_skb.
So, we need to call skb_copy_ubufs() explicitly if skb_zcopy(skb).
>
> So the extra test should not be needed. Indeed I would be surprised if
> this triggers:
And this actually triggers.
>
> @@ -5164,6 +5164,8 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
>  	if (!skb)
>  		return;
>
> +	WARN_ON_ONCE(skb_zcopy(skb));
> +
>
>  	if (tsonly) {
>  		skb_shinfo(skb)->tx_flags |= skb_shinfo(orig_skb)->tx_flags &
>  					     SKBTX_ANY_TSTAMP;
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-20 3:30 ` Kuniyuki Iwashima
@ 2023-04-20 13:14 ` Willem de Bruijn
2023-04-20 16:46 ` Kuniyuki Iwashima
0 siblings, 1 reply; 14+ messages in thread
From: Willem de Bruijn @ 2023-04-20 13:14 UTC (permalink / raw)
To: Kuniyuki Iwashima, willemdebruijn.kernel
Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni,
syzkaller, willemb
> > > > Actually, the skb_clone in __skb_tstamp_tx should already release
> > > > the reference on the ubuf.
> > > >
> > > > With the same mechanism that we rely on for packet sockets, e.g.,
> > > > in dev_queue_xmit_nit.
> > > >
> > > > skb_clone calls skb_orphan_frags calls skb_copy_ubufs for zerocopy
> > > > skbs. Which creates a copy of the data and calls skb_zcopy_clear.
> > > >
> > > > The skb that gets queued onto the error queue should not have a
> > > > reference on an ubuf: skb_zcopy(skb) should return NULL.
> > >
> > > Exactly, so how about this ?
> > >
> > > ---8<---
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 768f9d04911f..0fa0b2ac7071 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5166,6 +5166,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> > >  	if (!skb)
> > >  		return;
> > >
> > > +	if (skb_zcopy(skb) && skb_copy_ubufs(skb, GFP_ATOMIC))
> > > +		return;
> > > +
> > >  	if (tsonly) {
> > >  		skb_shinfo(skb)->tx_flags |= skb_shinfo(orig_skb)->tx_flags &
> > >  					     SKBTX_ANY_TSTAMP;
> > > ---8<---
> > >
> >
> > What I meant was that given this I don't understand how a packet
> > with ubuf references gets queued at all.
> >
> > __skb_tstamp_tx does not queue orig_skb. It either allocates a new
> > skb or calls skb = skb_clone(orig_skb).
> >
> > That existing call internally calls skb_orphan_frags and
> > skb_copy_ubufs.
>
> No, skb_orphan_frags() does not call skb_copy_ubufs() here because
> msg_zerocopy_alloc() sets SKBFL_DONT_ORPHAN for orig_skb.
>
> So, we need to call skb_copy_ubufs() explicitly if skb_zcopy(skb).
>
> >
> > So the extra test should not be needed. Indeed I would be surprised if
> > this triggers:
>
> And this actually triggers.
Oh right, I confused skb_orphan_frags and skb_orphan_frags_rx.
We need to add a call to that, the same approach used for looping in
__netif_receive_skb_core and packet sockets in deliver_skb and
dev_queue_xmit_nit.
@@ -5160,6 +5160,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		skb = alloc_skb(0, GFP_ATOMIC);
 	} else {
 		skb = skb_clone(orig_skb, GFP_ATOMIC);
+
+		if (skb_orphan_frags_rx(skb, GFP_ATOMIC))
+			return;
 	}
 	if (!skb)
 		return;
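[Editorial note: one caveat in the sketch above, worth checking when it is picked up for v3 — `skb_clone()` can return NULL, and the clone is leaked if `skb_orphan_frags_rx()` fails. A guarded shape might look like the following (an illustrative rearrangement, not the posted patch):]

```c
	} else {
		skb = skb_clone(orig_skb, GFP_ATOMIC);
		/* skb may be NULL here; check it before touching its
		 * frags, and free the clone if the ubuf copy fails. */
		if (skb && skb_orphan_frags_rx(skb, GFP_ATOMIC)) {
			kfree_skb(skb);
			return;
		}
	}
	if (!skb)
		return;
```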
* Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
2023-04-20 13:14 ` Willem de Bruijn
@ 2023-04-20 16:46 ` Kuniyuki Iwashima
0 siblings, 0 replies; 14+ messages in thread
From: Kuniyuki Iwashima @ 2023-04-20 16:46 UTC (permalink / raw)
To: willemdebruijn.kernel
Cc: davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni,
syzkaller, willemb
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Thu, 20 Apr 2023 09:14:58 -0400
> > > > > Actually, the skb_clone in __skb_tstamp_tx should already release
> > > > > the reference on the ubuf.
> > > > >
> > > > > With the same mechanism that we rely on for packet sockets, e.g.,
> > > > > in dev_queue_xmit_nit.
> > > > >
> > > > > skb_clone calls skb_orphan_frags calls skb_copy_ubufs for zerocopy
> > > > > skbs. Which creates a copy of the data and calls skb_zcopy_clear.
> > > > >
> > > > > The skb that gets queued onto the error queue should not have a
> > > > > reference on an ubuf: skb_zcopy(skb) should return NULL.
> > > >
> > > > Exactly, so how about this ?
> > > >
> > > > ---8<---
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index 768f9d04911f..0fa0b2ac7071 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -5166,6 +5166,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> > > > if (!skb)
> > > > return;
> > > >
> > > > + if (skb_zcopy(skb) && skb_copy_ubufs(skb, GFP_ATOMIC))
> > > > + return;
> > > > +
> > > > if (tsonly) {
> > > > skb_shinfo(skb)->tx_flags |= skb_shinfo(orig_skb)->tx_flags &
> > > > SKBTX_ANY_TSTAMP;
> > > > ---8<---
> > > >
> > >
> > > What I meant was that given this I don't understand how a packet
> > > with ubuf references gets queued at all.
> > >
> > > __skb_tstamp_tx does not queue orig_skb. It either allocates a new
> > > skb or calls skb = skb_clone(orig_skb).
> > >
> > > That existing call internally calls skb_orphan_frags and
> > > skb_copy_ubufs.
> >
> > No, skb_orphan_frags() does not call skb_copy_ubufs() here because
> > msg_zerocopy_alloc() sets SKBFL_DONT_ORPHAN for orig_skb.
> >
> > So, we need to call skb_copy_ubufs() explicitly if skb_zcopy(skb).
> >
> > >
> > > So the extra test should not be needed. Indeed I would be surprised if
> > > this triggers:
> >
> > And this actually triggers.
>
> Oh right, I confused skb_orphan_frags and skb_orphan_frags_rx.
>
> We need to add a call to that, the same approach used for looping in
> __netif_receive_skb_core and packet sockets in deliver_skb and
> dev_queue_xmit_nit.
I missed skb_orphan_frags_rx() defined just below skb_orphan_frags(),
which does exactly what I want :)
Will post v3, thanks!
>
> @@ -5160,6 +5160,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
>  		skb = alloc_skb(0, GFP_ATOMIC);
>  	} else {
>  		skb = skb_clone(orig_skb, GFP_ATOMIC);
> +
> +		if (skb_orphan_frags_rx(skb, GFP_ATOMIC))
> +			return;
>  	}
>  	if (!skb)
>  		return;
end of thread, other threads:[~2023-04-20 16:46 UTC | newest]
Thread overview: 14+ messages
2023-04-18 18:08 [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp Kuniyuki Iwashima
2023-04-18 18:33 ` Eric Dumazet
2023-04-18 18:44 ` Kuniyuki Iwashima
2023-04-18 19:04 ` Eric Dumazet
2023-04-18 19:25 ` Kuniyuki Iwashima
2023-04-19 9:34 ` Eric Dumazet
2023-04-19 14:16 ` Willem de Bruijn
2023-04-19 17:07 ` Kuniyuki Iwashima
2023-04-19 17:50 ` Willem de Bruijn
2023-04-19 22:43 ` Kuniyuki Iwashima
2023-04-20 2:20 ` Willem de Bruijn
2023-04-20 3:30 ` Kuniyuki Iwashima
2023-04-20 13:14 ` Willem de Bruijn
2023-04-20 16:46 ` Kuniyuki Iwashima