* [PATCH v5 net-next 0/5] af_unix: Random improvements for GC.
@ 2024-01-23 17:08 Kuniyuki Iwashima
  2024-01-23 17:08 ` [PATCH v5 net-next 1/5] af_unix: Annotate data-race of gc_in_progress in wait_for_unix_gc() Kuniyuki Iwashima
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Kuniyuki Iwashima @ 2024-01-23 17:08 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Ivan Babrou, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

If more than 16000 inflight AF_UNIX sockets exist on a host, every
sendmsg() is forced to wait for unix_gc() even if the process is not
sending any FD.

This series tries not to impose such a penalty on sane users who
do not send AF_UNIX FDs or who have fewer than SCM_MAX_FD * 8
sockets inflight.
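
For reference, a socket becomes "inflight" when its fd is queued on
another AF_UNIX socket via SCM_RIGHTS and has not been received yet.
A minimal user-space sketch (hypothetical illustration, not part of
the series):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  static void make_one_inflight(void)
  {
      int sk[2];
      char byte = 'x';
      char cbuf[CMSG_SPACE(sizeof(int))] = { 0 };
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
      };
      struct cmsghdr *cmsg;

      socketpair(AF_UNIX, SOCK_DGRAM, 0, sk);

      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &sk[0], sizeof(int));

      /* sk[0] is now inflight until the fd is received from sk[1] */
      sendmsg(sk[1], &msg, 0);
  }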

The first patch can be backported to -stable.

Cleanup patches for commit 69db702c8387 ("io_uring/af_unix: disable
sending io_uring over sockets") and a larger refactoring of the GC
will follow later.


Changes:
  v5:
    * Rebase on the latest
    * Add patch 1

  v4: https://lore.kernel.org/netdev/20231219030102.27509-1-kuniyu@amazon.com/
    * Rebase on the latest

  v3: https://lore.kernel.org/netdev/20231218075020.60826-1-kuniyu@amazon.com/
    * Patch 3
      * Reuse gc_in_progress flag.
      * Call flush_work() only when gc is queued or in progress.
    * Patch 4
      * Bump UNIX_INFLIGHT_SANE_USER to (SCM_MAX_FD * 8).

  v2: https://lore.kernel.org/netdev/20231123014747.66063-1-kuniyu@amazon.com/
    * Patch 4
      * Fix build error when CONFIG_UNIX=n

  v1: https://lore.kernel.org/netdev/20231122013629.28554-1-kuniyu@amazon.com/


Kuniyuki Iwashima (5):
  af_unix: Annotate data-race of gc_in_progress in wait_for_unix_gc().
  af_unix: Do not use atomic ops for unix_sk(sk)->inflight.
  af_unix: Return struct unix_sock from unix_get_socket().
  af_unix: Run GC on only one CPU.
  af_unix: Try to run GC async.

 include/net/af_unix.h | 14 +++++--
 include/net/scm.h     |  1 +
 net/core/scm.c        |  5 +++
 net/unix/af_unix.c    | 10 +++--
 net/unix/garbage.c    | 98 ++++++++++++++++++++++---------------------
 net/unix/scm.c        | 27 ++++++------
 6 files changed, 85 insertions(+), 70 deletions(-)

-- 
2.30.2



* [PATCH v5 net-next 1/5] af_unix: Annotate data-race of gc_in_progress in wait_for_unix_gc().
  2024-01-23 17:08 [PATCH v5 net-next 0/5] af_unix: Random improvements for GC Kuniyuki Iwashima
@ 2024-01-23 17:08 ` Kuniyuki Iwashima
  2024-01-23 17:08 ` [PATCH v5 net-next 2/5] af_unix: Do not use atomic ops for unix_sk(sk)->inflight Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Kuniyuki Iwashima @ 2024-01-23 17:08 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Ivan Babrou, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

gc_in_progress is changed under spin_lock(&unix_gc_lock),
but wait_for_unix_gc() reads it locklessly.

Let's use READ_ONCE().
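
For context, the write/read pairing looks roughly like this (a sketch
of the existing code paths, not new logic):

  /* writer side: unix_gc(), under spin_lock(&unix_gc_lock) */
  WRITE_ONCE(gc_in_progress, true);
  /* ... run the collection ... */
  WRITE_ONCE(gc_in_progress, false);

  /* reader side: wait_for_unix_gc(), no lock held */
  if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
      !READ_ONCE(gc_in_progress))
          unix_gc();
  wait_event(unix_gc_wait, !READ_ONCE(gc_in_progress));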

Fixes: 5f23b734963e ("net: Fix soft lockups/OOM issues w/ unix garbage collector")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 2405f0f9af31..af3d2221cf6a 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -198,7 +198,7 @@ void wait_for_unix_gc(void)
 	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
 	    !READ_ONCE(gc_in_progress))
 		unix_gc();
-	wait_event(unix_gc_wait, gc_in_progress == false);
+	wait_event(unix_gc_wait, !READ_ONCE(gc_in_progress));
 }
 
 /* The external entry point: unix_gc() */
-- 
2.30.2



* [PATCH v5 net-next 2/5] af_unix: Do not use atomic ops for unix_sk(sk)->inflight.
  2024-01-23 17:08 [PATCH v5 net-next 0/5] af_unix: Random improvements for GC Kuniyuki Iwashima
  2024-01-23 17:08 ` [PATCH v5 net-next 1/5] af_unix: Annotate data-race of gc_in_progress in wait_for_unix_gc() Kuniyuki Iwashima
@ 2024-01-23 17:08 ` Kuniyuki Iwashima
  2024-01-23 17:08 ` [PATCH v5 net-next 3/5] af_unix: Return struct unix_sock from unix_get_socket() Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Kuniyuki Iwashima @ 2024-01-23 17:08 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Ivan Babrou, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev,
	Simon Horman

Every access to unix_sk(sk)->inflight is made under
spin_lock(&unix_gc_lock), so the atomic operations are unnecessary.

Let's convert unix_sk(sk)->inflight to a plain unsigned long.
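
A minimal sketch of the invariant this relies on (illustration only,
not taken from the tree):

  spin_lock(&unix_gc_lock);
  u->inflight++;      /* plain access is safe: unix_gc_lock serializes
                       * every reader and writer of ->inflight
                       */
  spin_unlock(&unix_gc_lock);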

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
---
 include/net/af_unix.h |  2 +-
 net/unix/af_unix.c    |  4 ++--
 net/unix/garbage.c    | 17 ++++++++---------
 net/unix/scm.c        |  8 +++++---
 4 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 49c4640027d8..ac38b63db554 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,7 +61,7 @@ struct unix_sock {
 	struct mutex		iolock, bindlock;
 	struct sock		*peer;
 	struct list_head	link;
-	atomic_long_t		inflight;
+	unsigned long		inflight;
 	spinlock_t		lock;
 	unsigned long		gc_flags;
 #define UNIX_GC_CANDIDATE	0
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index ac1f2bc18fc9..1e9378036dcc 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -993,11 +993,11 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern,
 	sk->sk_write_space	= unix_write_space;
 	sk->sk_max_ack_backlog	= net->unx.sysctl_max_dgram_qlen;
 	sk->sk_destruct		= unix_sock_destructor;
-	u	  = unix_sk(sk);
+	u = unix_sk(sk);
+	u->inflight = 0;
 	u->path.dentry = NULL;
 	u->path.mnt = NULL;
 	spin_lock_init(&u->lock);
-	atomic_long_set(&u->inflight, 0);
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->iolock); /* single task reading lock */
 	mutex_init(&u->bindlock); /* single task binding lock */
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index af3d2221cf6a..051f96a3ab02 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -166,17 +166,18 @@ static void scan_children(struct sock *x, void (*func)(struct unix_sock *),
 
 static void dec_inflight(struct unix_sock *usk)
 {
-	atomic_long_dec(&usk->inflight);
+	usk->inflight--;
 }
 
 static void inc_inflight(struct unix_sock *usk)
 {
-	atomic_long_inc(&usk->inflight);
+	usk->inflight++;
 }
 
 static void inc_inflight_move_tail(struct unix_sock *u)
 {
-	atomic_long_inc(&u->inflight);
+	u->inflight++;
+
 	/* If this still might be part of a cycle, move it to the end
 	 * of the list, so that it's checked even if it was already
 	 * passed over
@@ -237,14 +238,12 @@ void unix_gc(void)
 	 */
 	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
 		long total_refs;
-		long inflight_refs;
 
 		total_refs = file_count(u->sk.sk_socket->file);
-		inflight_refs = atomic_long_read(&u->inflight);
 
-		BUG_ON(inflight_refs < 1);
-		BUG_ON(total_refs < inflight_refs);
-		if (total_refs == inflight_refs) {
+		BUG_ON(!u->inflight);
+		BUG_ON(total_refs < u->inflight);
+		if (total_refs == u->inflight) {
 			list_move_tail(&u->link, &gc_candidates);
 			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
 			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
@@ -271,7 +270,7 @@ void unix_gc(void)
 		/* Move cursor to after the current position. */
 		list_move(&cursor, &u->link);
 
-		if (atomic_long_read(&u->inflight) > 0) {
+		if (u->inflight) {
 			list_move_tail(&u->link, &not_cycle_list);
 			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
 			scan_children(&u->sk, inc_inflight_move_tail, NULL);
diff --git a/net/unix/scm.c b/net/unix/scm.c
index 822ce0d0d791..e92f2fad6410 100644
--- a/net/unix/scm.c
+++ b/net/unix/scm.c
@@ -53,12 +53,13 @@ void unix_inflight(struct user_struct *user, struct file *fp)
 	if (s) {
 		struct unix_sock *u = unix_sk(s);
 
-		if (atomic_long_inc_return(&u->inflight) == 1) {
+		if (!u->inflight) {
 			BUG_ON(!list_empty(&u->link));
 			list_add_tail(&u->link, &gc_inflight_list);
 		} else {
 			BUG_ON(list_empty(&u->link));
 		}
+		u->inflight++;
 		/* Paired with READ_ONCE() in wait_for_unix_gc() */
 		WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + 1);
 	}
@@ -75,10 +76,11 @@ void unix_notinflight(struct user_struct *user, struct file *fp)
 	if (s) {
 		struct unix_sock *u = unix_sk(s);
 
-		BUG_ON(!atomic_long_read(&u->inflight));
+		BUG_ON(!u->inflight);
 		BUG_ON(list_empty(&u->link));
 
-		if (atomic_long_dec_and_test(&u->inflight))
+		u->inflight--;
+		if (!u->inflight)
 			list_del_init(&u->link);
 		/* Paired with READ_ONCE() in wait_for_unix_gc() */
 		WRITE_ONCE(unix_tot_inflight, unix_tot_inflight - 1);
-- 
2.30.2



* [PATCH v5 net-next 3/5] af_unix: Return struct unix_sock from unix_get_socket().
  2024-01-23 17:08 [PATCH v5 net-next 0/5] af_unix: Random improvements for GC Kuniyuki Iwashima
  2024-01-23 17:08 ` [PATCH v5 net-next 1/5] af_unix: Annotate data-race of gc_in_progress in wait_for_unix_gc() Kuniyuki Iwashima
  2024-01-23 17:08 ` [PATCH v5 net-next 2/5] af_unix: Do not use atomic ops for unix_sk(sk)->inflight Kuniyuki Iwashima
@ 2024-01-23 17:08 ` Kuniyuki Iwashima
  2024-01-23 17:08 ` [PATCH v5 net-next 4/5] af_unix: Run GC on only one CPU Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Kuniyuki Iwashima @ 2024-01-23 17:08 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Ivan Babrou, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev,
	Pavel Begunkov, Simon Horman

Currently, unix_get_socket() returns struct sock, but every caller
immediately converts it to struct unix_sock with unix_sk().

Let's return struct unix_sock from unix_get_socket().
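
In caller terms, the change boils down to the following (sketch; the
actual call sites are in the diff below):

  /* before */
  struct sock *sk = unix_get_socket(fp);
  if (sk) {
          struct unix_sock *u = unix_sk(sk);
          /* ... */
  }

  /* after */
  struct unix_sock *u = unix_get_socket(fp);
  if (u) {
          /* ... */
  }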

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c    | 19 +++++++------------
 net/unix/scm.c        | 19 +++++++------------
 3 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index ac38b63db554..2c98ef95017b 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -14,7 +14,7 @@ void unix_destruct_scm(struct sk_buff *skb);
 void io_uring_destruct_scm(struct sk_buff *skb);
 void unix_gc(void);
 void wait_for_unix_gc(void);
-struct sock *unix_get_socket(struct file *filp);
+struct unix_sock *unix_get_socket(struct file *filp);
 struct sock *unix_peer_get(struct sock *sk);
 
 #define UNIX_HASH_MOD	(256 - 1)
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 051f96a3ab02..d5ef46a75f1f 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -105,20 +105,15 @@ static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 
 			while (nfd--) {
 				/* Get the socket the fd matches if it indeed does so */
-				struct sock *sk = unix_get_socket(*fp++);
+				struct unix_sock *u = unix_get_socket(*fp++);
 
-				if (sk) {
-					struct unix_sock *u = unix_sk(sk);
+				/* Ignore non-candidates, they could have been added
+				 * to the queues after starting the garbage collection
+				 */
+				if (u && test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) {
+					hit = true;
 
-					/* Ignore non-candidates, they could
-					 * have been added to the queues after
-					 * starting the garbage collection
-					 */
-					if (test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) {
-						hit = true;
-
-						func(u);
-					}
+					func(u);
 				}
 			}
 			if (hit && hitlist != NULL) {
diff --git a/net/unix/scm.c b/net/unix/scm.c
index e92f2fad6410..b5ae5ab16777 100644
--- a/net/unix/scm.c
+++ b/net/unix/scm.c
@@ -21,9 +21,8 @@ EXPORT_SYMBOL(gc_inflight_list);
 DEFINE_SPINLOCK(unix_gc_lock);
 EXPORT_SYMBOL(unix_gc_lock);
 
-struct sock *unix_get_socket(struct file *filp)
+struct unix_sock *unix_get_socket(struct file *filp)
 {
-	struct sock *u_sock = NULL;
 	struct inode *inode = file_inode(filp);
 
 	/* Socket ? */
@@ -34,10 +33,10 @@ struct sock *unix_get_socket(struct file *filp)
 
 		/* PF_UNIX ? */
 		if (s && ops && ops->family == PF_UNIX)
-			u_sock = s;
+			return unix_sk(s);
 	}
 
-	return u_sock;
+	return NULL;
 }
 EXPORT_SYMBOL(unix_get_socket);
 
@@ -46,13 +45,11 @@ EXPORT_SYMBOL(unix_get_socket);
  */
 void unix_inflight(struct user_struct *user, struct file *fp)
 {
-	struct sock *s = unix_get_socket(fp);
+	struct unix_sock *u = unix_get_socket(fp);
 
 	spin_lock(&unix_gc_lock);
 
-	if (s) {
-		struct unix_sock *u = unix_sk(s);
-
+	if (u) {
 		if (!u->inflight) {
 			BUG_ON(!list_empty(&u->link));
 			list_add_tail(&u->link, &gc_inflight_list);
@@ -69,13 +66,11 @@ void unix_inflight(struct user_struct *user, struct file *fp)
 
 void unix_notinflight(struct user_struct *user, struct file *fp)
 {
-	struct sock *s = unix_get_socket(fp);
+	struct unix_sock *u = unix_get_socket(fp);
 
 	spin_lock(&unix_gc_lock);
 
-	if (s) {
-		struct unix_sock *u = unix_sk(s);
-
+	if (u) {
 		BUG_ON(!u->inflight);
 		BUG_ON(list_empty(&u->link));
 
-- 
2.30.2



* [PATCH v5 net-next 4/5] af_unix: Run GC on only one CPU.
  2024-01-23 17:08 [PATCH v5 net-next 0/5] af_unix: Random improvements for GC Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2024-01-23 17:08 ` [PATCH v5 net-next 3/5] af_unix: Return struct unix_sock from unix_get_socket() Kuniyuki Iwashima
@ 2024-01-23 17:08 ` Kuniyuki Iwashima
  2024-01-23 17:08 ` [PATCH v5 net-next 5/5] af_unix: Try to run GC async Kuniyuki Iwashima
  2024-01-27  5:10 ` [PATCH v5 net-next 0/5] af_unix: Random improvements for GC patchwork-bot+netdevbpf
  5 siblings, 0 replies; 7+ messages in thread
From: Kuniyuki Iwashima @ 2024-01-23 17:08 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Ivan Babrou, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

If more than 16000 inflight AF_UNIX sockets exist and the garbage
collector is not running, unix_(dgram|stream)_sendmsg() calls unix_gc()
and then waits for it to complete.

In unix_gc(), all inflight AF_UNIX sockets are traversed at least once,
and more than once if they are GC candidates.  Thus, sendmsg()
significantly slows down when too many inflight AF_UNIX sockets exist.

There is also a small window in which multiple unix_gc() instances can
be invoked; all but one will then be blocked on the same spinlock.

Let's convert unix_gc() to use struct work so that it will not consume
CPUs unnecessarily.
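
The workqueue pattern used here, in rough outline (the actual code is
in the diff below):

  static void __unix_gc(struct work_struct *work);
  static DECLARE_WORK(unix_gc_work, __unix_gc);

  /* producers only queue the work; the same work item never runs
   * concurrently with itself, so at most one GC runs at a time
   */
  queue_work(system_unbound_wq, &unix_gc_work);

  /* callers that must observe a finished GC wait for the work */
  flush_work(&unix_gc_work);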

Note that WRITE_ONCE(gc_in_progress, true) is moved to before the GC
actually runs.  If we left the WRITE_ONCE() where it was and used the
following test to decide whether to call flush_work(), a process could
miss the call:

    CPU 0                                     CPU 1
    ---                                       ---
                                              start work and call __unix_gc()
    if (work_pending(&unix_gc_work) ||        <-- false
        READ_ONCE(gc_in_progress))            <-- false
            flush_work();                     <-- missed!
                                              WRITE_ONCE(gc_in_progress, true)

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 54 +++++++++++++++++++++++-----------------------
 1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index d5ef46a75f1f..3f599db59f9d 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -86,7 +86,6 @@
 /* Internal data structures and random procedures: */
 
 static LIST_HEAD(gc_candidates);
-static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
 
 static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 			  struct sk_buff_head *hitlist)
@@ -182,23 +181,8 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 }
 
 static bool gc_in_progress;
-#define UNIX_INFLIGHT_TRIGGER_GC 16000
-
-void wait_for_unix_gc(void)
-{
-	/* If number of inflight sockets is insane,
-	 * force a garbage collect right now.
-	 * Paired with the WRITE_ONCE() in unix_inflight(),
-	 * unix_notinflight() and gc_in_progress().
-	 */
-	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
-	    !READ_ONCE(gc_in_progress))
-		unix_gc();
-	wait_event(unix_gc_wait, !READ_ONCE(gc_in_progress));
-}
 
-/* The external entry point: unix_gc() */
-void unix_gc(void)
+static void __unix_gc(struct work_struct *work)
 {
 	struct sk_buff *next_skb, *skb;
 	struct unix_sock *u;
@@ -209,13 +193,6 @@ void unix_gc(void)
 
 	spin_lock(&unix_gc_lock);
 
-	/* Avoid a recursive GC. */
-	if (gc_in_progress)
-		goto out;
-
-	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
-	WRITE_ONCE(gc_in_progress, true);
-
 	/* First, select candidates for garbage collection.  Only
 	 * in-flight sockets are considered, and from those only ones
 	 * which don't have any external reference.
@@ -322,8 +299,31 @@ void unix_gc(void)
 	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
 	WRITE_ONCE(gc_in_progress, false);
 
-	wake_up(&unix_gc_wait);
-
- out:
 	spin_unlock(&unix_gc_lock);
 }
+
+static DECLARE_WORK(unix_gc_work, __unix_gc);
+
+void unix_gc(void)
+{
+	WRITE_ONCE(gc_in_progress, true);
+	queue_work(system_unbound_wq, &unix_gc_work);
+}
+
+#define UNIX_INFLIGHT_TRIGGER_GC 16000
+
+void wait_for_unix_gc(void)
+{
+	/* If number of inflight sockets is insane,
+	 * force a garbage collect right now.
+	 *
+	 * Paired with the WRITE_ONCE() in unix_inflight(),
+	 * unix_notinflight(), and __unix_gc().
+	 */
+	if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&
+	    !READ_ONCE(gc_in_progress))
+		unix_gc();
+
+	if (READ_ONCE(gc_in_progress))
+		flush_work(&unix_gc_work);
+}
-- 
2.30.2



* [PATCH v5 net-next 5/5] af_unix: Try to run GC async.
  2024-01-23 17:08 [PATCH v5 net-next 0/5] af_unix: Random improvements for GC Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2024-01-23 17:08 ` [PATCH v5 net-next 4/5] af_unix: Run GC on only one CPU Kuniyuki Iwashima
@ 2024-01-23 17:08 ` Kuniyuki Iwashima
  2024-01-27  5:10 ` [PATCH v5 net-next 0/5] af_unix: Random improvements for GC patchwork-bot+netdevbpf
  5 siblings, 0 replies; 7+ messages in thread
From: Kuniyuki Iwashima @ 2024-01-23 17:08 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Ivan Babrou, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

If more than 16000 inflight AF_UNIX sockets exist and the garbage
collector is not running, unix_(dgram|stream)_sendmsg() calls unix_gc()
and then waits for it to complete.

In unix_gc(), all inflight AF_UNIX sockets are traversed at least once,
and more than once if they are GC candidates.  Thus, sendmsg()
significantly slows down when too many inflight AF_UNIX sockets exist.

However, if a process sends data with no AF_UNIX FD, the sendmsg() call
does not need to wait for GC.  After this change, only a process that
meets both of the conditions below will be blocked in such a situation:

  1) the cmsg contains an AF_UNIX socket FD
  2) SCM_MAX_FD * 8 or more AF_UNIX sockets sent by the same user are
     still inflight

Note that even a sendmsg() call that does not meet both conditions but
carries an AF_UNIX FD will still be blocked later in unix_scm_to_skb()
by the spinlock, but we allow that as a bonus for sane users.
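
As a rough user-space illustration (hypothetical example, not from the
series), only the second call below can end up waiting for the GC:

  /* 1) plain data, no SCM_RIGHTS: never waits for the GC */
  send(fd, buf, len, 0);

  /* 2) SCM_RIGHTS carrying an AF_UNIX socket fd: may still wait if this
   *    user already has SCM_MAX_FD * 8 or more sockets inflight
   */
  sendmsg(fd, &msg_with_unix_fd, 0);   /* msg_with_unix_fd: hypothetical msghdr */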

The results below show the time spent in unix_dgram_sendmsg() when
sending 1 byte of data with no FD 4096 times on a host where 32K
inflight AF_UNIX sockets exist.

Without the series: the sane sendmsg() has to wait for the GC unreasonably.

  $ sudo /usr/share/bcc/tools/funclatency -p 11165 unix_dgram_sendmsg
  Tracing 1 functions for "unix_dgram_sendmsg"... Hit Ctrl-C to end.
  ^C
       nsecs               : count     distribution
  [...]
      524288 -> 1048575    : 0        |                                        |
     1048576 -> 2097151    : 3881     |****************************************|
     2097152 -> 4194303    : 214      |**                                      |
     4194304 -> 8388607    : 1        |                                        |

  avg = 1825567 nsecs, total: 7477526027 nsecs, count: 4096

With the series: the sane sendmsg() finishes much faster.

  $ sudo /usr/share/bcc/tools/funclatency -p 8702  unix_dgram_sendmsg
  Tracing 1 functions for "unix_dgram_sendmsg"... Hit Ctrl-C to end.
  ^C
       nsecs               : count     distribution
  [...]
         128 -> 255        : 0        |                                        |
         256 -> 511        : 4092     |****************************************|
         512 -> 1023       : 2        |                                        |
        1024 -> 2047       : 0        |                                        |
        2048 -> 4095       : 0        |                                        |
        4096 -> 8191       : 1        |                                        |
        8192 -> 16383      : 1        |                                        |

  avg = 410 nsecs, total: 1680510 nsecs, count: 4096

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/af_unix.h | 12 ++++++++++--
 include/net/scm.h     |  1 +
 net/core/scm.c        |  5 +++++
 net/unix/af_unix.c    |  6 ++++--
 net/unix/garbage.c    | 10 +++++++++-
 5 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 2c98ef95017b..f045bbd9017d 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -8,13 +8,21 @@
 #include <linux/refcount.h>
 #include <net/sock.h>
 
+#if IS_ENABLED(CONFIG_UNIX)
+struct unix_sock *unix_get_socket(struct file *filp);
+#else
+static inline struct unix_sock *unix_get_socket(struct file *filp)
+{
+	return NULL;
+}
+#endif
+
 void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_destruct_scm(struct sk_buff *skb);
 void io_uring_destruct_scm(struct sk_buff *skb);
 void unix_gc(void);
-void wait_for_unix_gc(void);
-struct unix_sock *unix_get_socket(struct file *filp);
+void wait_for_unix_gc(struct scm_fp_list *fpl);
 struct sock *unix_peer_get(struct sock *sk);
 
 #define UNIX_HASH_MOD	(256 - 1)
diff --git a/include/net/scm.h b/include/net/scm.h
index cf68acec4d70..92276a2c5543 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -25,6 +25,7 @@ struct scm_creds {
 
 struct scm_fp_list {
 	short			count;
+	short			count_unix;
 	short			max;
 	struct user_struct	*user;
 	struct file		*fp[SCM_MAX_FD];
diff --git a/net/core/scm.c b/net/core/scm.c
index d0e0852a24d5..9cd4b0a01cd6 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -36,6 +36,7 @@
 #include <net/compat.h>
 #include <net/scm.h>
 #include <net/cls_cgroup.h>
+#include <net/af_unix.h>
 
 
 /*
@@ -85,6 +86,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 			return -ENOMEM;
 		*fplp = fpl;
 		fpl->count = 0;
+		fpl->count_unix = 0;
 		fpl->max = SCM_MAX_FD;
 		fpl->user = NULL;
 	}
@@ -109,6 +111,9 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 			fput(file);
 			return -EINVAL;
 		}
+		if (unix_get_socket(file))
+			fpl->count_unix++;
+
 		*fpp++ = file;
 		fpl->count++;
 	}
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 1e9378036dcc..1720419d93d6 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1923,11 +1923,12 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	long timeo;
 	int err;
 
-	wait_for_unix_gc();
 	err = scm_send(sock, msg, &scm, false);
 	if (err < 0)
 		return err;
 
+	wait_for_unix_gc(scm.fp);
+
 	err = -EOPNOTSUPP;
 	if (msg->msg_flags&MSG_OOB)
 		goto out;
@@ -2199,11 +2200,12 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 	bool fds_sent = false;
 	int data_len;
 
-	wait_for_unix_gc();
 	err = scm_send(sock, msg, &scm, false);
 	if (err < 0)
 		return err;
 
+	wait_for_unix_gc(scm.fp);
+
 	err = -EOPNOTSUPP;
 	if (msg->msg_flags & MSG_OOB) {
 #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 3f599db59f9d..4046c606f0e6 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -311,8 +311,9 @@ void unix_gc(void)
 }
 
 #define UNIX_INFLIGHT_TRIGGER_GC 16000
+#define UNIX_INFLIGHT_SANE_USER (SCM_MAX_FD * 8)
 
-void wait_for_unix_gc(void)
+void wait_for_unix_gc(struct scm_fp_list *fpl)
 {
 	/* If number of inflight sockets is insane,
 	 * force a garbage collect right now.
@@ -324,6 +325,13 @@ void wait_for_unix_gc(void)
 	    !READ_ONCE(gc_in_progress))
 		unix_gc();
 
+	/* Penalise users who want to send AF_UNIX sockets
+	 * but whose sockets have not been received yet.
+	 */
+	if (!fpl || !fpl->count_unix ||
+	    READ_ONCE(fpl->user->unix_inflight) < UNIX_INFLIGHT_SANE_USER)
+		return;
+
 	if (READ_ONCE(gc_in_progress))
 		flush_work(&unix_gc_work);
 }
-- 
2.30.2



* Re: [PATCH v5 net-next 0/5] af_unix: Random improvements for GC.
  2024-01-23 17:08 [PATCH v5 net-next 0/5] af_unix: Random improvements for GC Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2024-01-23 17:08 ` [PATCH v5 net-next 5/5] af_unix: Try to run GC async Kuniyuki Iwashima
@ 2024-01-27  5:10 ` patchwork-bot+netdevbpf
  5 siblings, 0 replies; 7+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-01-27  5:10 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, pabeni, ivan, kuni1840, netdev

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 23 Jan 2024 09:08:51 -0800 you wrote:
> If more than 16000 inflight AF_UNIX sockets exist on a host, each
> sendmsg() will be forced to wait for unix_gc() even if a process
> is not sending any FD.
> 
> This series tries not to impose such a penalty on sane users who
> do not send AF_UNIX FDs or do not have inflight sockets more than
> SCM_MAX_FD * 8.
> 
> [...]

Here is the summary with links:
  - [v5,net-next,1/5] af_unix: Annotate data-race of gc_in_progress in wait_for_unix_gc().
    https://git.kernel.org/netdev/net-next/c/31e03207119a
  - [v5,net-next,2/5] af_unix: Do not use atomic ops for unix_sk(sk)->inflight.
    https://git.kernel.org/netdev/net-next/c/97af84a6bba2
  - [v5,net-next,3/5] af_unix: Return struct unix_sock from unix_get_socket().
    https://git.kernel.org/netdev/net-next/c/5b17307bd078
  - [v5,net-next,4/5] af_unix: Run GC on only one CPU.
    https://git.kernel.org/netdev/net-next/c/8b90a9f819dc
  - [v5,net-next,5/5] af_unix: Try to run GC async.
    https://git.kernel.org/netdev/net-next/c/d9f21b361333

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html


