From: John Ericson <John.Ericson@Obsidian.Systems>
To: "David S . Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: "John Ericson" <mail@JohnEricson.me>,
"Cong Wang" <cwang@multikernel.io>,
"Kuniyuki Iwashima" <kuniyu@google.com>,
"Simon Horman" <horms@kernel.org>,
"Christian Brauner" <brauner@kernel.org>,
"David Rheinsberg" <david@readahead.eu>,
"Andy Lutomirski" <luto@kernel.org>,
"Sergei Zimmerman" <sergei@zimmerman.foo>,
netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org,
"Mickaël Salaün" <mic@digikod.net>,
"Günther Noack" <gnoack@google.com>,
"Paul Moore" <paul@paul-moore.com>,
linux-security-module@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/3] af_unix: factor out kernel_unix_connect_direct()
Date: Fri, 3 Jul 2026 03:39:43 -0400
Message-ID: <20260703073948.2541875-3-John.Ericson@Obsidian.Systems>
In-Reply-To: <20260703073948.2541875-1-John.Ericson@Obsidian.Systems>
From: John Ericson <mail@JohnEricson.me>

I had hoped this would be a simple matter of factoring out the back half
of `unix_stream_connect`. No such luck: rather than looking up the socket
in the VFS once, `unix_stream_connect` looks it up again on every pass
through the same loop that handles full listening queues.

(This behavior rather surprises me, because it allows a socket that was
deleted and recreated to be picked up on the next loop iteration. But I
don't want to make any UAPI-visible changes in this patch series, so I
did not consider changing it.)
Seeing that this was going to be more complex, I instead factored out
three helpers (setup, commit, cleanup) on a state struct, so I could
reuse them both in the existing `unix_stream_connect` and also in the
new `kernel_unix_connect_direct`. This allows each caller to implement a
slightly different loop:

- resource management of `struct sock *other`:
  - `unix_stream_connect` acquires (and releases) it.
  - `kernel_unix_connect_direct` uses the caller-provided one.
- stale `other` behavior:
  - `unix_stream_connect` retries, because on the next iteration the
    socket may have been replaced by a fresh one.
  - `kernel_unix_connect_direct` fails, because no reacquisition means
    staleness is permanent.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: John Ericson <mail@JohnEricson.me>
---
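The two caller loops described in the commit message can be sketched as a
small userspace model. This is only an illustration, not kernel code:
`struct peer`, `commit()`, `connect_by_name()`, `connect_direct()` and
`lookup_replaced()` are hypothetical stand-ins, and the "sleep" on a full
backlog is simulated by draining one entry. It assumes only what the
patch states: commit returns 0, a negative errno, or a positive retry
code.

```c
#include <errno.h>

/* Hypothetical stand-ins for UNIX_CONNECT_STALE / UNIX_CONNECT_FULL. */
enum { CONNECT_STALE = 1, CONNECT_FULL = 2 };

/* Hypothetical model of a listening peer. */
struct peer {
	int dead;	/* models SOCK_DEAD */
	int backlog;	/* queued connection requests */
	int max;	/* backlog limit */
};

/* Models the commit step: 0, a positive retry code, or a negative errno. */
static int commit(struct peer *p)
{
	if (p->dead)
		return CONNECT_STALE;
	if (p->backlog >= p->max) {
		p->backlog--;	/* pretend we slept until one request was accepted */
		return CONNECT_FULL;
	}
	p->backlog++;
	return 0;
}

/* By-name caller: every retry looks the peer up again. */
static int connect_by_name(struct peer *(*lookup)(void))
{
	for (;;) {
		struct peer *p = lookup();
		int err;

		if (!p)
			return -ECONNREFUSED;
		err = commit(p);
		if (err <= 0)
			return err;	/* success or terminal errno */
		/* CONNECT_STALE or CONNECT_FULL: look up again and retry */
	}
}

/* Fixed-peer caller: a full backlog is retried, a dead peer is terminal. */
static int connect_direct(struct peer *p)
{
	for (;;) {
		int err = commit(p);

		if (err == CONNECT_FULL)
			continue;
		if (err == CONNECT_STALE)
			return -ECONNREFUSED;
		return err;
	}
}

/* Hypothetical lookup: the first call finds a dead peer, later calls a fresh one. */
static struct peer stale_peer = { .dead = 1, .backlog = 0, .max = 1 };
static struct peer fresh_peer = { .dead = 0, .backlog = 0, .max = 1 };
static int lookups;

static struct peer *lookup_replaced(void)
{
	return lookups++ == 0 ? &stale_peer : &fresh_peer;
}
```

In this model, `connect_by_name(lookup_replaced)` succeeds on its second
lookup because the stale peer has been replaced, while
`connect_direct(&stale_peer)` fails with `-ECONNREFUSED` because nothing
can replace a fixed peer.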
include/net/af_unix.h | 1 +
net/unix/af_unix.c | 247 +++++++++++++++++++++++++++++++++---------
2 files changed, 199 insertions(+), 49 deletions(-)
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index fe4547508af1..7d810321efa3 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -15,6 +15,7 @@
#if IS_ENABLED(CONFIG_UNIX)
struct unix_sock *unix_get_socket(struct file *filp);
struct sock *unix_lookup_bsd_path(const struct path *path, int type);
+int kernel_unix_connect_direct(struct sock *other, struct socket *sock, int flags);
#else
static inline struct unix_sock *unix_get_socket(struct file *filp)
{
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3270299238c4..aa94da1f8c24 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1649,44 +1649,60 @@ static long unix_wait_for_peer(struct sock *other, long timeo)
return timeo;
}
-static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uaddr,
- int addr_len, int flags)
-{
- struct sockaddr_un *sunaddr = (struct sockaddr_un *)uaddr;
- struct sock *sk = sock->sk, *newsk = NULL, *other = NULL;
- struct unix_sock *u = unix_sk(sk), *newu, *otheru;
- struct unix_peercred peercred = {};
- struct net *net = sock_net(sk);
- struct sk_buff *skb = NULL;
- unsigned char state;
- long timeo;
- int err;
+/*
+ * The state a stream connect() builds up before it has a peer: the new
+ * sock and the connection-request skb handed to the listener, the
+ * connecting side's credentials and its send timeout.
+ *
+ * - Built once by unix_stream_connect_setup()
+ * - Used to finish connecting by unix_stream_connect_commit()
+ * - Cleaned up in the failure case by unix_stream_connect_cleanup()
+ */
+struct unix_connect_state {
+ struct sock *newsk;
+ struct sk_buff *skb;
+ struct unix_peercred peercred;
+ long timeo;
+};
- err = unix_validate_addr(sunaddr, addr_len);
- if (err)
- goto out;
+/* Free a connect state that no connection consumed (i.e. on failure). */
+static void unix_stream_connect_cleanup(struct unix_connect_state *st)
+{
+ consume_skb(st->skb);
+ unix_release_sock(st->newsk, 0);
+ drop_peercred(&st->peercred);
+}
- err = BPF_CGROUP_RUN_PROG_UNIX_CONNECT_LOCK(sk, uaddr, &addr_len);
- if (err)
- goto out;
+/*
+ * Build the state a stream connect needs before it looks for a peer:
+ * autobind if required, snapshot the send timeout, and allocate the new
+ * sock, the request skb and the peer credentials. On failure nothing is
+ * left allocated in @st.
+ */
+static int unix_stream_connect_setup(struct socket *sock, int flags,
+ struct unix_connect_state *st)
+{
+ struct sock *sk = sock->sk, *newsk;
+ struct sk_buff *skb;
+ int err;
- if (unix_may_passcred(sk) && !READ_ONCE(u->addr)) {
+ if (unix_may_passcred(sk) && !READ_ONCE(unix_sk(sk)->addr)) {
err = unix_autobind(sk);
if (err)
- goto out;
+ return err;
}
- timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
+ st->timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
- err = prepare_peercred(&peercred);
+ err = prepare_peercred(&st->peercred);
if (err)
- goto out;
+ return err;
/* create new sock for complete connection */
- newsk = unix_create1(net, NULL, 0, sock->type);
+ newsk = unix_create1(sock_net(sk), NULL, 0, sock->type);
if (IS_ERR(newsk)) {
err = PTR_ERR(newsk);
- goto out;
+ goto out_drop;
}
/* Allocate skb for sending to listening sock */
@@ -1696,21 +1712,56 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uad
goto out_free_sk;
}
-restart:
- /* Find listening sock. */
- other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags);
- if (IS_ERR(other)) {
- err = PTR_ERR(other);
- goto out_free_skb;
- }
+ st->newsk = newsk;
+ st->skb = skb;
+ return 0;
+
+out_free_sk:
+ unix_release_sock(newsk, 0);
+out_drop:
+ drop_peercred(&st->peercred);
+ return err;
+}
+
+/*
+ * Positive returns from unix_stream_connect_commit() ask the caller to
+ * try again. They are distinct only for a caller with a fixed peer
+ * (kernel_unix_connect_direct()): a full backlog can be retried on the
+ * same peer, but a peer found dead cannot -- the by-name path must
+ * re-resolve it, and a fixed peer has no such recourse and fails.
+ */
+#define UNIX_CONNECT_STALE 1 /* peer was found dead */
+#define UNIX_CONNECT_FULL 2 /* backlog was full and we slept */
+
+/*
+ * Try to connect @sk to the listening peer @other, using the connect
+ * state @st built by unix_stream_connect_setup(). Takes and releases
+ * unix_state_lock(@other) itself.
+ *
+ * Returns 0 on success (@st->skb queued to @other, @st->newsk linked to
+ * @sk and @st->peercred consumed), a negative errno on terminal failure,
+ * or a positive value (UNIX_CONNECT_STALE / UNIX_CONNECT_FULL) asking the
+ * caller to re-obtain @other and call again -- because @other was found
+ * dead, or its backlog was full and we slept (updating @st->timeo)
+ * waiting for room.
+ */
+static int unix_stream_connect_commit(struct sock *sk, struct sock *other,
+ struct unix_connect_state *st)
+{
+ struct sock *newsk = st->newsk;
+ struct sk_buff *skb = st->skb;
+ struct unix_peercred *peercred = &st->peercred;
+ long *timeo = &st->timeo;
+ struct unix_sock *newu, *otheru;
+ unsigned char state;
+ int err;
unix_state_lock(other);
- /* Apparently VFS overslept socket death. Retry. */
+ /* Apparently VFS overslept socket death; ask the caller to retry. */
if (sock_flag(other, SOCK_DEAD)) {
unix_state_unlock(other);
- sock_put(other);
- goto restart;
+ return UNIX_CONNECT_STALE;
}
if (other->sk_state != TCP_LISTEN ||
@@ -1720,19 +1771,19 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uad
}
if (unix_recvq_full_lockless(other)) {
- if (!timeo) {
+ if (!*timeo) {
err = -EAGAIN;
goto out_unlock;
}
- timeo = unix_wait_for_peer(other, timeo);
- sock_put(other);
+ /* unix_wait_for_peer() drops unix_state_lock(other). */
+ *timeo = unix_wait_for_peer(other, *timeo);
- err = sock_intr_errno(timeo);
+ err = sock_intr_errno(*timeo);
if (signal_pending(current))
- goto out_free_skb;
+ return err;
- goto restart;
+ return UNIX_CONNECT_FULL;
}
/* self connect and simultaneous connect are eliminated
@@ -1765,7 +1816,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uad
newsk->sk_state = TCP_ESTABLISHED;
newsk->sk_type = sk->sk_type;
newsk->sk_scm_recv_flags = other->sk_scm_recv_flags;
- init_peercred(newsk, &peercred);
+ init_peercred(newsk, peercred);
newu = unix_sk(newsk);
newu->listener = other;
@@ -1813,20 +1864,118 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uad
spin_unlock(&other->sk_receive_queue.lock);
unix_state_unlock(other);
READ_ONCE(other->sk_data_ready)(other);
- sock_put(other);
return 0;
out_unlock:
unix_state_unlock(other);
+ return err;
+}
+
+static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uaddr,
+ int addr_len, int flags)
+{
+ struct sockaddr_un *sunaddr = (struct sockaddr_un *)uaddr;
+ struct sock *sk = sock->sk, *other;
+ struct unix_connect_state st = {};
+ struct net *net = sock_net(sk);
+ int err;
+
+ err = unix_validate_addr(sunaddr, addr_len);
+ if (err)
+ return err;
+
+ err = BPF_CGROUP_RUN_PROG_UNIX_CONNECT_LOCK(sk, uaddr, &addr_len);
+ if (err)
+ return err;
+
+ err = unix_stream_connect_setup(sock, flags, &st);
+ if (err)
+ return err;
+
+restart:
+ /* Find the listening sock. A positive return from
+ * unix_stream_connect_commit() means "retry": the peer had died,
+ * or its backlog was full and we slept -- so re-resolve the name.
+ */
+ other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags);
+ if (IS_ERR(other)) {
+ err = PTR_ERR(other);
+ goto out_free;
+ }
+
+ err = unix_stream_connect_commit(sk, other, &st);
sock_put(other);
-out_free_skb:
- consume_skb(skb);
-out_free_sk:
- unix_release_sock(newsk, 0);
-out:
- drop_peercred(&peercred);
+ switch (err) {
+ case 0:
+ return 0;
+ case UNIX_CONNECT_FULL:
+ goto restart;
+ case UNIX_CONNECT_STALE:
+ /* A full backlog or a dead peer: re-resolve and try again. */
+ goto restart;
+ case INT_MIN ... -1:
+ /* terminal errno, propagate as-is */
+ break;
+ default:
+ /* commit() only returns 0, a retry code, or an errno */
+ WARN_ONCE(1, "unix_stream_connect_commit() returned %d\n", err);
+ err = -EINVAL;
+ break;
+ }
+out_free:
+ unix_stream_connect_cleanup(&st);
+ return err;
+}
+
+/**
+ * kernel_unix_connect_direct - connect a socket to a specific AF_UNIX sock
+ * @other: a held listening sock to connect to (e.g. from
+ * unix_lookup_bsd_path())
+ * @sock: the connecting socket, created with sock_create_kern()
+ * @flags: connect flags; without O_NONBLOCK a full listen backlog on
+ * @other is waited on, as for connect(2)
+ *
+ * Connects @sock to @other without any name lookup, address validation
+ * or path-based permission check. For in-kernel callers that have
+ * already located the target under their own policy. The caller
+ * retains its reference on @other.
+ */
+int kernel_unix_connect_direct(struct sock *other, struct socket *sock, int flags)
+{
+ struct sock *sk = sock->sk;
+ struct unix_connect_state st = {};
+ int err;
+
+ err = unix_stream_connect_setup(sock, flags, &st);
+ if (err)
+ return err;
+
+restart:
+ sock_hold(other);
+ err = unix_stream_connect_commit(sk, other, &st);
+ sock_put(other);
+ switch (err) {
+ case 0:
+ return 0;
+ case UNIX_CONNECT_FULL:
+ goto restart;
+ case UNIX_CONNECT_STALE:
+ /* The peer is fixed, so a dead one cannot be re-found. */
+ err = -ECONNREFUSED;
+ break;
+ case INT_MIN ... -1:
+ /* terminal errno, propagate as-is */
+ break;
+ default:
+ /* commit() only returns 0, a retry code, or an errno */
+ WARN_ONCE(1, "unix_stream_connect_commit() returned %d\n", err);
+ err = -EINVAL;
+ break;
+ }
+ unix_stream_connect_cleanup(&st);
return err;
}
+EXPORT_SYMBOL_GPL(kernel_unix_connect_direct);
static int unix_socketpair(struct socket *socka, struct socket *sockb)
{
--
2.54.0
Thread overview: 7+ messages
2026-07-03 7:39 [RFC PATCH 0/3] coredump, net: fix layer violation with direct connection John Ericson
2026-07-03 7:39 ` [RFC PATCH 1/3] af_unix: factor out unix_lookup_bsd_path() John Ericson
2026-07-03 7:39 ` John Ericson [this message]
2026-07-03 7:39 ` [RFC PATCH 3/3] coredump, net: remove `SOCK_COREDUMP` John Ericson
2026-07-03 8:11 ` Christian Brauner
2026-07-03 9:08 ` John Ericson
2026-07-03 9:31 ` Christian Brauner