* [RFC] connectat()/bindat() or an alternative design
@ 2026-05-18 19:09 John Ericson
2026-06-08 19:45 ` Cong Wang
From: John Ericson @ 2026-05-18 19:09 UTC
To: network dev
Hello, I am interested in seeing something like BSD's connectat/bindat in
Linux.
The deficiencies of traditional `bind`/`connect` for unix sockets are
well-known, as are the advantages of openat2 over openat over open, so I
won't waste anyone's time restating either.
Here are two previous times this was raised, unfortunately with no replies:
https://lore.kernel.org/netdev/4FCF171B.8000207@parallels.com/
https://lore.kernel.org/all/CAEnbY+co6YLXANfeMnfBOBs8Ba_Sbdqz0Ahm8RzAhRF7MrxL4Q@mail.gmail.com/
Hoping this third time is the charm!
To give the discussion a bit more concreteness, I have taken a stab at
two refactors (with an LLM, full disclosure) to get a sense of what the
approaches look like. (To be clear, these are not patches I am trying to
formally submit; they are just sketches to guide the conversation. I haven't
built or tested them, just read them.)
https://github.com/Ericson2314/linux/tree/bindat-connectat
1. 9b72f7e2add657cfd0d755c7ea24e56f1aab7025
The first is a straightforward port of connectat/bindat, except for a (not yet
used) flags argument so the likes of `AT_SYMLINK_NOFOLLOW` and `AT_EMPTY_PATH`
could someday be supported. (Or `RESOLVE_NO_SYMLINKS` from `openat2` for
something more stringent than `AT_SYMLINK_NOFOLLOW`.)
I made some effort to deduplicate things as much as possible, and to support
io_uring, so the (again WIP) diff is somewhat elegant. Still, to me at least, a
glaring issue with this design is that it adds new syscalls and other
"entrypoint" machinery for *all* socket types, when only unix sockets stand to
benefit. For every other socket type, the extra parameters are just nonsense to
be ignored or, worse, a source of confusion.
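For illustration only (this is not part of the patch): the directory-relative
semantics that `bindat` would provide natively can be approximated on current
Linux by routing the lookup through the `/proc/self/fd` magic links. The sketch
below assumes `/proc` is mounted and `/tmp` is writable; the helper names
`bind_relative` and `demo` are hypothetical. It also shows the cost of the
workaround: the `/proc/self/fd/<n>/` prefix eats into the already small
`sun_path` buffer.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: bind `fd` to `name` inside the directory `dfd`
 * by resolving the path through /proc/self/fd/<dfd>/. */
static int bind_relative(int dfd, int fd, const char *name)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int n;

	assert(name != NULL);
	n = snprintf(addr.sun_path, sizeof(addr.sun_path),
		     "/proc/self/fd/%d/%s", dfd, name);

	/* sun_path is small (108 bytes on Linux); fail rather than truncate. */
	if (n < 0 || (size_t)n >= sizeof(addr.sun_path))
		return -1;

	return bind(fd, (struct sockaddr *)&addr,
		    offsetof(struct sockaddr_un, sun_path) + n + 1);
}

/* Returns 0 if a socket inode named "sock" appears under the directory fd. */
static int demo(void)
{
	char dir[] = "/tmp/bindrel.XXXXXX";
	struct stat st;
	int dfd, fd, ret = -1;

	if (!mkdtemp(dir))
		return -1;
	dfd = open(dir, O_RDONLY | O_DIRECTORY);
	fd = socket(AF_UNIX, SOCK_STREAM, 0);

	if (dfd >= 0 && fd >= 0 &&
	    bind_relative(dfd, fd, "sock") == 0 &&
	    fstatat(dfd, "sock", &st, 0) == 0 && S_ISSOCK(st.st_mode))
		ret = 0;

	/* Clean up whatever was created, ignoring errors. */
	if (dfd >= 0) {
		unlinkat(dfd, "sock", 0);
		close(dfd);
	}
	if (fd >= 0)
		close(fd);
	rmdir(dir);
	return ret;
}
```

Calling `demo()` from a `main` should return 0 on a system where the
assumptions above hold. A native `bindat(dfd, fd, addr, addrlen, flags)` would
make the `/proc` dependency and the path prefix unnecessary.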
That brings me to the next attempt, which I personally prefer:
https://github.com/Ericson2314/linux/tree/sockaddr_un2
1. 96c45c5cc43799f95a1a90a87cfaccf9377b82da
2. d183785e63d0b224de21fc4f7e505eef99b27592
Here, instead of making a new `connect`/`bind`, I've made a new `struct
sockaddr_un` alternative.
Actually, there are two commits. The first commit cleans up an internal data
structure to decouple it from `struct sockaddr_un` and thus the stable syscall
ABI. I like this commit in any event, but it is fine to also just view it as
prep for the second commit.
In that second commit, a new `struct sockaddr_un2` is created, which has a
pointer to a path rather than an inline path (to get around path length
issues), an fd or `AT_FDCWD` for the at-relative lookup, and the flags argument.
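For illustration only: a rough picture of what such an address structure could
look like from userspace. The layout in the linked commit is not reproduced
here, so every field name, width, and ordering below is an assumption, and the
`_sketch` suffix marks the type as hypothetical. The point is only that the
structure stays a fixed size no matter how long the path is, unlike
`struct sockaddr_un` with its inline 108-byte `sun_path`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h> /* sa_family_t, AF_UNIX */

/* HYPOTHETICAL layout, not the ABI from the sockaddr_un2 branch. */
struct sockaddr_un2_sketch {
	sa_family_t sun_family; /* address family (AF_UNIX assumed here) */
	uint16_t    reserved0;  /* explicit padding, must be zero */
	uint32_t    flags;      /* room for AT_* style lookup flags */
	int32_t     dirfd;      /* directory fd, or AT_FDCWD */
	uint32_t    reserved1;  /* explicit padding, must be zero */
	uint64_t    path;       /* user pointer to a NUL-terminated path */
};

/* Explicit padding keeps the size identical for 32-bit and 64-bit callers. */
_Static_assert(sizeof(struct sockaddr_un2_sketch) == 24,
	       "sketch layout should have no implicit padding");

/* Hypothetical helper: fill the sketch structure for a lookup of `path`
 * relative to `dirfd`. */
static void sockaddr_un2_fill(struct sockaddr_un2_sketch *sa, int dirfd,
			      const char *path, uint32_t flags)
{
	assert(path != NULL);
	memset(sa, 0, sizeof(*sa));
	sa->sun_family = AF_UNIX;
	sa->flags = flags;
	sa->dirfd = dirfd;
	/* Carry the pointer as a 64-bit integer so no compat variant is needed. */
	sa->path = (uint64_t)(uintptr_t)path;
}
```

Carrying the pointer as a `uint64_t` is one common way to avoid a separate
compat layout; whether the actual commit does this is not something this sketch
claims.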
I very much like how this localizes what is useful for unix sockets to just that
case, without bloating the code for the other socket types. Even including the
preparatory first commit, this way is still fewer lines of code, too. It feels
like the correct approach to me.
Ultimately though, I am happy to go with whichever approach the networking
maintainers prefer --- either would be preferable to the status quo. If one of
those (or something else!) looks promising, let me know, and I am happy to take
the time to polish it up into a proper submitted patch.
Thanks,
John
[-- Attachment #2: 0001-WIP-no-submit-connectat-and-bindat-like-BSD-for-Linu.patch --]
[-- Type: text/x-patch; name="0001-WIP-no-submit-connectat-and-bindat-like-BSD-for-Linu.patch", Size: 37267 bytes --]
From 9b72f7e2add657cfd0d755c7ea24e56f1aab7025 Mon Sep 17 00:00:00 2001
From: John Ericson <John.Ericson@Obsidian.Systems>
Date: Mon, 18 May 2026 11:17:03 -0400
Subject: [PATCH] [WIP, no-submit] `connectat` and `bindat` like BSD for Linux
A quick exploratory refactor with an LLM.
The `flags` param is new here. It would allow supporting e.g.
`AT_SYMLINK_NOFOLLOW` and `AT_EMPTY_PATH` in the future.
N.B. Tests have not been run yet. I don't know if those are good at all.
I just read and reviewed the refactors themselves as they are easy to
read.
Signed-off-by: John Ericson <John.Ericson@Obsidian.Systems>
Assisted-by: Claude:claude-opus-4-6
---
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
fs/namei.c | 7 +
include/linux/namei.h | 1 +
include/linux/net.h | 6 +
include/linux/socket.h | 14 +-
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 8 +-
include/uapi/linux/io_uring.h | 2 +
io_uring/net.c | 73 +++--
io_uring/net.h | 6 +-
io_uring/opdef.c | 32 ++-
net/compat.c | 4 +-
net/socket.c | 85 ++++--
net/unix/af_unix.c | 67 ++++-
tools/testing/selftests/net/af_unix/Makefile | 1 +
.../selftests/net/af_unix/unix_connect_at.c | 265 ++++++++++++++++++
17 files changed, 513 insertions(+), 64 deletions(-)
create mode 100644 tools/testing/selftests/net/af_unix/unix_connect_at.c
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f832ebd2d79b..89f436d8ca5e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -477,3 +477,5 @@
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns
471 i386 rseq_slice_yield sys_rseq_slice_yield
+472 i386 bindat sys_bindat
+473 i386 connectat sys_connectat
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..e12893dabca8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,8 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common bindat sys_bindat
+473 common connectat sys_connectat
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/namei.c b/fs/namei.c
index c7fac83c9a85..a58fcd9ab603 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3045,6 +3045,13 @@ int kern_path(const char *name, unsigned int flags, struct path *path)
}
EXPORT_SYMBOL(kern_path);
+int kern_path_at(int dfd, const char *name, unsigned int flags, struct path *path)
+{
+ CLASS(filename_kernel, filename)(name);
+ return filename_lookup(dfd, filename, flags, path, NULL);
+}
+EXPORT_SYMBOL(kern_path_at);
+
/**
* vfs_path_parent_lookup - lookup a parent path relative to a dentry-vfsmount pair
* @filename: filename structure
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 2ad6dd9987b9..950921232b03 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -55,6 +55,7 @@ extern int path_pts(struct path *path);
extern int user_path_at(int, const char __user *, unsigned, struct path *);
extern int kern_path(const char *, unsigned, struct path *);
+extern int kern_path_at(int, const char *, unsigned, struct path *);
struct dentry *kern_path_parent(const char *name, struct path *parent);
extern struct dentry *start_creating_path(int, const char *, struct path *, unsigned int);
diff --git a/include/linux/net.h b/include/linux/net.h
index f268f395ce47..8c65aff06d77 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -185,9 +185,15 @@ struct proto_ops {
int (*bind) (struct socket *sock,
struct sockaddr_unsized *myaddr,
int sockaddr_len);
+ int (*bind_at) (struct socket *sock, int dfd,
+ struct sockaddr_unsized *myaddr,
+ int sockaddr_len, int flags);
int (*connect) (struct socket *sock,
struct sockaddr_unsized *vaddr,
int sockaddr_len, int flags);
+ int (*connect_at)(struct socket *sock, int dfd,
+ struct sockaddr_unsized *vaddr,
+ int sockaddr_len, int flags);
int (*socketpair)(struct socket *sock1,
struct socket *sock2);
int (*accept) (struct socket *sock,
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ec4a0a025793..69e23adca997 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -459,13 +459,15 @@ extern int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
int __user *upeer_addrlen, int flags);
extern int __sys_socket(int family, int type, int protocol);
extern struct file *__sys_socket_file(int family, int type, int protocol);
-extern int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen);
-extern int __sys_bind_socket(struct socket *sock, struct sockaddr_storage *address,
- int addrlen);
-extern int __sys_connect_file(struct file *file, struct sockaddr_storage *addr,
+extern int __sys_bindat(int dfd, int fd, struct sockaddr __user *umyaddr,
+ int addrlen, int flags);
+extern int __sys_bind_socket(struct socket *sock, int dfd,
+ struct sockaddr_storage *address, int addrlen);
+extern int __sys_connect_file(struct file *file, int dfd,
+ struct sockaddr_storage *addr,
int addrlen, int file_flags);
-extern int __sys_connect(int fd, struct sockaddr __user *uservaddr,
- int addrlen);
+extern int __sys_connectat(int dfd, int fd, struct sockaddr __user *uservaddr,
+ int addrlen, int flags);
extern int __sys_listen(int fd, int backlog);
extern int __sys_listen_socket(struct socket *sock, int backlog);
extern int do_getsockname(struct socket *sock, int peer,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..cb87ded87ff2 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -772,9 +772,11 @@ asmlinkage long sys_shmdt(char __user *shmaddr);
asmlinkage long sys_socket(int, int, int);
asmlinkage long sys_socketpair(int, int, int, int __user *);
asmlinkage long sys_bind(int, struct sockaddr __user *, int);
+asmlinkage long sys_bindat(int, int, struct sockaddr __user *, int, int);
asmlinkage long sys_listen(int, int);
asmlinkage long sys_accept(int, struct sockaddr __user *, int __user *);
asmlinkage long sys_connect(int, struct sockaddr __user *, int);
+asmlinkage long sys_connectat(int, int, struct sockaddr __user *, int, int);
asmlinkage long sys_getsockname(int, struct sockaddr __user *, int __user *);
asmlinkage long sys_getpeername(int, struct sockaddr __user *, int __user *);
asmlinkage long sys_sendto(int, void __user *, size_t, unsigned,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..a4db55b4172f 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,14 @@ __SYSCALL(__NR_listns, sys_listns)
#define __NR_rseq_slice_yield 471
__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+/* net/socket.c */
+#define __NR_bindat 472
+__SYSCALL(__NR_bindat, sys_bindat)
+#define __NR_connectat 473
+__SYSCALL(__NR_connectat, sys_connectat)
+
#undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 474
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 909fb7aea638..7992638c23a0 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -318,6 +318,8 @@ enum io_uring_op {
IORING_OP_PIPE,
IORING_OP_NOP128,
IORING_OP_URING_CMD128,
+ IORING_OP_BINDAT,
+ IORING_OP_CONNECTAT,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/io_uring/net.c b/io_uring/net.c
index 30cd22c0b934..caf559282e7a 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/kernel.h>
#include <linux/errno.h>
+#include <linux/fcntl.h>
#include <linux/file.h>
#include <linux/slab.h>
#include <linux/net.h>
@@ -44,17 +45,19 @@ struct io_socket {
unsigned long nofile;
};
-struct io_connect {
+struct io_connectat {
struct file *file;
struct sockaddr __user *addr;
int addr_len;
+ int dfd;
bool in_progress;
bool seen_econnaborted;
};
-struct io_bind {
+struct io_bindat {
struct file *file;
int addr_len;
+ int dfd;
};
struct io_listen {
@@ -1728,16 +1731,17 @@ int io_socket(struct io_kiocb *req, unsigned int issue_flags)
return IOU_COMPLETE;
}
-int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+int io_connectat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- struct io_connect *conn = io_kiocb_to_cmd(req, struct io_connect);
+ struct io_connectat *conn = io_kiocb_to_cmd(req, struct io_connectat);
struct io_async_msghdr *io;
- if (sqe->len || sqe->buf_index || sqe->rw_flags || sqe->splice_fd_in)
+ if (sqe->len || sqe->buf_index || sqe->rw_flags)
return -EINVAL;
conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
- conn->addr_len = READ_ONCE(sqe->addr2);
+ conn->addr_len = READ_ONCE(sqe->addr2);
+ conn->dfd = READ_ONCE(sqe->splice_fd_in);
conn->in_progress = conn->seen_econnaborted = false;
io = io_msg_alloc_async(req);
@@ -1747,9 +1751,26 @@ int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return move_addr_to_kernel(conn->addr, conn->addr_len, &io->addr);
}
-int io_connect(struct io_kiocb *req, unsigned int issue_flags)
+int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_connectat *conn;
+ int ret;
+
+ if (sqe->splice_fd_in)
+ return -EINVAL;
+
+ ret = io_connectat_prep(req, sqe);
+ if (ret)
+ return ret;
+
+ conn = io_kiocb_to_cmd(req, struct io_connectat);
+ conn->dfd = AT_FDCWD;
+ return 0;
+}
+
+int io_connectat(struct io_kiocb *req, unsigned int issue_flags)
{
- struct io_connect *connect = io_kiocb_to_cmd(req, struct io_connect);
+ struct io_connectat *connect = io_kiocb_to_cmd(req, struct io_connectat);
struct io_async_msghdr *io = req->async_data;
unsigned file_flags;
int ret;
@@ -1764,8 +1785,8 @@ int io_connect(struct io_kiocb *req, unsigned int issue_flags)
file_flags = force_nonblock ? O_NONBLOCK : 0;
- ret = __sys_connect_file(req->file, &io->addr, connect->addr_len,
- file_flags);
+ ret = __sys_connect_file(req->file, connect->dfd, &io->addr,
+ connect->addr_len, file_flags);
if ((ret == -EAGAIN || ret == -EINPROGRESS || ret == -ECONNABORTED)
&& force_nonblock) {
if (ret == -EINPROGRESS) {
@@ -1799,17 +1820,18 @@ int io_connect(struct io_kiocb *req, unsigned int issue_flags)
return IOU_COMPLETE;
}
-int io_bind_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+int io_bindat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- struct io_bind *bind = io_kiocb_to_cmd(req, struct io_bind);
+ struct io_bindat *bind = io_kiocb_to_cmd(req, struct io_bindat);
struct sockaddr __user *uaddr;
struct io_async_msghdr *io;
- if (sqe->len || sqe->buf_index || sqe->rw_flags || sqe->splice_fd_in)
+ if (sqe->len || sqe->buf_index || sqe->rw_flags)
return -EINVAL;
uaddr = u64_to_user_ptr(READ_ONCE(sqe->addr));
- bind->addr_len = READ_ONCE(sqe->addr2);
+ bind->addr_len = READ_ONCE(sqe->addr2);
+ bind->dfd = READ_ONCE(sqe->splice_fd_in);
io = io_msg_alloc_async(req);
if (unlikely(!io))
@@ -1817,9 +1839,26 @@ int io_bind_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return move_addr_to_kernel(uaddr, bind->addr_len, &io->addr);
}
-int io_bind(struct io_kiocb *req, unsigned int issue_flags)
+int io_bind_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_bindat *bind;
+ int ret;
+
+ if (sqe->splice_fd_in)
+ return -EINVAL;
+
+ ret = io_bindat_prep(req, sqe);
+ if (ret)
+ return ret;
+
+ bind = io_kiocb_to_cmd(req, struct io_bindat);
+ bind->dfd = AT_FDCWD;
+ return 0;
+}
+
+int io_bindat(struct io_kiocb *req, unsigned int issue_flags)
{
- struct io_bind *bind = io_kiocb_to_cmd(req, struct io_bind);
+ struct io_bindat *bind = io_kiocb_to_cmd(req, struct io_bindat);
struct io_async_msghdr *io = req->async_data;
struct socket *sock;
int ret;
@@ -1828,7 +1867,7 @@ int io_bind(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(!sock))
return -ENOTSOCK;
- ret = __sys_bind_socket(sock, &io->addr, bind->addr_len);
+ ret = __sys_bind_socket(sock, bind->dfd, &io->addr, bind->addr_len);
if (ret < 0)
req_set_fail(req);
io_req_set_res(req, ret, 0);
diff --git a/io_uring/net.h b/io_uring/net.h
index d4d1ddce50e3..3c00fb6a1192 100644
--- a/io_uring/net.h
+++ b/io_uring/net.h
@@ -48,14 +48,16 @@ int io_socket(struct io_kiocb *req, unsigned int issue_flags);
void io_socket_bpf_populate(struct io_uring_bpf_ctx *bctx, struct io_kiocb *req);
int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
-int io_connect(struct io_kiocb *req, unsigned int issue_flags);
+int io_connectat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_connectat(struct io_kiocb *req, unsigned int issue_flags);
int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags);
int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
void io_send_zc_cleanup(struct io_kiocb *req);
int io_bind_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
-int io_bind(struct io_kiocb *req, unsigned int issue_flags);
+int io_bindat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_bindat(struct io_kiocb *req, unsigned int issue_flags);
int io_listen_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_listen(struct io_kiocb *req, unsigned int issue_flags);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index c3ef52b70811..cbe84574dc88 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -205,7 +205,7 @@ const struct io_issue_def io_issue_defs[] = {
#if defined(CONFIG_NET)
.async_size = sizeof(struct io_async_msghdr),
.prep = io_connect_prep,
- .issue = io_connect,
+ .issue = io_connectat,
#else
.prep = io_eopnotsupp_prep,
#endif
@@ -502,7 +502,7 @@ const struct io_issue_def io_issue_defs[] = {
#if defined(CONFIG_NET)
.needs_file = 1,
.prep = io_bind_prep,
- .issue = io_bind,
+ .issue = io_bindat,
.async_size = sizeof(struct io_async_msghdr),
#else
.prep = io_eopnotsupp_prep,
@@ -589,6 +589,28 @@ const struct io_issue_def io_issue_defs[] = {
.prep = io_uring_cmd_prep,
.issue = io_uring_cmd,
},
+ [IORING_OP_BINDAT] = {
+#if defined(CONFIG_NET)
+ .needs_file = 1,
+ .prep = io_bindat_prep,
+ .issue = io_bindat,
+ .async_size = sizeof(struct io_async_msghdr),
+#else
+ .prep = io_eopnotsupp_prep,
+#endif
+ },
+ [IORING_OP_CONNECTAT] = {
+ .needs_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollout = 1,
+#if defined(CONFIG_NET)
+ .async_size = sizeof(struct io_async_msghdr),
+ .prep = io_connectat_prep,
+ .issue = io_connectat,
+#else
+ .prep = io_eopnotsupp_prep,
+#endif
+ },
};
const struct io_cold_def io_cold_defs[] = {
@@ -847,6 +869,12 @@ const struct io_cold_def io_cold_defs[] = {
.sqe_copy = io_uring_cmd_sqe_copy,
.cleanup = io_uring_cmd_cleanup,
},
+ [IORING_OP_BINDAT] = {
+ .name = "BINDAT",
+ },
+ [IORING_OP_CONNECTAT] = {
+ .name = "CONNECTAT",
+ },
};
const char *io_uring_get_opcode(u8 opcode)
diff --git a/net/compat.c b/net/compat.c
index 2c9bd0edac99..bf92a5d3a470 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -448,10 +448,10 @@ COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)
ret = __sys_socket(a0, a1, a[2]);
break;
case SYS_BIND:
- ret = __sys_bind(a0, compat_ptr(a1), a[2]);
+ ret = __sys_bindat(AT_FDCWD, a0, compat_ptr(a1), a[2], 0);
break;
case SYS_CONNECT:
- ret = __sys_connect(a0, compat_ptr(a1), a[2]);
+ ret = __sys_connectat(AT_FDCWD, a0, compat_ptr(a1), a[2], 0);
break;
case SYS_LISTEN:
ret = __sys_listen(a0, a1);
diff --git a/net/socket.c b/net/socket.c
index 22a412fdec07..4628e10ea2c1 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -56,6 +56,7 @@
#include <linux/ethtool.h>
#include <linux/mm.h>
#include <linux/socket.h>
+#include <linux/fcntl.h>
#include <linux/file.h>
#include <linux/splice.h>
#include <linux/net.h>
@@ -1922,17 +1923,28 @@ SYSCALL_DEFINE4(socketpair, int, family, int, type, int, protocol,
return __sys_socketpair(family, type, protocol, usockvec);
}
-int __sys_bind_socket(struct socket *sock, struct sockaddr_storage *address,
- int addrlen)
+int __sys_bind_socket(struct socket *sock, int dfd,
+ struct sockaddr_storage *address, int addrlen)
{
+ const struct proto_ops *ops;
int err;
err = security_socket_bind(sock, (struct sockaddr *)address,
addrlen);
- if (!err)
- err = READ_ONCE(sock->ops)->bind(sock,
- (struct sockaddr_unsized *)address,
- addrlen);
+ if (!err) {
+ ops = READ_ONCE(sock->ops);
+ if (dfd != AT_FDCWD) {
+ if (!ops->bind_at)
+ return -EOPNOTSUPP;
+ err = ops->bind_at(sock, dfd,
+ (struct sockaddr_unsized *)address,
+ addrlen, 0);
+ } else {
+ err = ops->bind(sock,
+ (struct sockaddr_unsized *)address,
+ addrlen);
+ }
+ }
return err;
}
@@ -1944,13 +1956,22 @@ int __sys_bind_socket(struct socket *sock, struct sockaddr_storage *address,
* the protocol layer (having also checked the address is ok).
*/
-int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
+SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
+{
+ return __sys_bindat(AT_FDCWD, fd, umyaddr, addrlen, 0);
+}
+
+int __sys_bindat(int dfd, int fd, struct sockaddr __user *umyaddr,
+ int addrlen, int flags)
{
struct socket *sock;
struct sockaddr_storage address;
CLASS(fd, f)(fd);
int err;
+ if (flags)
+ return -EINVAL;
+
if (fd_empty(f))
return -EBADF;
sock = sock_from_file(fd_file(f));
@@ -1961,12 +1982,13 @@ int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
if (unlikely(err))
return err;
- return __sys_bind_socket(sock, &address, addrlen);
+ return __sys_bind_socket(sock, dfd, &address, addrlen);
}
-SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
+SYSCALL_DEFINE5(bindat, int, dfd, int, fd, struct sockaddr __user *, umyaddr,
+ int, addrlen, int, flags)
{
- return __sys_bind(fd, umyaddr, addrlen);
+ return __sys_bindat(dfd, fd, umyaddr, addrlen, flags);
}
/*
@@ -2128,9 +2150,11 @@ SYSCALL_DEFINE3(accept, int, fd, struct sockaddr __user *, upeer_sockaddr,
* include the -EINPROGRESS status for such sockets.
*/
-int __sys_connect_file(struct file *file, struct sockaddr_storage *address,
+int __sys_connect_file(struct file *file, int dfd,
+ struct sockaddr_storage *address,
int addrlen, int file_flags)
{
+ const struct proto_ops *ops;
struct socket *sock;
int err;
@@ -2145,18 +2169,39 @@ int __sys_connect_file(struct file *file, struct sockaddr_storage *address,
if (err)
goto out;
- err = READ_ONCE(sock->ops)->connect(sock, (struct sockaddr_unsized *)address,
- addrlen, sock->file->f_flags | file_flags);
+ ops = READ_ONCE(sock->ops);
+ if (dfd != AT_FDCWD) {
+ if (!ops->connect_at) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+ err = ops->connect_at(sock, dfd,
+ (struct sockaddr_unsized *)address,
+ addrlen, sock->file->f_flags | file_flags);
+ } else {
+ err = ops->connect(sock, (struct sockaddr_unsized *)address,
+ addrlen, sock->file->f_flags | file_flags);
+ }
out:
return err;
}
-int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen)
+SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
+ int, addrlen)
+{
+ return __sys_connectat(AT_FDCWD, fd, uservaddr, addrlen, 0);
+}
+
+int __sys_connectat(int dfd, int fd, struct sockaddr __user *uservaddr,
+ int addrlen, int flags)
{
struct sockaddr_storage address;
CLASS(fd, f)(fd);
int ret;
+ if (flags)
+ return -EINVAL;
+
if (fd_empty(f))
return -EBADF;
@@ -2164,13 +2209,13 @@ int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen)
if (ret)
return ret;
- return __sys_connect_file(fd_file(f), &address, addrlen, 0);
+ return __sys_connect_file(fd_file(f), dfd, &address, addrlen, 0);
}
-SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
- int, addrlen)
+SYSCALL_DEFINE5(connectat, int, dfd, int, fd, struct sockaddr __user *,
+ uservaddr, int, addrlen, int, flags)
{
- return __sys_connect(fd, uservaddr, addrlen);
+ return __sys_connectat(dfd, fd, uservaddr, addrlen, flags);
}
int do_getsockname(struct socket *sock, int peer,
@@ -3215,10 +3260,10 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
err = __sys_socket(a0, a1, a[2]);
break;
case SYS_BIND:
- err = __sys_bind(a0, (struct sockaddr __user *)a1, a[2]);
+ err = __sys_bindat(AT_FDCWD, a0, (struct sockaddr __user *)a1, a[2], 0);
break;
case SYS_CONNECT:
- err = __sys_connect(a0, (struct sockaddr __user *)a1, a[2]);
+ err = __sys_connectat(AT_FDCWD, a0, (struct sockaddr __user *)a1, a[2], 0);
break;
case SYS_LISTEN:
err = __sys_listen(a0, a1);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 1cbf36ea043b..67c8a393f33d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -843,8 +843,11 @@ static int unix_listen(struct socket *sock, int backlog)
static int unix_release(struct socket *);
static int unix_bind(struct socket *, struct sockaddr_unsized *, int);
+static int unix_bind_at(struct socket *, int, struct sockaddr_unsized *, int, int);
static int unix_stream_connect(struct socket *, struct sockaddr_unsized *,
int addr_len, int flags);
+static int unix_stream_connect_at(struct socket *, int, struct sockaddr_unsized *,
+ int, int);
static int unix_socketpair(struct socket *, struct socket *);
static int unix_accept(struct socket *, struct socket *, struct proto_accept_arg *arg);
static int unix_getname(struct socket *, struct sockaddr *, int);
@@ -867,6 +870,8 @@ static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
static int unix_dgram_connect(struct socket *, struct sockaddr_unsized *,
int, int);
+static int unix_dgram_connect_at(struct socket *, int, struct sockaddr_unsized *,
+ int, int);
static int unix_seqpacket_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_seqpacket_recvmsg(struct socket *, struct msghdr *, size_t,
int);
@@ -968,7 +973,9 @@ static const struct proto_ops unix_stream_ops = {
.owner = THIS_MODULE,
.release = unix_release,
.bind = unix_bind,
+ .bind_at = unix_bind_at,
.connect = unix_stream_connect,
+ .connect_at = unix_stream_connect_at,
.socketpair = unix_socketpair,
.accept = unix_accept,
.getname = unix_getname,
@@ -994,7 +1001,9 @@ static const struct proto_ops unix_dgram_ops = {
.owner = THIS_MODULE,
.release = unix_release,
.bind = unix_bind,
+ .bind_at = unix_bind_at,
.connect = unix_dgram_connect,
+ .connect_at = unix_dgram_connect_at,
.socketpair = unix_socketpair,
.accept = sock_no_accept,
.getname = unix_getname,
@@ -1018,7 +1027,9 @@ static const struct proto_ops unix_seqpacket_ops = {
.owner = THIS_MODULE,
.release = unix_release,
.bind = unix_bind,
+ .bind_at = unix_bind_at,
.connect = unix_stream_connect,
+ .connect_at = unix_stream_connect_at,
.socketpair = unix_socketpair,
.accept = unix_accept,
.getname = unix_getname,
@@ -1188,7 +1199,7 @@ static int unix_release(struct socket *sock)
}
static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
- int type, int flags)
+ int type, int flags, int dfd)
{
struct inode *inode;
struct path path;
@@ -1212,7 +1223,7 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
if (err)
goto fail;
} else {
- err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
+ err = kern_path_at(dfd, sunaddr->sun_path, LOOKUP_FOLLOW, &path);
if (err)
goto fail;
@@ -1273,12 +1284,13 @@ static struct sock *unix_find_abstract(struct net *net,
static struct sock *unix_find_other(struct net *net,
struct sockaddr_un *sunaddr,
- int addr_len, int type, int flags)
+ int addr_len, int type, int flags,
+ int dfd)
{
struct sock *sk;
if (sunaddr->sun_path[0])
- sk = unix_find_bsd(sunaddr, addr_len, type, flags);
+ sk = unix_find_bsd(sunaddr, addr_len, type, flags, dfd);
else
sk = unix_find_abstract(net, sunaddr, addr_len, type);
@@ -1348,7 +1360,7 @@ out: mutex_unlock(&u->bindlock);
}
static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
- int addr_len)
+ int addr_len, int dfd)
{
umode_t mode = S_IFSOCK |
(SOCK_INODE(sk->sk_socket)->i_mode & ~current_umask());
@@ -1370,7 +1382,7 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
* Get the parent directory, calculate the hash for last
* component.
*/
- dentry = start_creating_path(AT_FDCWD, addr->name->sun_path, &parent, 0);
+ dentry = start_creating_path(dfd, addr->name->sun_path, &parent, 0);
if (IS_ERR(dentry)) {
err = PTR_ERR(dentry);
goto out;
@@ -1461,11 +1473,20 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
}
static int unix_bind(struct socket *sock, struct sockaddr_unsized *uaddr, int addr_len)
+{
+ return unix_bind_at(sock, AT_FDCWD, uaddr, addr_len, 0);
+}
+
+static int unix_bind_at(struct socket *sock, int dfd,
+ struct sockaddr_unsized *uaddr, int addr_len, int flags)
{
struct sockaddr_un *sunaddr = (struct sockaddr_un *)uaddr;
struct sock *sk = sock->sk;
int err;
+ if (flags)
+ return -EINVAL;
+
if (addr_len == offsetof(struct sockaddr_un, sun_path) &&
sunaddr->sun_family == AF_UNIX)
return unix_autobind(sk);
@@ -1475,7 +1496,7 @@ static int unix_bind(struct socket *sock, struct sockaddr_unsized *uaddr, int ad
return err;
if (sunaddr->sun_path[0])
- err = unix_bind_bsd(sk, sunaddr, addr_len);
+ err = unix_bind_bsd(sk, sunaddr, addr_len, dfd);
else
err = unix_bind_abstract(sk, sunaddr, addr_len);
@@ -1506,8 +1527,9 @@ static void unix_state_double_unlock(struct sock *sk1, struct sock *sk2)
unix_state_unlock(sk2);
}
-static int unix_dgram_connect(struct socket *sock, struct sockaddr_unsized *addr,
- int alen, int flags)
+static int unix_dgram_connect_at(struct socket *sock, int dfd,
+ struct sockaddr_unsized *addr,
+ int alen, int at_flags)
{
struct sockaddr_un *sunaddr = (struct sockaddr_un *)addr;
struct sock *sk = sock->sk;
@@ -1534,7 +1556,8 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr_unsized *addr
}
restart:
- other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type, 0);
+ other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type,
+ 0, dfd);
if (IS_ERR(other)) {
err = PTR_ERR(other);
goto out;
@@ -1604,6 +1627,12 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr_unsized *addr
return err;
}
+static int unix_dgram_connect(struct socket *sock, struct sockaddr_unsized *addr,
+ int alen, int flags)
+{
+ return unix_dgram_connect_at(sock, AT_FDCWD, addr, alen, 0);
+}
+
static long unix_wait_for_peer(struct sock *other, long timeo)
{
struct unix_sock *u = unix_sk(other);
@@ -1625,12 +1654,14 @@ static long unix_wait_for_peer(struct sock *other, long timeo)
return timeo;
}
-static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uaddr,
- int addr_len, int flags)
+static int unix_stream_connect_at(struct socket *sock, int dfd,
+ struct sockaddr_unsized *uaddr,
+ int addr_len, int at_flags)
{
struct sockaddr_un *sunaddr = (struct sockaddr_un *)uaddr;
struct sock *sk = sock->sk, *newsk = NULL, *other = NULL;
struct unix_sock *u = unix_sk(sk), *newu, *otheru;
+ int flags = sock->file->f_flags;
struct unix_peercred peercred = {};
struct net *net = sock_net(sk);
struct sk_buff *skb = NULL;
@@ -1674,7 +1705,8 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uad
restart:
/* Find listening sock. */
- other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags);
+ other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags,
+ dfd);
if (IS_ERR(other)) {
err = PTR_ERR(other);
goto out_free_skb;
@@ -1805,6 +1837,12 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uad
return err;
}
+static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uaddr,
+ int addr_len, int flags)
+{
+ return unix_stream_connect_at(sock, AT_FDCWD, uaddr, addr_len, 0);
+}
+
static int unix_socketpair(struct socket *socka, struct socket *sockb)
{
struct unix_peercred ska_peercred = {}, skb_peercred = {};
@@ -2160,7 +2198,8 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
if (msg->msg_namelen) {
lookup:
other = unix_find_other(sock_net(sk), msg->msg_name,
- msg->msg_namelen, sk->sk_type, 0);
+ msg->msg_namelen, sk->sk_type, 0,
+ AT_FDCWD);
if (IS_ERR(other)) {
err = PTR_ERR(other);
goto out_free;
diff --git a/tools/testing/selftests/net/af_unix/Makefile b/tools/testing/selftests/net/af_unix/Makefile
index 4c0375e28bbe..13bb7a339e0e 100644
--- a/tools/testing/selftests/net/af_unix/Makefile
+++ b/tools/testing/selftests/net/af_unix/Makefile
@@ -13,6 +13,7 @@ TEST_GEN_PROGS := \
scm_rights \
so_peek_off \
unix_connect \
+ unix_connect_at \
unix_connreset \
# end of TEST_GEN_PROGS
diff --git a/tools/testing/selftests/net/af_unix/unix_connect_at.c b/tools/testing/selftests/net/af_unix/unix_connect_at.c
new file mode 100644
index 000000000000..203086e7f96b
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/unix_connect_at.c
@@ -0,0 +1,265 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/un.h>
+#include <unistd.h>
+
+#include "kselftest_harness.h"
+
+#ifndef __NR_bindat
+#define __NR_bindat 472
+#endif
+
+#ifndef __NR_connectat
+#define __NR_connectat 473
+#endif
+
+static int sys_bindat(int dfd, int fd, const struct sockaddr *addr,
+ socklen_t addrlen, int flags)
+{
+ return syscall(__NR_bindat, dfd, fd, addr, addrlen, flags);
+}
+
+static int sys_connectat(int dfd, int fd, const struct sockaddr *addr,
+ socklen_t addrlen, int flags)
+{
+ return syscall(__NR_connectat, dfd, fd, addr, addrlen, flags);
+}
+
+FIXTURE(bindat_connectat)
+{
+ int server, client;
+ int dirfd;
+ char tmpdir[64];
+};
+
+FIXTURE_VARIANT(bindat_connectat)
+{
+ int type;
+};
+
+FIXTURE_VARIANT_ADD(bindat_connectat, stream)
+{
+ .type = SOCK_STREAM,
+};
+
+FIXTURE_VARIANT_ADD(bindat_connectat, dgram)
+{
+ .type = SOCK_DGRAM,
+};
+
+FIXTURE_VARIANT_ADD(bindat_connectat, seqpacket)
+{
+ .type = SOCK_SEQPACKET,
+};
+
+FIXTURE_SETUP(bindat_connectat)
+{
+ snprintf(self->tmpdir, sizeof(self->tmpdir),
+ "/tmp/bindat_test.%d", getpid());
+ ASSERT_EQ(0, mkdir(self->tmpdir, 0700));
+ self->dirfd = open(self->tmpdir, O_RDONLY | O_DIRECTORY);
+ ASSERT_LE(0, self->dirfd);
+ self->server = -1;
+ self->client = -1;
+}
+
+FIXTURE_TEARDOWN(bindat_connectat)
+{
+ if (self->client >= 0)
+ close(self->client);
+ if (self->server >= 0)
+ close(self->server);
+ unlinkat(self->dirfd, "sock", 0);
+ close(self->dirfd);
+ rmdir(self->tmpdir);
+}
+
+/* bindat with a dirfd and relative path works */
+TEST_F(bindat_connectat, bindat_relative)
+{
+ struct sockaddr_un addr = {
+ .sun_family = AF_UNIX,
+ };
+ socklen_t addrlen;
+ struct stat st;
+ char path[128];
+
+ strcpy(addr.sun_path, "sock");
+ addrlen = offsetof(struct sockaddr_un, sun_path) + strlen("sock") + 1;
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(0, sys_bindat(self->dirfd, self->server,
+ (struct sockaddr *)&addr, addrlen, 0));
+
+ /* Verify socket file was created in the right directory */
+ snprintf(path, sizeof(path), "%s/sock", self->tmpdir);
+ ASSERT_EQ(0, stat(path, &st));
+ ASSERT_TRUE(S_ISSOCK(st.st_mode));
+}
+
+/* connectat with a dirfd and relative path works */
+TEST_F(bindat_connectat, connectat_relative)
+{
+ struct sockaddr_un addr = {
+ .sun_family = AF_UNIX,
+ };
+ socklen_t addrlen;
+
+ strcpy(addr.sun_path, "sock");
+ addrlen = offsetof(struct sockaddr_un, sun_path) + strlen("sock") + 1;
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(0, sys_bindat(self->dirfd, self->server,
+ (struct sockaddr *)&addr, addrlen, 0));
+
+ if (variant->type == SOCK_STREAM || variant->type == SOCK_SEQPACKET) {
+ ASSERT_EQ(0, listen(self->server, 1));
+ }
+
+ self->client = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->client);
+
+ ASSERT_EQ(0, sys_connectat(self->dirfd, self->client,
+ (struct sockaddr *)&addr, addrlen, 0));
+}
+
+/* AT_FDCWD behaves like regular bind/connect */
+TEST_F(bindat_connectat, at_fdcwd)
+{
+ struct sockaddr_un addr = {
+ .sun_family = AF_UNIX,
+ };
+ socklen_t addrlen;
+ char path[128];
+
+ snprintf(path, sizeof(path), "%s/sock", self->tmpdir);
+ strcpy(addr.sun_path, path);
+ addrlen = offsetof(struct sockaddr_un, sun_path) + strlen(path) + 1;
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(0, sys_bindat(AT_FDCWD, self->server,
+ (struct sockaddr *)&addr, addrlen, 0));
+
+ if (variant->type == SOCK_STREAM || variant->type == SOCK_SEQPACKET) {
+ ASSERT_EQ(0, listen(self->server, 1));
+ }
+
+ self->client = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->client);
+
+ ASSERT_EQ(0, sys_connectat(AT_FDCWD, self->client,
+ (struct sockaddr *)&addr, addrlen, 0));
+}
+
+/* Non-zero flags are rejected */
+TEST_F(bindat_connectat, bad_flags)
+{
+ struct sockaddr_un addr = {
+ .sun_family = AF_UNIX,
+ .sun_path = "sock",
+ };
+ socklen_t addrlen;
+
+ addrlen = offsetof(struct sockaddr_un, sun_path) + strlen("sock") + 1;
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(-1, sys_bindat(self->dirfd, self->server,
+ (struct sockaddr *)&addr, addrlen, 1));
+ ASSERT_EQ(EINVAL, errno);
+
+ ASSERT_EQ(-1, sys_connectat(self->dirfd, self->server,
+ (struct sockaddr *)&addr, addrlen, 1));
+ ASSERT_EQ(EINVAL, errno);
+}
+
+/* Non-AF_UNIX socket with dfd != AT_FDCWD returns EOPNOTSUPP */
+TEST_F(bindat_connectat, non_unix_eopnotsupp)
+{
+ struct sockaddr_in addr = {
+ .sin_family = AF_INET,
+ };
+ int fd;
+
+ fd = socket(AF_INET, SOCK_STREAM, 0);
+ if (fd < 0)
+ SKIP(return, "AF_INET socket not available");
+
+ ASSERT_EQ(-1, sys_bindat(self->dirfd, fd,
+ (struct sockaddr *)&addr, sizeof(addr), 0));
+ ASSERT_EQ(EOPNOTSUPP, errno);
+
+ ASSERT_EQ(-1, sys_connectat(self->dirfd, fd,
+ (struct sockaddr *)&addr, sizeof(addr), 0));
+ ASSERT_EQ(EOPNOTSUPP, errno);
+
+ close(fd);
+}
+
+/* Abstract sockets work with bindat (dfd is ignored for abstract) */
+TEST_F(bindat_connectat, abstract_socket)
+{
+ struct sockaddr_un addr = {
+ .sun_family = AF_UNIX,
+ };
+ socklen_t addrlen;
+ char abstract_name[20];
+
+ snprintf(abstract_name, sizeof(abstract_name), "bindat%d", getpid());
+ addr.sun_path[0] = '\0';
+ memcpy(addr.sun_path + 1, abstract_name, strlen(abstract_name));
+ addrlen = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(abstract_name);
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(0, sys_bindat(self->dirfd, self->server,
+ (struct sockaddr *)&addr, addrlen, 0));
+
+ if (variant->type == SOCK_STREAM || variant->type == SOCK_SEQPACKET) {
+ ASSERT_EQ(0, listen(self->server, 1));
+ }
+
+ self->client = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->client);
+
+ ASSERT_EQ(0, sys_connectat(self->dirfd, self->client,
+ (struct sockaddr *)&addr, addrlen, 0));
+}
+
+/* bindat with a bad dirfd returns EBADF (the dirfd lookup fails) */
+TEST_F(bindat_connectat, bad_dirfd)
+{
+ struct sockaddr_un addr = {
+ .sun_family = AF_UNIX,
+ .sun_path = "sock",
+ };
+ socklen_t addrlen;
+
+ addrlen = offsetof(struct sockaddr_un, sun_path) + strlen("sock") + 1;
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(-1, sys_bindat(9999, self->server,
+ (struct sockaddr *)&addr, addrlen, 0));
+ ASSERT_EQ(EBADF, errno);
+}
+
+TEST_HARNESS_MAIN
--
2.51.2
[-- Attachment #3: 0001-WIP-no-submit-af_unix-store-only-the-name-in-unix_ad.patch --]
[-- Type: text/x-patch; name="0001-WIP-no-submit-af_unix-store-only-the-name-in-unix_ad.patch", Size: 19155 bytes --]
From 96c45c5cc43799f95a1a90a87cfaccf9377b82da Mon Sep 17 00:00:00 2001
From: John Ericson <John.Ericson@Obsidian.Systems>
Date: Mon, 18 May 2026 12:51:49 -0400
Subject: [PATCH 1/2] [WIP, no-submit] af_unix: store only the name in
`unix_address`, not the full `sockaddr_un`
`struct unix_address` previously stored a trailing `struct sockaddr_un
name[]`, including the redundant `sun_family` field. Since
`unix_address` is only used within `AF_UNIX` code, the family is always
`AF_UNIX` and adds no information.
Change `unix_address.name` to `char name[]` containing just the path/
abstract name bytes, with `addr->len` being the name length. This
removes the redundant family field and decouples the internal
representation from the `sockaddr_un` wire format, which will make it
easier to support improved ways of binding/connecting unix sockets in
the future. ("Improved" here means supporting use cases such as longer
paths, fd-relative paths, and openat{,2}-style flags.)
Signed-off-by: John Ericson <John.Ericson@Obsidian.Systems>
Assisted-by: Claude:claude-opus-4-6
---
include/net/af_unix.h | 4 +-
net/unix/af_unix.c | 64 ++++++++++++++---------------
net/unix/diag.c | 4 +-
security/apparmor/af_unix.c | 40 +++++++++---------
security/apparmor/include/af_unix.h | 9 ++--
security/apparmor/net.c | 29 ++++++-------
security/landlock/task.c | 3 +-
security/lsm_audit.c | 4 +-
8 files changed, 73 insertions(+), 84 deletions(-)
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 34f53dde65ce..af5258923062 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -23,8 +23,8 @@ static inline struct unix_sock *unix_get_socket(struct file *filp)
struct unix_address {
refcount_t refcnt;
- int len;
- struct sockaddr_un name[];
+ int len; /* length of name[] */
+ char name[];
};
struct scm_stat {
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 1cbf36ea043b..e690cb88f54e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -212,10 +212,9 @@ static unsigned int unix_bsd_hash(struct inode *i)
return i->i_ino & UNIX_HASH_MOD;
}
-static unsigned int unix_abstract_hash(struct sockaddr_un *sunaddr,
- int addr_len, int type)
+static unsigned int unix_abstract_hash(const char *name, int name_len, int type)
{
- __wsum csum = csum_partial(sunaddr, addr_len, 0);
+ __wsum csum = csum_partial(name, name_len, 0);
unsigned int hash;
hash = (__force unsigned int)csum_fold(csum);
@@ -303,18 +302,17 @@ struct sock *unix_peer_get(struct sock *s)
}
EXPORT_SYMBOL_GPL(unix_peer_get);
-static struct unix_address *unix_create_addr(struct sockaddr_un *sunaddr,
- int addr_len)
+static struct unix_address *unix_create_addr(const char *name, int name_len)
{
struct unix_address *addr;
- addr = kmalloc(sizeof(*addr) + addr_len, GFP_KERNEL);
+ addr = kmalloc(sizeof(*addr) + name_len, GFP_KERNEL);
if (!addr)
return NULL;
refcount_set(&addr->refcnt, 1);
- addr->len = addr_len;
- memcpy(addr->name, sunaddr, addr_len);
+ addr->len = name_len;
+ memcpy(addr->name, name, name_len);
return addr;
}
@@ -423,7 +421,7 @@ static void unix_remove_bsd_socket(struct sock *sk)
}
static struct sock *__unix_find_socket_byname(struct net *net,
- struct sockaddr_un *sunname,
+ const char *name,
int len, unsigned int hash)
{
struct sock *s;
@@ -432,20 +430,20 @@ static struct sock *__unix_find_socket_byname(struct net *net,
struct unix_sock *u = unix_sk(s);
if (u->addr->len == len &&
- !memcmp(u->addr->name, sunname, len))
+ !memcmp(u->addr->name, name, len))
return s;
}
return NULL;
}
static inline struct sock *unix_find_socket_byname(struct net *net,
- struct sockaddr_un *sunname,
+ const char *name,
int len, unsigned int hash)
{
struct sock *s;
spin_lock(&net->unx.table.locks[hash]);
- s = __unix_find_socket_byname(net, sunname, len, hash);
+ s = __unix_find_socket_byname(net, name, len, hash);
if (s)
sock_hold(s);
spin_unlock(&net->unx.table.locks[hash]);
@@ -1256,11 +1254,12 @@ static struct sock *unix_find_abstract(struct net *net,
struct sockaddr_un *sunaddr,
int addr_len, int type)
{
- unsigned int hash = unix_abstract_hash(sunaddr, addr_len, type);
+ int name_len = addr_len - offsetof(struct sockaddr_un, sun_path);
+ unsigned int hash = unix_abstract_hash(sunaddr->sun_path, name_len, type);
struct dentry *dentry;
struct sock *sk;
- sk = unix_find_socket_byname(net, sunaddr, addr_len, hash);
+ sk = unix_find_socket_byname(net, sunaddr->sun_path, name_len, hash);
if (!sk)
return ERR_PTR(-ECONNREFUSED);
@@ -1302,13 +1301,11 @@ static int unix_autobind(struct sock *sk)
goto out;
err = -ENOMEM;
- addr = kzalloc(sizeof(*addr) +
- offsetof(struct sockaddr_un, sun_path) + 16, GFP_KERNEL);
+ addr = kzalloc(sizeof(*addr) + 6 + 1, GFP_KERNEL); /* +1: sprintf() NUL */
if (!addr)
goto out;
- addr->len = offsetof(struct sockaddr_un, sun_path) + 6;
- addr->name->sun_family = AF_UNIX;
+ addr->len = 6;
refcount_set(&addr->refcnt, 1);
old_hash = sk->sk_hash;
@@ -1316,7 +1313,7 @@ static int unix_autobind(struct sock *sk)
lastnum = ordernum & 0xFFFFF;
retry:
ordernum = (ordernum + 1) & 0xFFFFF;
- sprintf(addr->name->sun_path + 1, "%05x", ordernum);
+ sprintf(addr->name + 1, "%05x", ordernum);
new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
unix_table_double_lock(net, old_hash, new_hash);
@@ -1362,7 +1359,8 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
int err;
addr_len = unix_mkname_bsd(sunaddr, addr_len);
- addr = unix_create_addr(sunaddr, addr_len);
+ addr = unix_create_addr(sunaddr->sun_path,
+ addr_len - offsetof(struct sockaddr_un, sun_path));
if (!addr)
return -ENOMEM;
@@ -1370,7 +1368,7 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
* Get the parent directory, calculate the hash for last
* component.
*/
- dentry = start_creating_path(AT_FDCWD, addr->name->sun_path, &parent, 0);
+ dentry = start_creating_path(AT_FDCWD, addr->name, &parent, 0);
if (IS_ERR(dentry)) {
err = PTR_ERR(dentry);
goto out;
@@ -1415,7 +1413,6 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
unix_release_addr(addr);
return err == -EEXIST ? -EADDRINUSE : err;
}
-
static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
int addr_len)
{
@@ -1425,7 +1422,8 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
struct unix_address *addr;
int err;
- addr = unix_create_addr(sunaddr, addr_len);
+ addr = unix_create_addr(sunaddr->sun_path,
+ addr_len - offsetof(struct sockaddr_un, sun_path));
if (!addr)
return -ENOMEM;
@@ -1908,8 +1906,9 @@ static int unix_getname(struct socket *sock, struct sockaddr *uaddr, int peer)
sunaddr->sun_path[0] = 0;
err = offsetof(struct sockaddr_un, sun_path);
} else {
- err = addr->len;
- memcpy(sunaddr, addr->name, addr->len);
+ err = offsetof(struct sockaddr_un, sun_path) + addr->len;
+ sunaddr->sun_family = AF_UNIX;
+ memcpy(sunaddr->sun_path, addr->name, addr->len);
if (peer)
BPF_CGROUP_RUN_SA_PROG(sk, uaddr, &err,
@@ -2558,8 +2557,11 @@ static void unix_copy_addr(struct msghdr *msg, struct sock *sk)
struct unix_address *addr = smp_load_acquire(&unix_sk(sk)->addr);
if (addr) {
- msg->msg_namelen = addr->len;
- memcpy(msg->msg_name, addr->name, addr->len);
+ struct sockaddr_un *sunaddr = msg->msg_name;
+
+ msg->msg_namelen = offsetof(struct sockaddr_un, sun_path) + addr->len;
+ sunaddr->sun_family = AF_UNIX;
+ memcpy(sunaddr->sun_path, addr->name, addr->len);
}
}
@@ -3581,17 +3583,15 @@ static int unix_seq_show(struct seq_file *seq, void *v)
seq_putc(seq, ' ');
i = 0;
- len = u->addr->len -
- offsetof(struct sockaddr_un, sun_path);
- if (u->addr->name->sun_path[0]) {
+ len = u->addr->len;
+ if (u->addr->name[0]) {
len--;
} else {
seq_putc(seq, '@');
i++;
}
for ( ; i < len; i++)
- seq_putc(seq, u->addr->name->sun_path[i] ?:
- '@');
+ seq_putc(seq, u->addr->name[i] ?: '@');
}
unix_state_unlock(s);
seq_putc(seq, '\n');
diff --git a/net/unix/diag.c b/net/unix/diag.c
index cca7dea05370..ff4d9d58e119 100644
--- a/net/unix/diag.c
+++ b/net/unix/diag.c
@@ -21,9 +21,7 @@ static int sk_diag_dump_name(struct sock *sk, struct sk_buff *nlskb)
if (!addr)
return 0;
- return nla_put(nlskb, UNIX_DIAG_NAME,
- addr->len - offsetof(struct sockaddr_un, sun_path),
- addr->name->sun_path);
+ return nla_put(nlskb, UNIX_DIAG_NAME, addr->len, addr->name);
}
static int sk_diag_dump_vfs(struct sock *sk, struct sk_buff *nlskb)
diff --git a/security/apparmor/af_unix.c b/security/apparmor/af_unix.c
index fdb4a9f212c3..fb3d871a0042 100644
--- a/security/apparmor/af_unix.c
+++ b/security/apparmor/af_unix.c
@@ -67,12 +67,11 @@ static int unix_fs_perm(const char *op, u32 mask, const struct cred *subj_cred,
#define FS_ADDR "/" /* path addr in fs */
static aa_state_t match_addr(struct aa_dfa *dfa, aa_state_t state,
- struct sockaddr_un *addr, int addrlen)
+ const char *name, int name_len)
{
- if (addr)
+ if (name)
/* include leading \0 */
- state = aa_dfa_match_len(dfa, state, addr->sun_path,
- unix_addr_len(addrlen));
+ state = aa_dfa_match_len(dfa, state, name, name_len);
else
state = aa_dfa_match_len(dfa, state, ANONYMOUS_ADDR, 1);
/* todo: could change to out of band for cleaner separation */
@@ -84,14 +83,14 @@ static aa_state_t match_addr(struct aa_dfa *dfa, aa_state_t state,
static aa_state_t match_to_local(struct aa_policydb *policy,
aa_state_t state, u32 request,
int type, int protocol,
- struct sockaddr_un *addr, int addrlen,
+ const char *name, int name_len,
struct aa_perms **p,
const char **info)
{
state = aa_match_to_prot(policy, state, request, PF_UNIX, type,
protocol, NULL, info);
if (state) {
- state = match_addr(policy->dfa, state, addr, addrlen);
+ state = match_addr(policy->dfa, state, name, name_len);
if (state) {
/* todo: local label matching */
state = aa_dfa_null_transition(policy->dfa, state);
@@ -105,7 +104,7 @@ static aa_state_t match_to_local(struct aa_policydb *policy,
return state;
}
-struct sockaddr_un *aa_sunaddr(const struct unix_sock *u, int *addrlen)
+const char *aa_unix_addr_name(const struct unix_sock *u, int *addrlen)
{
struct unix_address *addr;
@@ -124,11 +123,11 @@ static aa_state_t match_to_sk(struct aa_policydb *policy,
struct unix_sock *u, struct aa_perms **p,
const char **info)
{
- int addrlen;
- struct sockaddr_un *addr = aa_sunaddr(u, &addrlen);
+ int name_len;
+ const char *name = aa_unix_addr_name(u, &name_len);
return match_to_local(policy, state, request, u->sk.sk_type,
- u->sk.sk_protocol, addr, addrlen, p, info);
+ u->sk.sk_protocol, name, name_len, p, info);
}
#define CMD_ADDR 1
@@ -154,7 +153,7 @@ static aa_state_t match_to_cmd(struct aa_policydb *policy, aa_state_t state,
static aa_state_t match_to_peer(struct aa_policydb *policy, aa_state_t state,
u32 request, struct unix_sock *u,
- struct sockaddr_un *peer_addr, int peer_addrlen,
+ const char *peer_addr, int peer_addrlen,
struct aa_perms **p, const char **info)
{
AA_BUG(!p);
@@ -271,8 +270,7 @@ static int profile_bind_perm(struct aa_profile *profile, struct sock *sk,
/* bind for abstract socket */
state = match_to_local(rules->policy, state, AA_MAY_BIND,
sk->sk_type, sk->sk_protocol,
- unix_addr(ad->net.addr),
- ad->net.addrlen,
+ ad->net.addr, ad->net.addrlen,
&p, &ad->info);
return aa_do_perms(profile, rules->policy, state, AA_MAY_BIND,
@@ -387,7 +385,7 @@ static int profile_opt_perm(struct aa_profile *profile, u32 request,
/* null peer_label is allowed, in which case the peer_sk label is used */
static int profile_peer_perm(struct aa_profile *profile, u32 request,
struct sock *sk, const struct path *path,
- struct sockaddr_un *peer_addr,
+ const char *peer_addr,
int peer_addrlen, const struct path *peer_path,
struct aa_label *peer_label,
struct apparmor_audit_data *ad)
@@ -500,8 +498,8 @@ int aa_unix_bind_perm(struct socket *sock, struct sockaddr *addr,
if (!unconfined(label)) {
DEFINE_AUDIT_SK(ad, OP_BIND, current_cred(), sock->sk);
- ad.net.addr = unix_addr(addr);
- ad.net.addrlen = addrlen;
+ ad.net.addr = unix_addr(addr)->sun_path;
+ ad.net.addrlen = addrlen - offsetof(struct sockaddr_un, sun_path);
error = fn_for_each_confined(label, profile,
profile_bind_perm(profile, sock->sk, &ad));
@@ -600,7 +598,7 @@ int aa_unix_opt_perm(const char *op, u32 request, struct socket *sock,
static int unix_peer_perm(const struct cred *subj_cred,
struct aa_label *label, const char *op, u32 request,
struct sock *sk, const struct path *path,
- struct sockaddr_un *peer_addr, int peer_addrlen,
+ const char *peer_addr, int peer_addrlen,
const struct path *peer_path, struct aa_label *peer_label)
{
struct aa_profile *profile;
@@ -628,7 +626,7 @@ int aa_unix_peer_perm(const struct cred *subj_cred,
struct unix_sock *peeru = unix_sk(peer_sk);
struct unix_sock *u = unix_sk(sk);
int plen;
- struct sockaddr_un *paddr = aa_sunaddr(unix_sk(peer_sk), &plen);
+ const char *paddr = aa_unix_addr_name(unix_sk(peer_sk), &plen);
AA_BUG(!label);
AA_BUG(!sk);
@@ -710,7 +708,7 @@ int aa_unix_file_perm(const struct cred *subj_cred, struct aa_label *label,
const char *op, u32 request, struct file *file)
{
struct socket *sock = (struct socket *) file->private_data;
- struct sockaddr_un *addr, *peer_addr;
+ const char *addr, *peer_addr;
int addrlen, peer_addrlen;
struct aa_label *plabel = NULL;
struct sock *peer_sk = NULL;
@@ -734,7 +732,7 @@ int aa_unix_file_perm(const struct cred *subj_cred, struct aa_label *label,
sock_hold(peer_sk);
is_sk_fs = is_unix_fs(sock->sk);
- addr = aa_sunaddr(unix_sk(sock->sk), &addrlen);
+ addr = aa_unix_addr_name(unix_sk(sock->sk), &addrlen);
path = unix_sk(sock->sk)->path;
unix_state_unlock(sock->sk);
@@ -748,7 +746,7 @@ int aa_unix_file_perm(const struct cred *subj_cred, struct aa_label *label,
if (!peer_sk)
goto out;
- peer_addr = aa_sunaddr(unix_sk(peer_sk), &peer_addrlen);
+ peer_addr = aa_unix_addr_name(unix_sk(peer_sk), &peer_addrlen);
struct path peer_path;
diff --git a/security/apparmor/include/af_unix.h b/security/apparmor/include/af_unix.h
index 4a62e600d82b..b498f05e0871 100644
--- a/security/apparmor/include/af_unix.h
+++ b/security/apparmor/include/af_unix.h
@@ -18,20 +18,19 @@
#include "label.h"
#define unix_addr(A) ((struct sockaddr_un *)(A))
-#define unix_addr_len(L) ((L) - sizeof(sa_family_t))
#define unix_peer(sk) (unix_sk(sk)->peer)
#define is_unix_addr_abstract_name(B) ((B)[0] == 0)
-#define is_unix_addr_anon(A, L) ((A) && unix_addr_len(L) <= 0)
+#define is_unix_addr_anon(A, L) ((A) && (L) <= 0)
#define is_unix_addr_fs(A, L) (!is_unix_addr_anon(A, L) && \
- !is_unix_addr_abstract_name(unix_addr(A)->sun_path))
+ !is_unix_addr_abstract_name(A))
#define is_unix_anonymous(U) (!unix_sk(U)->addr)
#define is_unix_fs(U) (!is_unix_anonymous(U) && \
- unix_sk(U)->addr->name->sun_path[0])
+ unix_sk(U)->addr->name[0])
#define is_unix_connected(S) ((S)->state == SS_CONNECTED)
-struct sockaddr_un *aa_sunaddr(const struct unix_sock *u, int *addrlen);
+const char *aa_unix_addr_name(const struct unix_sock *u, int *addrlen);
int aa_unix_peer_perm(const struct cred *subj_cred,
struct aa_label *label, const char *op, u32 request,
struct sock *sk, struct sock *peer_sk,
diff --git a/security/apparmor/net.c b/security/apparmor/net.c
index 44c04102062f..c1736d93af62 100644
--- a/security/apparmor/net.c
+++ b/security/apparmor/net.c
@@ -74,22 +74,19 @@ static const char * const net_mask_names[] = {
};
static void audit_unix_addr(struct audit_buffer *ab, const char *str,
- struct sockaddr_un *addr, int addrlen)
+ const char *name, int name_len)
{
- int len = unix_addr_len(addrlen);
-
- if (!addr || len <= 0) {
+ if (!name || name_len <= 0) {
audit_log_format(ab, " %s=none", str);
- } else if (addr->sun_path[0]) {
+ } else if (name[0]) {
audit_log_format(ab, " %s=", str);
- audit_log_untrustedstring(ab, addr->sun_path);
+ audit_log_untrustedstring(ab, name);
} else {
audit_log_format(ab, " %s=\"@", str);
- if (audit_string_contains_control(&addr->sun_path[1], len - 1))
- audit_log_n_hex(ab, &addr->sun_path[1], len - 1);
+ if (audit_string_contains_control(&name[1], name_len - 1))
+ audit_log_n_hex(ab, &name[1], name_len - 1);
else
- audit_log_format(ab, "%.*s", len - 1,
- &addr->sun_path[1]);
+ audit_log_format(ab, "%.*s", name_len - 1, &name[1]);
audit_log_format(ab, "\"");
}
}
@@ -100,13 +97,12 @@ static void audit_unix_sk_addr(struct audit_buffer *ab, const char *str,
const struct unix_sock *u = unix_sk(sk);
if (u && u->addr) {
- int addrlen;
- struct sockaddr_un *addr = aa_sunaddr(u, &addrlen);
+ int name_len;
+ const char *name = aa_unix_addr_name(u, &name_len);
- audit_unix_addr(ab, str, addr, addrlen);
+ audit_unix_addr(ab, str, name, name_len);
} else {
audit_unix_addr(ab, str, NULL, 0);
-
}
}
@@ -144,13 +140,12 @@ void audit_net_cb(struct audit_buffer *ab, void *va)
if (ad->common.u.net->family == PF_UNIX) {
if (ad->net.addr || !ad->common.u.net->sk)
audit_unix_addr(ab, "addr",
- unix_addr(ad->net.addr),
- ad->net.addrlen);
+ ad->net.addr, ad->net.addrlen);
else
audit_unix_sk_addr(ab, "addr", ad->common.u.net->sk);
if (ad->request & NET_PEER_MASK) {
audit_unix_addr(ab, "peer_addr",
- unix_addr(ad->net.peer.addr),
+ ad->net.peer.addr,
ad->net.peer.addrlen);
}
}
diff --git a/security/landlock/task.c b/security/landlock/task.c
index 6d46042132ce..2d668564f45a 100644
--- a/security/landlock/task.c
+++ b/security/landlock/task.c
@@ -255,8 +255,7 @@ static bool is_abstract_socket(struct sock *const sock)
if (!addr)
return false;
- if (addr->len >= offsetof(struct sockaddr_un, sun_path) + 1 &&
- addr->name->sun_path[0] == '\0')
+ if (addr->len >= 1 && addr->name[0] == '\0')
return true;
return false;
diff --git a/security/lsm_audit.c b/security/lsm_audit.c
index 737f5a263a8f..8bea2b88b94f 100644
--- a/security/lsm_audit.c
+++ b/security/lsm_audit.c
@@ -324,8 +324,8 @@ void audit_log_lsm_data(struct audit_buffer *ab,
audit_log_d_path(ab, " path=", &u->path);
break;
}
- len = addr->len-sizeof(short);
- p = &addr->name->sun_path[0];
+ len = addr->len;
+ p = addr->name;
audit_log_format(ab, " path=");
if (*p)
audit_log_untrustedstring(ab, p);
--
2.51.2
[-- Attachment #4: 0002-WIP-no-submit-AF_UNIX2-alternative-to-connectat-and-.patch --]
[-- Type: text/x-patch; name="0002-WIP-no-submit-AF_UNIX2-alternative-to-connectat-and-.patch", Size: 18642 bytes --]
From d183785e63d0b224de21fc4f7e505eef99b27592 Mon Sep 17 00:00:00 2001
From: John Ericson <John.Ericson@Obsidian.Systems>
Date: Mon, 18 May 2026 12:03:37 -0400
Subject: [PATCH 2/2] [WIP, no-submit] `AF_UNIX2` alternative to `connectat`
and `bindat`
A quick exploratory refactor with an LLM.
The `flags` param is new here. It would allow supporting e.g.
`AT_SYMLINK_NOFOLLOW` and `AT_EMPTY_PATH` in the future.
N.B. The tests have not been run yet, so I don't know whether they are
any good. I have only read and reviewed the refactors themselves, as
they are easy to read.
Signed-off-by: John Ericson <John.Ericson@Obsidian.Systems>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/socket.h | 13 +-
include/uapi/linux/un.h | 7 +
net/unix/af_unix.c | 246 +++++++++++++-----
tools/testing/selftests/net/af_unix/Makefile | 1 +
.../selftests/net/af_unix/unix_connect_at.c | 215 +++++++++++++++
5 files changed, 411 insertions(+), 71 deletions(-)
create mode 100644 tools/testing/selftests/net/af_unix/unix_connect_at.c
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ec4a0a025793..021c12326ae2 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -255,8 +255,17 @@ struct ucred {
#define AF_MCTP 45 /* Management component
* transport protocol
*/
-
-#define AF_MAX 46 /* For now.. */
+#define AF_UNIX2 46 /* AF_UNIX with extended addressing.
+ * Note this is *not* an actual distinct
+ * address family, but just here to
+ * distinguish different initialization
+ * structs.Any internal non-syscall-API
+ * usage should normalize AF_UNIX2 to
+ * AF_UNIX as there is no difference
+ * between sockets initialized the two
+ * different ways. */
+
+#define AF_MAX 47 /* For now.. */
/* Protocol families, same as address families. */
#define PF_UNSPEC AF_UNSPEC
diff --git a/include/uapi/linux/un.h b/include/uapi/linux/un.h
index 0ad59dc8b686..c51a857064c0 100644
--- a/include/uapi/linux/un.h
+++ b/include/uapi/linux/un.h
@@ -11,6 +11,13 @@ struct sockaddr_un {
char sun_path[UNIX_PATH_MAX]; /* pathname */
};
+struct sockaddr_un2 {
+ __kernel_sa_family_t sun2_family; /* AF_UNIX2 */
+ __s32 sun2_dfd; /* directory fd, or AT_FDCWD */
+ const char *sun2_path; /* pointer to null-terminated path */
+ __u32 sun2_flags; /* reserved, must be 0 */
+};
+
#define SIOCUNIXFILE (SIOCPROTOPRIVATE + 0) /* open a socket file with O_PATH */
#endif /* _LINUX_UN_H */
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e690cb88f54e..2835f7c7d893 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1185,42 +1185,20 @@ static int unix_release(struct socket *sock)
return 0;
}
-static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
- int type, int flags)
+static struct sock *unix_find_bsd_path(struct path *path, int type, int flags)
{
struct inode *inode;
- struct path path;
struct sock *sk;
int err;
- unix_mkname_bsd(sunaddr, addr_len);
-
- if (flags & SOCK_COREDUMP) {
- struct path root;
-
- task_lock(&init_task);
- get_fs_root(init_task.fs, &root);
- task_unlock(&init_task);
-
- scoped_with_kernel_creds()
- err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
- LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
- LOOKUP_NO_MAGICLINKS, &path);
- path_put(&root);
- if (err)
- goto fail;
- } else {
- err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
- if (err)
- goto fail;
-
- err = path_permission(&path, MAY_WRITE);
+ if (!(flags & SOCK_COREDUMP)) {
+ err = path_permission(path, MAY_WRITE);
if (err)
goto path_put;
}
err = -ECONNREFUSED;
- inode = d_backing_inode(path.dentry);
+ inode = d_backing_inode(path->dentry);
if (!S_ISSOCK(inode->i_mode))
goto path_put;
@@ -1232,24 +1210,67 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
if (sk->sk_type != type)
goto sock_put;
- err = security_unix_find(&path, sk, flags);
+ err = security_unix_find(path, sk, flags);
if (err)
goto sock_put;
- touch_atime(&path);
+ touch_atime(path);
- path_put(&path);
+ path_put(path);
return sk;
sock_put:
sock_put(sk);
path_put:
- path_put(&path);
-fail:
+ path_put(path);
return ERR_PTR(err);
}
+static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
+ int type, int flags)
+{
+ struct path path;
+ int err;
+
+ unix_mkname_bsd(sunaddr, addr_len);
+
+ if (flags & SOCK_COREDUMP) {
+ struct path root;
+
+ task_lock(&init_task);
+ get_fs_root(init_task.fs, &root);
+ task_unlock(&init_task);
+
+ scoped_with_kernel_creds()
+ err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
+ LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
+ LOOKUP_NO_MAGICLINKS, &path);
+ path_put(&root);
+ } else {
+ err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
+ }
+
+ if (err)
+ return ERR_PTR(err);
+
+ return unix_find_bsd_path(&path, type, flags);
+}
+
+static struct sock *unix_find_bsd2(int dfd, const char __user *user_path,
+ int type)
+{
+ struct path path;
+ int err;
+
+ err = user_path_at(dfd, user_path, LOOKUP_FOLLOW, &path);
+
+ if (err)
+ return ERR_PTR(err);
+
+ return unix_find_bsd_path(&path, type, 0);
+}
+
static struct sock *unix_find_abstract(struct net *net,
struct sockaddr_un *sunaddr,
int addr_len, int type)
@@ -1344,8 +1365,8 @@ out: mutex_unlock(&u->bindlock);
return err;
}
-static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
- int addr_len)
+static int unix_bind_bsd_create(struct sock *sk, struct unix_address *addr,
+ struct dentry *dentry, struct path *parent)
{
umode_t mode = S_IFSOCK |
(SOCK_INODE(sk->sk_socket)->i_mode & ~current_umask());
@@ -1353,34 +1374,15 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
unsigned int new_hash, old_hash;
struct net *net = sock_net(sk);
struct mnt_idmap *idmap;
- struct unix_address *addr;
- struct dentry *dentry;
- struct path parent;
int err;
- addr_len = unix_mkname_bsd(sunaddr, addr_len);
- addr = unix_create_addr(sunaddr->sun_path,
- addr_len - offsetof(struct sockaddr_un, sun_path));
- if (!addr)
- return -ENOMEM;
-
- /*
- * Get the parent directory, calculate the hash for last
- * component.
- */
- dentry = start_creating_path(AT_FDCWD, addr->name, &parent, 0);
- if (IS_ERR(dentry)) {
- err = PTR_ERR(dentry);
- goto out;
- }
-
/*
* All right, let's create it.
*/
- idmap = mnt_idmap(parent.mnt);
- err = security_path_mknod(&parent, dentry, mode, 0);
+ idmap = mnt_idmap(parent->mnt);
+ err = security_path_mknod(parent, dentry, mode, 0);
if (!err)
- err = vfs_mknod(idmap, d_inode(parent.dentry), dentry, mode, 0, NULL);
+ err = vfs_mknod(idmap, d_inode(parent->dentry), dentry, mode, 0, NULL);
if (err)
goto out_path;
err = mutex_lock_interruptible(&u->bindlock);
@@ -1392,13 +1394,13 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
old_hash = sk->sk_hash;
new_hash = unix_bsd_hash(d_backing_inode(dentry));
unix_table_double_lock(net, old_hash, new_hash);
- u->path.mnt = mntget(parent.mnt);
+ u->path.mnt = mntget(parent->mnt);
u->path.dentry = dget(dentry);
__unix_set_addr_hash(net, sk, addr, new_hash);
unix_table_double_unlock(net, old_hash, new_hash);
unix_insert_bsd_socket(sk);
mutex_unlock(&u->bindlock);
- end_creating_path(&parent, dentry);
+ end_creating_path(parent, dentry);
return 0;
out_unlock:
@@ -1406,13 +1408,76 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
err = -EINVAL;
out_unlink:
/* failed after successful mknod? unlink what we'd created... */
- vfs_unlink(idmap, d_inode(parent.dentry), dentry, NULL);
+ vfs_unlink(idmap, d_inode(parent->dentry), dentry, NULL);
out_path:
- end_creating_path(&parent, dentry);
-out:
- unix_release_addr(addr);
+ end_creating_path(parent, dentry);
return err == -EEXIST ? -EADDRINUSE : err;
}
+
+static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
+ int addr_len)
+{
+ struct unix_address *addr;
+ struct dentry *dentry;
+ struct path parent;
+ int err;
+
+ addr_len = unix_mkname_bsd(sunaddr, addr_len);
+ addr = unix_create_addr(sunaddr->sun_path,
+ addr_len - offsetof(struct sockaddr_un, sun_path));
+ if (!addr)
+ return -ENOMEM;
+
+ dentry = start_creating_path(AT_FDCWD, addr->name, &parent, 0);
+ if (IS_ERR(dentry)) {
+ unix_release_addr(addr);
+ return PTR_ERR(dentry);
+ }
+
+ err = unix_bind_bsd_create(sk, addr, dentry, &parent);
+ if (err)
+ unix_release_addr(addr);
+ return err;
+}
+
+static int unix_bind_bsd2(struct sock *sk, int dfd,
+ const char __user *user_path)
+{
+ struct sockaddr_un sunaddr;
+ struct unix_address *addr;
+ struct dentry *dentry;
+ struct path parent;
+ int addr_len, err;
+
+ /* Copy path for internal address storage */
+ sunaddr.sun_family = AF_UNIX;
+ err = strncpy_from_user(sunaddr.sun_path, user_path,
+ sizeof(sunaddr.sun_path));
+ if (err < 0)
+ return err;
+ if (err == sizeof(sunaddr.sun_path))
+ return -ENAMETOOLONG;
+ addr_len = unix_mkname_bsd(&sunaddr,
+ offsetof(struct sockaddr_un, sun_path) + err + 1);
+
+ addr = unix_create_addr(sunaddr.sun_path,
+ addr_len - offsetof(struct sockaddr_un, sun_path));
+ if (!addr)
+ return -ENOMEM;
+
+ /* Use the original user pointer for VFS path resolution */
+ dentry = start_creating_user_path(dfd, user_path, &parent, 0);
+ if (IS_ERR(dentry)) {
+ unix_release_addr(addr);
+ return PTR_ERR(dentry);
+ }
+
+ err = unix_bind_bsd_create(sk, addr, dentry, &parent);
+ if (err)
+ unix_release_addr(addr);
+ return err;
+}
+
static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
int addr_len)
{
@@ -1464,6 +1529,17 @@ static int unix_bind(struct socket *sock, struct sockaddr_unsized *uaddr, int ad
struct sock *sk = sock->sk;
int err;
+ if (uaddr->sa_family == AF_UNIX2) {
+ struct sockaddr_un2 *un2 = (struct sockaddr_un2 *)uaddr;
+
+ if (addr_len != sizeof(*un2))
+ return -EINVAL;
+ if (un2->sun2_flags)
+ return -EINVAL;
+
+ return unix_bind_bsd2(sk, un2->sun2_dfd, un2->sun2_path);
+ }
+
if (addr_len == offsetof(struct sockaddr_un, sun_path) &&
sunaddr->sun_family == AF_UNIX)
return unix_autobind(sk);
@@ -1508,6 +1584,7 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr_unsized *addr
int alen, int flags)
{
struct sockaddr_un *sunaddr = (struct sockaddr_un *)addr;
+ struct sockaddr_un2 *un2 = NULL;
struct sock *sk = sock->sk;
struct sock *other;
int err;
@@ -1516,10 +1593,21 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr_unsized *addr
if (alen < offsetofend(struct sockaddr, sa_family))
goto out;
- if (addr->sa_family != AF_UNSPEC) {
- err = unix_validate_addr(sunaddr, alen);
- if (err)
+ if (addr->sa_family == AF_UNIX2) {
+ un2 = (struct sockaddr_un2 *)addr;
+
+ if (alen != sizeof(*un2))
+ goto out;
+ if (un2->sun2_flags)
goto out;
+ }
+
+ if (addr->sa_family != AF_UNSPEC) {
+ if (!un2) {
+ err = unix_validate_addr(sunaddr, alen);
+ if (err)
+ goto out;
+ }
err = BPF_CGROUP_RUN_PROG_UNIX_CONNECT_LOCK(sk, addr, &alen);
if (err)
@@ -1532,7 +1620,12 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr_unsized *addr
}
restart:
- other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type, 0);
+ if (un2)
+ other = unix_find_bsd2(un2->sun2_dfd, un2->sun2_path,
+ sock->type);
+ else
+ other = unix_find_other(sock_net(sk), sunaddr, alen,
+ sock->type, 0);
if (IS_ERR(other)) {
err = PTR_ERR(other);
goto out;
@@ -1629,6 +1722,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uad
struct sockaddr_un *sunaddr = (struct sockaddr_un *)uaddr;
struct sock *sk = sock->sk, *newsk = NULL, *other = NULL;
struct unix_sock *u = unix_sk(sk), *newu, *otheru;
+ struct sockaddr_un2 *un2 = NULL;
struct unix_peercred peercred = {};
struct net *net = sock_net(sk);
struct sk_buff *skb = NULL;
@@ -1636,9 +1730,18 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uad
long timeo;
int err;
- err = unix_validate_addr(sunaddr, addr_len);
- if (err)
- goto out;
+ if (uaddr->sa_family == AF_UNIX2) {
+ un2 = (struct sockaddr_un2 *)uaddr;
+
+ if (addr_len != sizeof(*un2))
+ return -EINVAL;
+ if (un2->sun2_flags)
+ return -EINVAL;
+ } else {
+ err = unix_validate_addr(sunaddr, addr_len);
+ if (err)
+ goto out;
+ }
err = BPF_CGROUP_RUN_PROG_UNIX_CONNECT_LOCK(sk, uaddr, &addr_len);
if (err)
@@ -1672,7 +1775,12 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uad
restart:
/* Find listening sock. */
- other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags);
+ if (un2)
+ other = unix_find_bsd2(un2->sun2_dfd, un2->sun2_path,
+ sk->sk_type);
+ else
+ other = unix_find_other(net, sunaddr, addr_len, sk->sk_type,
+ flags);
if (IS_ERR(other)) {
err = PTR_ERR(other);
goto out_free_skb;
diff --git a/tools/testing/selftests/net/af_unix/Makefile b/tools/testing/selftests/net/af_unix/Makefile
index 4c0375e28bbe..13bb7a339e0e 100644
--- a/tools/testing/selftests/net/af_unix/Makefile
+++ b/tools/testing/selftests/net/af_unix/Makefile
@@ -13,6 +13,7 @@ TEST_GEN_PROGS := \
scm_rights \
so_peek_off \
unix_connect \
+ unix_connect_at \
unix_connreset \
# end of TEST_GEN_PROGS
diff --git a/tools/testing/selftests/net/af_unix/unix_connect_at.c b/tools/testing/selftests/net/af_unix/unix_connect_at.c
new file mode 100644
index 000000000000..54eada7780f0
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/unix_connect_at.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/un.h>
+#include <unistd.h>
+
+#include "kselftest_harness.h"
+
+#ifndef AF_UNIX2
+#define AF_UNIX2 46
+#endif
+
+struct sockaddr_un2 {
+ sa_family_t sun2_family;
+ int sun2_dfd;
+ const char *sun2_path;
+ unsigned int sun2_flags;
+};
+
+FIXTURE(af_unix2)
+{
+ int server, client;
+ int dirfd;
+ char tmpdir[64];
+};
+
+FIXTURE_VARIANT(af_unix2)
+{
+ int type;
+};
+
+FIXTURE_VARIANT_ADD(af_unix2, stream)
+{
+ .type = SOCK_STREAM,
+};
+
+FIXTURE_VARIANT_ADD(af_unix2, dgram)
+{
+ .type = SOCK_DGRAM,
+};
+
+FIXTURE_VARIANT_ADD(af_unix2, seqpacket)
+{
+ .type = SOCK_SEQPACKET,
+};
+
+FIXTURE_SETUP(af_unix2)
+{
+ snprintf(self->tmpdir, sizeof(self->tmpdir),
+ "/tmp/af_unix2_test.%d", getpid());
+ ASSERT_EQ(0, mkdir(self->tmpdir, 0700));
+ self->dirfd = open(self->tmpdir, O_RDONLY | O_DIRECTORY);
+ ASSERT_LE(0, self->dirfd);
+ self->server = -1;
+ self->client = -1;
+}
+
+FIXTURE_TEARDOWN(af_unix2)
+{
+ if (self->client >= 0)
+ close(self->client);
+ if (self->server >= 0)
+ close(self->server);
+ unlinkat(self->dirfd, "sock", 0);
+ close(self->dirfd);
+ rmdir(self->tmpdir);
+}
+
+/* bind with AF_UNIX2, dirfd, and relative path works */
+TEST_F(af_unix2, bind_relative)
+{
+ struct sockaddr_un2 addr = {
+ .sun2_family = AF_UNIX2,
+ .sun2_dfd = self->dirfd,
+ .sun2_path = "sock",
+ .sun2_flags = 0,
+ };
+ struct stat st;
+ char path[128];
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(0, bind(self->server, (struct sockaddr *)&addr,
+ sizeof(addr)));
+
+ /* Verify socket file was created in the right directory */
+ snprintf(path, sizeof(path), "%s/sock", self->tmpdir);
+ ASSERT_EQ(0, stat(path, &st));
+ ASSERT_TRUE(S_ISSOCK(st.st_mode));
+}
+
+/* connect with AF_UNIX2 and dirfd works */
+TEST_F(af_unix2, connect_relative)
+{
+ struct sockaddr_un2 addr = {
+ .sun2_family = AF_UNIX2,
+ .sun2_dfd = self->dirfd,
+ .sun2_path = "sock",
+ .sun2_flags = 0,
+ };
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(0, bind(self->server, (struct sockaddr *)&addr,
+ sizeof(addr)));
+
+ if (variant->type == SOCK_STREAM || variant->type == SOCK_SEQPACKET) {
+ ASSERT_EQ(0, listen(self->server, 1));
+ }
+
+ self->client = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->client);
+
+ ASSERT_EQ(0, connect(self->client, (struct sockaddr *)&addr,
+ sizeof(addr)));
+}
+
+/* AT_FDCWD with AF_UNIX2 behaves like regular bind/connect */
+TEST_F(af_unix2, at_fdcwd)
+{
+ char path[128];
+ struct sockaddr_un2 addr = {
+ .sun2_family = AF_UNIX2,
+ .sun2_dfd = AT_FDCWD,
+ .sun2_flags = 0,
+ };
+
+ snprintf(path, sizeof(path), "%s/sock", self->tmpdir);
+ addr.sun2_path = path;
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(0, bind(self->server, (struct sockaddr *)&addr,
+ sizeof(addr)));
+
+ if (variant->type == SOCK_STREAM || variant->type == SOCK_SEQPACKET) {
+ ASSERT_EQ(0, listen(self->server, 1));
+ }
+
+ self->client = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->client);
+
+ ASSERT_EQ(0, connect(self->client, (struct sockaddr *)&addr,
+ sizeof(addr)));
+}
+
+/* Non-zero flags are rejected */
+TEST_F(af_unix2, bad_flags)
+{
+ struct sockaddr_un2 addr = {
+ .sun2_family = AF_UNIX2,
+ .sun2_dfd = self->dirfd,
+ .sun2_path = "sock",
+ .sun2_flags = 1,
+ };
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(-1, bind(self->server, (struct sockaddr *)&addr,
+ sizeof(addr)));
+ ASSERT_EQ(EINVAL, errno);
+
+ ASSERT_EQ(-1, connect(self->server, (struct sockaddr *)&addr,
+ sizeof(addr)));
+ ASSERT_EQ(EINVAL, errno);
+}
+
+/* Wrong addrlen is rejected */
+TEST_F(af_unix2, bad_addrlen)
+{
+ struct sockaddr_un2 addr = {
+ .sun2_family = AF_UNIX2,
+ .sun2_dfd = self->dirfd,
+ .sun2_path = "sock",
+ .sun2_flags = 0,
+ };
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(-1, bind(self->server, (struct sockaddr *)&addr,
+ sizeof(addr) - 1));
+ ASSERT_EQ(EINVAL, errno);
+}
+
+/* bind with a bad dirfd fails */
+TEST_F(af_unix2, bad_dirfd)
+{
+ struct sockaddr_un2 addr = {
+ .sun2_family = AF_UNIX2,
+ .sun2_dfd = 9999,
+ .sun2_path = "sock",
+ .sun2_flags = 0,
+ };
+
+ self->server = socket(AF_UNIX, variant->type, 0);
+ ASSERT_LE(0, self->server);
+
+ ASSERT_EQ(-1, bind(self->server, (struct sockaddr *)&addr,
+ sizeof(addr)));
+ ASSERT_EQ(EBADF, errno);
+}
+
+TEST_HARNESS_MAIN
--
2.51.2
* Re: [RFC] connectat()/bindat() or an alternative design
2026-05-18 19:09 [RFC] connectat()/bindat() or an alternative design John Ericson
@ 2026-06-08 19:45 ` Cong Wang
2026-06-11 2:08 ` John Ericson
0 siblings, 1 reply; 4+ messages in thread
From: Cong Wang @ 2026-06-08 19:45 UTC (permalink / raw)
To: John Ericson; +Cc: network dev
Hi John,
On Mon, May 18, 2026 at 03:09:43PM -0400, John Ericson wrote:
> Hello, I am interested in seeing something like the BSD's connectat/bindat in
> Linux.
>
> The deficiencies with traditional `bind`/`connect` for unix sockets are
> well-known. The advantages of openat2 over openat over open are also well-known, so I
> won't waste anyone's time restating either.
>
> https://lore.kernel.org/netdev/4FCF171B.8000207@parallels.com/
> https://lore.kernel.org/all/CAEnbY+co6YLXANfeMnfBOBs8Ba_Sbdqz0Ahm8RzAhRF7MrxL4Q@mail.gmail.com/
>
> Here are two previous times this was raised, unfortunately with no replies.
> Hoping this third time is the charm!
>
> To hopefully give the discussion a bit more concreteness, I have taken a stab at
> two refactors (with an LLM, full disclosure) to get a sense of what the
> approaches look like. (To be clear, these are not patches I am trying to
> formally submit, these are just sketches to guide the conversation. I haven't
> built or tested them, just read them.)
>
Thanks for bringing this up.
I have no doubt connectat()/bindat() helps close TOCTOU races for Unix
sockets. However, it would be nicer to describe your use case here,
especially what the problems are without it. This would do more to
justify your proposal than just getting aligned with openat() or
BSD.
Hope this helps.
Regards,
Cong
* Re: [RFC] connectat()/bindat() or an alternative design
2026-06-08 19:45 ` Cong Wang
@ 2026-06-11 2:08 ` John Ericson
2026-06-12 18:50 ` Cong Wang
0 siblings, 1 reply; 4+ messages in thread
From: John Ericson @ 2026-06-11 2:08 UTC (permalink / raw)
To: Cong Wang; +Cc: network dev, Li Chen
[-- Attachment #1: Type: text/plain, Size: 8360 bytes --]
Hi Cong,
On Mon, Jun 8, 2026, at 3:45 PM, Cong Wang wrote:
> Hi John,
>
> [...]
>
> Thanks for bringing this up.
Sure, thanks for replying to me!
> I have no doubt connectat()/bindat() helps close TOCTOU races for Unix
> sockets. However, it would be nicer to describe your use case here,
> especially what the problems are without it. This would do more to
> justify your proposal than just getting aligned with openat() or
> BSD.
>
> Hope this helps.
>
> Regards,
> Cong
Yeah, happy to talk about that. Hope this is not too long a reply!
First, for some background context, I am a developer of the Nix package
manager. And this, plus my own personal taste, always has me thinking
about ways we can run processes with fewer privileges. The
no-ambient-authority capsicum/cloudabi/wasi/whatever dream has lived in
my head rent-free for many years :). Now these days, with LLMs, it feels
like these nice-to-have yak shaves of mine are finally worth dusting off
and striking off the bucket list.
Also in recent months, we Nix developers have been putting a bunch of
work into using more `openat2` and friends, and I have no doubt that we
will continue down this path (even on Windows!). We aim to be an
exemplar program for following the "always work relative to a file
descriptor" discipline. It's good for security, but also makes for code
that --- I believe --- is just more elegant and nicer to read.
----
Nearer term use case: slightly less ugly long path socket opening in
Nix:
If you look at [1] you can see a PR I've asked my coworker to draft to
improve binding and connecting code to cope with longer file paths,
something which does come up in practice when we are running multiple
tests with multiple daemons in parallel.
Now, I think it is safe to say that this code was already quite complex,
and in this patch only gets *more* complex. The current interfaces make
supporting longer paths quite annoying. (Though, once we remove the
`open` and switch to an `*at`-style interface in the wrapper (if macOS
lets us), it will get less bad.)
So the first use case would be getting something nicer than the
`/proc/self/fd/<N>` dance the linked code falls back to. It is good that
`/proc/self/fd/<N>` exists for legacy code, but it is an unergonomic way
to do file-descriptor-relative paths, and should be a fallback, never
the first choice. A real fd parameter along with a regular path pointer
would buy two concrete wins:
1. A clean, separate file descriptor parameter, the way `openat` has one
--- rather than assembling a `/proc` path by hand.
2. Normal `PATH_MAX` room for the real pathname, rather than cramming
`/proc/self/fd/<N>` (plus any residual path after it) into the small
`sun_path` field of `struct sockaddr_un`.
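
To make the second point concrete, here is a minimal sketch of the address-building step in the `/proc/self/fd/<N>` fallback. The helper name `proc_fd_sockaddr` is made up for illustration and is not taken from the Nix PR; the only hard fact it relies on is that `sun_path` is 108 bytes on Linux, terminating NUL included.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper: build "/proc/self/fd/<dirfd>/<rest>" for a
 * traditional bind()/connect(). Returns 0 on success, or -1 with
 * errno = ENAMETOOLONG when the combined path does not fit in
 * sun_path.
 */
static int proc_fd_sockaddr(int dirfd, const char *rest,
			    struct sockaddr_un *out)
{
	int n;

	assert(rest != NULL && out != NULL);

	memset(out, 0, sizeof(*out));
	out->sun_family = AF_UNIX;
	/* The "/proc/self/fd/<N>/" prefix eats into the same small buffer. */
	n = snprintf(out->sun_path, sizeof(out->sun_path),
		     "/proc/self/fd/%d/%s", dirfd, rest);
	if (n < 0 || (size_t)n >= sizeof(out->sun_path)) {
		errno = ENAMETOOLONG;
		return -1;
	}
	return 0;
}
```

With a separate fd argument and a plain path pointer, both the formatting step and the length check against `sun_path` would go away; the path would only be bounded by the usual VFS limits.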
----
Longer term use case: anonymous listening sockets, avoiding advertising
sockets to potential clients using ambient authority mechanisms
altogether:
Some more background: I think this whole business of listening
unix sockets necessarily living in the file system is a bit silly, since
there is nothing to put on disk --- it's just a mechanism to communicate
to clients where they should connect. Now ostensibly, Linux agrees ---
that is why Linux's *abstract* Unix domain sockets were created. But I
really don't like this because we have just replaced one ambient
authority contraption (the root filesystem) with another (the abstract
socket name space in the network namespace). The problems with ambient
authority remain all the same (and indeed, our experience with Nix has
been that network namespace unsharing when you do want to do some
outside world network access is much more work than filesystem namespace
unsharing).
What I would really like to do is go further than what I proposed, and
separate the binding of a unix socket from the placing in the file
system.
Today, with only existing UAPIs, the closest you can get is a scratch
path you pin with `O_PATH` and immediately unlink:
    /* server */
    int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    struct sockaddr_un a = { .sun_family = AF_UNIX };
    /* placeholder name; real code must pick a unique scratch path */
    strcpy(a.sun_path, "/tmp/scratchXXXXXX");
    bind(lfd, (struct sockaddr *)&a, sizeof a);
    int addrfd = open(a.sun_path, O_PATH | O_CLOEXEC); /* pin the socket inode */
    unlink(a.sun_path);                                /* nameless now */
    listen(lfd, 64);

    /* client, handed `addrfd` -- but still has to *name* it, via /proc magic */
    int cfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    struct sockaddr_un c = { .sun_family = AF_UNIX };
    snprintf(c.sun_path, sizeof c.sun_path, "/proc/self/fd/%d", addrfd);
    connect(cfd, (struct sockaddr *)&c, sizeof c);
So even though I hold the socket by descriptor, I still have to route
through a pathname (`/proc/self/fd/...`) to reach it, and I still have to
create and clean up the `/tmp/scratchXXXXXX` scratch path properly.
What I'd actually want is to sidestep all those nuisances entirely.
The important piece is a `bind` variation: like binding an abstract unix
socket, except that it publishes no abstract socket name, so the *only*
way to connect to the socket is to be given an fd referring to it.
A matching `connect` variation is more of a nice-to-have: it lets a
client connect straight through that fd, rather than having to name it
via `/proc/self/fd` as above.
Put together:
    /* server */
    int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    /* proposed: no filesystem or abstract name; flags reserved for the future */
    int addrfd = bind_anon(lfd, 0);
    listen(lfd, 64);

    /* client, handed `addrfd` -- connect straight to the descriptor */
    connectat(addrfd, cfd, NULL, 0, AT_EMPTY_PATH); /* proposed */
I would use this *a lot*! First of all, in our testing code, I would use
this, and not even bother (on Linux at least) putting the test daemon
socket on a (probably quite long) path; I would just rig up the test
harness to pass the fd to the client process with an environment
variable (local not global naming!) indicating to the process which file
descriptor it should connect to.
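
As a rough sketch of that hand-off, assuming nothing beyond existing libc calls: the harness clears `FD_CLOEXEC` and publishes the descriptor number, and the client parses and validates it. The helper names and the variable name `TEST_DAEMON_ADDR_FD` are invented for illustration; neither Nix nor systemd defines them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Harness side: let `fd` survive exec() and publish its number in `name`. */
static int export_fd(const char *name, int fd)
{
	char buf[16];
	int flags = fcntl(fd, F_GETFD);

	assert(name != NULL);
	if (flags < 0)
		return -1;
	if (fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0)
		return -1;
	snprintf(buf, sizeof(buf), "%d", fd);
	return setenv(name, buf, 1);
}

/*
 * Client side: read the descriptor number from `name` and check that
 * something is actually open there. Returns the fd, or -1 with errno
 * set (ENOENT: variable missing, EINVAL: not a number, EBADF: not open).
 */
static int fd_from_env(const char *name)
{
	const char *s = getenv(name);
	char *end;
	long v;

	if (!s || !*s) {
		errno = ENOENT;
		return -1;
	}
	errno = 0;
	v = strtol(s, &end, 10);
	if (errno || *end != '\0' || v < 0 || v > INT_MAX) {
		errno = EINVAL;
		return -1;
	}
	if (fcntl((int)v, F_GETFD) < 0)
		return -1;	/* fcntl() has set errno to EBADF */
	return (int)v;
}
```

The variable only names a slot in the child's own descriptor table, so it carries no authority by itself; a process that was not handed the descriptor gains nothing from knowing the number.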
If that sounds vaguely like systemd socket activation, yes it should.
Socket activating *servers*, as we do today, is great, but I would also
modify my init system to pass these listening sockets to *client*
services. At that point, servers should ditch any sort of `getsockopt`
authentication (which they are likely to implement incorrectly or in an
ad-hoc manner), and instead rely on the init system to make sure only
services/users which are authorized to connect to a given server have
been given its listening socket file descriptor.
----
Misc notes:
[Note 1]: I didn't specify what `bind_anon` should do for `getpeername`
but frankly, I don't really care. `getpeername` already doesn't identify
unix sockets uniquely, since one can bind using relative paths.
[Note 2]: Insofar as we are designing new interfaces, we might ask
whether the division of labor across 3 system calls --- `socket`,
`bind`/`bind_anon`, and `listen` --- is really carrying its weight, but
this is orthogonal tech debt.
[Note 3]: As a bonus, `bind_anon` subsumes the traditional pathname
`bind`, with a nice separation of concerns: bind first, name later (if
ever).
    /* server, bonus before listening */
    linkat(addrfd, "", AT_FDCWD, "/run/myservice.sock", AT_EMPTY_PATH);
This needs `bind_anon` to create the socket `O_TMPFILE`-style --- `nlink
== 0` but materializable by `linkat`, reusing the existing
`may_linkat()` carve-out. (I checked: today you *can* `linkat` a bound
socket that still has a name, but not once it has been unlinked to
namelessness --- which is exactly the anonymous case --- so this really
does need the new bind.)
[Note 4]: If you look at [2] (another example of one of my old dreams
perhaps finally coming true, this time better process spawning), you
will see I mention that I would like null namespaces to ratchet down
process privileges even further than we can today, and also have a nicer
default state for a new process creation UAPI. In the case of a null
mount namespace / null root fs, `/proc/self/fd/<M>` would no longer
work, but explicit file-descriptor-relative APIs would.
Finally, I've attached a little test program I used to double-check some
of my points, in case that is useful to anyone.
[1]: https://github.com/NixOS/nix/pull/15867
[2]: https://lore.kernel.org/all/48594f3a-2ae9-4e1c-a575-ae54a6e1536d@app.fastmail.com/
[-- Attachment #2: sockfdtest.c --]
[-- Type: text/x-csrc; name="sockfdtest.c", Size: 3578 bytes --]
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/un.h>
static int make_listener(const char *path)
{
int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
if (lfd < 0) { perror("socket lfd"); return -1; }
struct sockaddr_un a = { .sun_family = AF_UNIX };
snprintf(a.sun_path, sizeof a.sun_path, "%s", path);
unlink(path);
if (bind(lfd, (struct sockaddr *)&a, sizeof a) < 0) { perror("bind"); return -1; }
if (listen(lfd, 64) < 0) { perror("listen"); return -1; }
return lfd;
}
/* connect to a unix socket by pathname; returns 0 on success, -1 with errno set */
static int connect_path(const char *path)
{
int cfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
if (cfd < 0) return -1;
struct sockaddr_un c = { .sun_family = AF_UNIX };
snprintf(c.sun_path, sizeof c.sun_path, "%s", path);
int r = connect(cfd, (struct sockaddr *)&c, sizeof c);
int saved = errno; /* keep connect()'s errno for the caller's strerror() */
close(cfd);
errno = saved;
return r;
}
static void try_connect(const char *label, int targetfd, int lfd)
{
char p[64];
snprintf(p, sizeof p, "/proc/self/fd/%d", targetfd);
if (connect_path(p) < 0) {
printf("%-30s via %-18s -> FAILED: %s\n", label, p, strerror(errno));
} else {
int s = accept(lfd, NULL, NULL);
printf("%-30s via %-18s -> SUCCEEDED (accept %s)\n",
label, p, s >= 0 ? "ok" : "FAILED");
if (s >= 0) close(s);
}
}
/* linkat the socket referred to by (olddirfd, oldpath, flags) to newpath,
* then prove it by connecting through newpath. */
static void try_link(const char *label, int olddirfd, const char *oldpath,
int flags, const char *newpath)
{
unlink(newpath);
if (linkat(olddirfd, oldpath, AT_FDCWD, newpath, flags) < 0) {
printf("%-30s -> linkat FAILED: %s\n", label, strerror(errno));
return;
}
if (connect_path(newpath) < 0)
printf("%-30s -> linked, but connect FAILED: %s\n", label, strerror(errno));
else
printf("%-30s -> linked AND connectable\n", label);
unlink(newpath);
}
int main(void)
{
char pa[64], pb[64], pc[64], pl[80];
snprintf(pa, sizeof pa, "/tmp/sockfdtest.a.%d", getpid());
snprintf(pb, sizeof pb, "/tmp/sockfdtest.b.%d", getpid());
snprintf(pc, sizeof pc, "/tmp/sockfdtest.c.%d", getpid());
snprintf(pl, sizeof pl, "/tmp/sockfdtest.link.%d", getpid());
/* A: pin the bind path's inode with O_PATH, unlink, connect via the pin fd */
int lfd_a = make_listener(pa);
int pin = open(pa, O_PATH | O_CLOEXEC);
if (pin < 0) perror("open O_PATH");
unlink(pa);
try_connect("A: O_PATH pin", pin, lfd_a);
/* B: skip the pin -- connect via the listening socket fd itself */
int lfd_b = make_listener(pb);
/* int pin_b = open(pb, O_PATH | O_CLOEXEC); */ /* <-- skipped on purpose */
try_connect("B: listen fd direct", lfd_b, lfd_b);
unlink(pb);
/* C..E: can we *materialize* a bound socket into the fs via link/linkat? */
int lfd_c = make_listener(pc); /* pc: bound socket file, nlink=1 */
int pin_c = open(pc, O_PATH | O_CLOEXEC); /* fd to that inode */
(void)lfd_c; /* kept open to keep the listener alive */
/* C: plain hardlink of the socket *pathname* (nlink 1 -> 2) */
try_link("C: link(path) hardlink", AT_FDCWD, pc, 0, pl);
/* D: linkat the *fd* via AT_EMPTY_PATH, inode still has a name (nlink=1) */
try_link("D: linkat fd AT_EMPTY_PATH", pin_c, "", AT_EMPTY_PATH, pl);
/* E: now make it nameless (nlink=0, sock still bound), then linkat the fd
* -- this is the O_TMPFILE-style "name an anonymous inode" move. */
unlink(pc);
try_link("E: linkat fd, nlink==0", pin_c, "", AT_EMPTY_PATH, pl);
return 0;
}
* Re: [RFC] connectat()/bindat() or an alternative design
2026-06-11 2:08 ` John Ericson
@ 2026-06-12 18:50 ` Cong Wang
0 siblings, 0 replies; 4+ messages in thread
From: Cong Wang @ 2026-06-12 18:50 UTC (permalink / raw)
To: John Ericson; +Cc: network dev, Li Chen
On Wed, Jun 10, 2026 at 10:08:57PM -0400, John Ericson wrote:
> Hi Cong,
>
> On Mon, Jun 8, 2026, at 3:45 PM, Cong Wang wrote:
> > Hi John,
> >
> > [...]
> >
> > Thanks for bringing this up.
>
> Sure, thanks for replying to me!
>
> > I have no doubt connectat()/bindat() helps close TOCTOU races for Unix
> > sockets. However, it would be nicer to describe your use case here,
> > especially what the problems are without it. This would do more to
> > justify your proposal than just getting aligned with openat() or
> > BSD.
> >
> > Hope this helps.
> >
> > Regards,
> > Cong
>
> Yeah, happy to talk about that. Hope this is not too long a reply!
>
> First, for some background context, I am a developer of the Nix package
> manager. And this, plus my own personal taste, always has me thinking
> about ways we can run processes with fewer privileges. The
> no-ambient-authority capsicum/cloudabi/wasi/whatever dream has lived in
> my head rent-free for many years :). Now these days, with LLMs, it feels
> like these nice-to-have yak shaves of mine are finally worth dusting off
> and striking off the bucket list.
>
> Also in recent months, we Nix developers have been putting a bunch of
> work into using more `openat2` and friends, and I have no doubt that we
> will continue down this path (even on Windows!). We aim to be an
> exemplar program for following the "always work relative to a file
> descriptor" discipline. It's good for security, but also makes for code
> that --- I believe --- is just more elegant and nicer to read.
>
> ----
>
> Nearer term use case: slightly less ugly long path socket opening in
> Nix:
"Nix needs it" is a much better justification than "BSD already has it".
:) So please add this to your patch description/cover letter.
>
> If you look at [1] you can see a PR I've asked my coworker to draft to
> improve binding and connecting code to cope with longer file paths,
> something which does come up in practice when we are running multiple
> tests with multiple daemons in parallel.
>
> Now, I think it is safe to say that this code was already quite complex,
> and in this patch only gets *more* complex. The current interfaces make
> supporting longer paths quite annoying. (Though, once we remove the
> `open` and switch to an `*at`-style interface in the wrapper (if macOS
> lets us), it will get less bad.)
>
> So the first use case would be getting something nicer than the
> `/proc/self/fd/<N>` dance the linked code falls back to. It is good that
> `/proc/self/fd/<N>` exists for legacy code, but it is an unergonomic way
> to do file-descriptor-relative paths, and should be a fallback, never
> the first choice. A real fd parameter along with a regular path pointer
> would buy two concrete wins:
>
> 1. A clean, separate file descriptor parameter, the way `openat` has one
> --- rather than assembling a `/proc` path by hand.
>
> 2. Normal `PATH_MAX` room for the real pathname, rather than cramming
> `/proc/self/fd/<N>` (plus any residual path after it) into the small
> `sun_path` field of `struct sockaddr_un`.
>
> ----
>
> Longer term use case: anonymous listening sockets, avoiding advertising
> sockets to potential clients using ambient authority mechanisms
> altogether:
>
> Some more background: I think this whole business of listening
> unix sockets necessarily living in the file system is a bit silly, since
> there is nothing to put on disk --- it's just a mechanism to communicate
> to clients where they should connect. Now ostensibly, Linux agrees ---
> that is why Linux's *abstract* Unix domain sockets were created. But I
> really don't like this because we have just replaced one ambient
> authority contraption (the root filesystem) with another (the abstract
> socket name space in the network namespace). The problems with ambient
> authority remain all the same (and indeed, our experience with Nix has
> been that network namespace unsharing when you do want to do some
> outside world network access is much more work than filesystem namespace
> unsharing).
Indeed, it would be very hard to change, since it has been baked into the
UDS API probably since day 1.
Just curious: any reason not to use TCP loopback here?
>
> What I would really like to do is go further than what I proposed, and
> separate the binding of a unix socket from the placing in the file
> system.
>
> Today, with only existing UAPIs, the closest you can get is a scratch
> path you pin with `O_PATH` and immediately unlink:
>
> /* server */
> int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> struct sockaddr_un a = { .sun_family = AF_UNIX };
> strcpy(a.sun_path, "/tmp/scratchXXXXXX");
> bind(lfd, (struct sockaddr *)&a, sizeof a);
Any reason not to use an abstract socket? From unix(7):
abstract
an abstract socket address is distinguished (from a
pathname socket) by the fact that sun_path[0] is a null
byte ('\0'). The socket's address in this namespace is
given by the additional bytes in sun_path that are covered
by the specified length of the address structure. (Null
bytes in the name have no special significance.) The name
has no connection with filesystem pathnames. When the
address of an abstract socket is returned, the returned
addrlen is greater than sizeof(sa_family_t) (i.e., greater
than 2), and the name of the socket is contained in the
first (addrlen - sizeof(sa_family_t)) bytes of sun_path.
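
For reference, the addressing rule in that excerpt can be sketched as below. The helper name `abstract_sockaddr` is hypothetical; the sketch only shows how the leading NUL byte and the address length work together, and it assumes the caller passes the returned length to bind() or connect() instead of `sizeof(struct sockaddr_un)`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper: fill `out` with the abstract address `name`
 * (`len` bytes, no NUL terminator needed) and return the addrlen to
 * pass to bind()/connect(), or 0 with errno = ENAMETOOLONG.
 */
static socklen_t abstract_sockaddr(const void *name, size_t len,
				   struct sockaddr_un *out)
{
	assert(name != NULL && out != NULL);

	/* One byte of sun_path is taken by the leading NUL marker. */
	if (len > sizeof(out->sun_path) - 1) {
		errno = ENAMETOOLONG;
		return 0;
	}
	memset(out, 0, sizeof(*out));
	out->sun_family = AF_UNIX;
	/* sun_path[0] stays '\0': that is what marks the address abstract. */
	memcpy(out->sun_path + 1, name, len);
	/* Only the bytes covered by addrlen are part of the name. */
	return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + len);
}
```

Any process in the same network namespace that knows or guesses these bytes can attempt to connect, which is the ambient-naming property the quoted mail objects to.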
> int addrfd = open(a.sun_path, O_PATH | O_CLOEXEC); /* pin the socket inode */
> unlink(a.sun_path); /* nameless now */
> listen(lfd, 64);
>
> /* client, handed `addrfd` -- but still has to *name* it, via /proc magic */
> struct sockaddr_un c = { .sun_family = AF_UNIX };
> sprintf(c.sun_path, "/proc/self/fd/%d", addrfd);
> connect(cfd, (struct sockaddr *)&c, sizeof c);
>
> So even though I hold the socket by descriptor, I still route a pathname
> (`/proc/self/fd/...`) to reach it, and I have to deal with the
> `/tmp/scratchXXXXXX` proper temp file usage.
>
> What I'd actually want is to sidestep all those nuisances entirely.
>
> The important piece is a `bind` variation: like binding an abstract unix
> socket, except that it publishes no abstract socket name, so the *only*
> way to connect to the socket is to be given an fd referring to it.
>
> A matching `connect` variation is more of a nice-to-have: it lets a
> client connect straight through that fd, rather than having to name it
> via `/proc/self/fd` as above.
>
> Put together:
>
> /* server */
> int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> int addrfd = bind_anon(lfd, /*flags, for the future*/0); /* proposed: no filesystem or abstract name */
> listen(lfd, 64);
>
> /* client, handed `addrfd` -- connect straight to the descriptor */
> connectat(addrfd, cfd, NULL, 0, AT_EMPTY_PATH); /* proposed */
>
> I would use this *a lot*! First of all, in our testing code, I would use
> this, and not even bother (on Linux at least) putting the test daemon
> socket on a (probably quite long) path; I would just rig up the test
> harness to pass the fd to the client process with an environment
> variable (local not global naming!) indicating to the process which file
> descriptor it should connect to.
>
> If that sounds vaguely like systemd socket activation, yes it should.
> Socket activating *servers*, as we do today, is great, but I would also
> modify my init system to pass these listening sockets to *client*
> services. At that point, servers should ditch any sort of `getsockopt`
> authentication (which they are likely to implement incorrectly or in an
> ad-hoc manner), and instead rely on the init system to make sure only
> services/users which are authorized to connect to a given server have
> been given its listening socket file descriptor.
>
Thanks,
Cong