From: "John Ericson" <mail@johnericson.me>
To: "Cong Wang" <cwang@multikernel.io>
Cc: "network dev" <netdev@vger.kernel.org>, "Li Chen" <me@linux.beauty>
Subject: Re: [RFC] connectat()/bindat() or an alternative design
Date: Wed, 10 Jun 2026 22:08:57 -0400
Message-ID: <455281ec-3ee1-4f27-989b-c239f0690d8b@app.fastmail.com>
In-Reply-To: <aicbyQIC2qEveNd-@pop-os.localdomain>

[-- Attachment #1: Type: text/plain, Size: 8360 bytes --]

Hi Cong,

On Mon, Jun 8, 2026, at 3:45 PM, Cong Wang wrote:
> Hi John,
>
> [...]
>
> Thanks for bringing this up.

Sure, thanks for replying to me!

> I have no doubt connectat()/bindat() helps closing TOCTOU for Unix
> sockets. However, it would be nicer to describe your use case here,
> especially what the problems are without it. This would help more to
> justify your proposal here than just getting aligned with openat() or
> BSD.
>
> Hope this helps.
>
> Regards,
> Cong

Yeah, happy to talk about that. Hope this is not too long a reply!

First, for some background context, I am a developer of the Nix package
manager. And this, plus my own personal taste, always has me thinking
about ways we can run processes with fewer privileges. The
no-ambient-authority capsicum/cloudabi/wasi/whatever dream has lived in
my head rent-free for many years :). Now these days, with LLMs, it feels
like these nice-to-have yak shaves of mine are finally worth dusting off
and striking off the bucket list.

Also in recent months, we Nix developers have been putting a bunch of
work into using more `openat2` and friends, and I have no doubt that we
will continue down this path (even on Windows!). We aim to be an
exemplar program for following the "always work relative to a file
descriptor" discipline. It's good for security, but also makes for code
that --- I believe --- is just more elegant and nicer to read.

----

Nearer term use case: slightly less ugly long path socket opening in
Nix:

If you look at [1] you can see a PR I've asked my coworker to draft to
improve binding and connecting code to cope with longer file paths,
something which does come up in practice when we are running multiple
tests with multiple daemons in parallel.

Now, I think it is safe to say that this code was already quite complex,
and in this patch only gets *more* complex. The current interfaces make
supporting longer paths quite annoying. (Though, once we remove the
`open` and switch to an `*at`-style interface in the wrapper (if macOS
lets us), it will get less bad.)

So the first use case would be getting something nicer than the
`/proc/self/fd/<N>` dance the linked code falls back to. It is good that
`/proc/self/fd/<N>` exists for legacy code, but it is an unergonomic way
to do file-descriptor-relative paths, and should be a fallback, never
the first choice. A real fd parameter along with a regular path pointer
would buy two concrete wins:

1. A clean, separate file descriptor parameter, the way `openat` has one
   --- rather than assembling a `/proc` path by hand.

2. Normal `PATH_MAX` room for the real pathname, rather than cramming
   `/proc/self/fd/<N>` (plus any residual path after it) into the small
   `sun_path` field of `struct sockaddr_un`.

----

Longer term use case: anonymous listening sockets, avoiding advertising
sockets to potential clients using ambient authority mechanisms
altogether:

Some more background: I think this whole business of listening
unix sockets necessarily living in the file system is a bit silly, since
there is nothing to put on disk --- it's just a mechanism to communicate
to clients where they should connect. Now ostensibly, Linux agrees ---
that is why Linux's *abstract* Unix domain sockets were created. But I
really don't like this because we have just replaced one ambient
authority contraption (the root filesystem) with another (the abstract
socket name space in the network namespace). The problems with ambient
authority remain all the same (and indeed, our experience with Nix has
been that network namespace unsharing when you do want to do some
outside world network access is much more work than filesystem namespace
unsharing).
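
For anyone less familiar with abstract sockets, a minimal sketch of what
"ambient" means here (the helper names are hypothetical, for illustration
only): the client needs nothing but the name, and anything in the same
network namespace that knows or guesses the name can do the same.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill `a` with the abstract address for `name`; returns the address length. */
static socklen_t abstract_addr(struct sockaddr_un *a, const char *name)
{
	size_t n = strlen(name);
	if (n > sizeof a->sun_path - 1)
		n = sizeof a->sun_path - 1;
	memset(a, 0, sizeof *a);
	a->sun_family = AF_UNIX;
	/* sun_path[0] stays '\0': that is what marks the name as abstract. */
	memcpy(a->sun_path + 1, name, n);
	return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);
}

/* Listen on an abstract name; returns the listening fd or -1. */
static int abstract_listen(const char *name)
{
	struct sockaddr_un a;
	socklen_t len = abstract_addr(&a, name);
	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&a, len) < 0 || listen(fd, 8) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Connect knowing only the name; returns the connected fd or -1. */
static int abstract_connect(const char *name)
{
	struct sockaddr_un a;
	socklen_t len = abstract_addr(&a, name);
	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
	if (fd < 0)
		return -1;
	if (connect(fd, (struct sockaddr *)&a, len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Nothing is handed from server to client in this sketch except a string,
which is exactly the property I would like to get away from.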

What I would really like to do is go further than what I proposed, and
separate the binding of a unix socket from the placing in the file
system.

Today, with only existing UAPIs, the closest you can get is a scratch
path you pin with `O_PATH` and immediately unlink:

    /* server */
    int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    struct sockaddr_un a = { .sun_family = AF_UNIX };
    strcpy(a.sun_path, "/tmp/scratchXXXXXX");
    bind(lfd, (struct sockaddr *)&a, sizeof a);
    int addrfd = open(a.sun_path, O_PATH | O_CLOEXEC); /* pin the socket inode */
    unlink(a.sun_path);                                /* nameless now */
    listen(lfd, 64);

    /* client, handed `addrfd` -- but still has to *name* it, via /proc magic */
    int cfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    struct sockaddr_un c = { .sun_family = AF_UNIX };
    snprintf(c.sun_path, sizeof c.sun_path, "/proc/self/fd/%d", addrfd);
    connect(cfd, (struct sockaddr *)&c, sizeof c);

So even though I hold the socket by descriptor, I still have to go through
a pathname (`/proc/self/fd/...`) to reach it, and I still have to create
and clean up the `/tmp/scratchXXXXXX` scratch path the way a proper temp
file requires.

What I'd actually want is to sidestep all those nuisances entirely.

The important piece is a `bind` variation: like binding an abstract unix
socket, except that it publishes no abstract socket name, so the *only*
way to connect to the socket is to be given an fd referring to it.

A matching `connect` variation is more of a nice-to-have: it lets a
client connect straight through that fd, rather than having to name it
via `/proc/self/fd` as above.

Put together:

    /* server */
    int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    int addrfd = bind_anon(lfd, /*flags, for the future*/0);  /* proposed: no filesystem or abstract name */
    listen(lfd, 64);

    /* client, handed `addrfd` -- connect straight to the descriptor */
    connectat(addrfd, cfd, NULL, 0, AT_EMPTY_PATH);   /* proposed */

I would use this *a lot*! First of all, in our testing code, I would use
this, and not even bother (on Linux at least) putting the test daemon
socket on a (probably quite long) path; I would just rig up the test
harness to pass the fd to the client process with an environment
variable (local not global naming!) indicating to the process which file
descriptor it should connect to.

If that sounds vaguely like systemd socket activation, yes it should.
Socket activating *servers*, as we do today, is great, but I would also
modify my init system to pass these listening sockets to *client*
services. At that point, servers should ditch any sort of `getsockopt`
peer-credential authentication such as `SO_PEERCRED` checks (which they
are likely to implement incorrectly or in an
ad-hoc manner), and instead rely on the init system to make sure only
services/users which are authorized to connect to a given server have
been given its listening socket file descriptor.

----

Misc notes:

[Note 1]: I didn't specify what `bind_anon` should do for `getpeername`
but frankly, I don't really care. `getpeername` already doesn't identify
unix sockets uniquely, since one can bind using relative paths.
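
A minimal sketch of that point (`bind_and_report` is a hypothetical
helper for illustration): the address the kernel reports is the string
that was passed to `bind`, so two servers bound to the same relative name
from different working directories report identical addresses.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Bind a fresh unix socket to `path` exactly as given (relative paths stay
 * relative) and copy the name getsockname() reports into `out`.
 * Returns the bound fd, or -1 on failure.
 */
static int bind_and_report(const char *path, char *out, size_t outlen)
{
	struct sockaddr_un a = { .sun_family = AF_UNIX };
	if (outlen == 0 || strlen(path) >= sizeof a.sun_path)
		return -1;
	strcpy(a.sun_path, path);

	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
	if (fd < 0)
		return -1;

	struct sockaddr_un got;
	socklen_t len = sizeof got;
	memset(&got, 0, sizeof got);
	if (bind(fd, (struct sockaddr *)&a, sizeof a) < 0 ||
	    getsockname(fd, (struct sockaddr *)&got, &len) < 0) {
		close(fd);
		return -1;
	}
	/* The reported name is the string given to bind(), not a resolved path. */
	strncpy(out, got.sun_path, outlen - 1);
	out[outlen - 1] = '\0';
	return fd;
}
```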

[Note 2]: Insofar as we are designing new interfaces, we might ask
whether the division of labor across 3 system calls --- `socket`,
`bind`/`bind_anon`, and `listen` --- is really carrying its weight, but
this is orthogonal tech debt.

[Note 3]: As a bonus, `bind_anon` subsumes the traditional pathname
`bind`, with a nice separation of concerns: bind first, name later (if
ever).

    /* server, bonus before listening */
    linkat(addrfd, "", AT_FDCWD, "/run/myservice.sock", AT_EMPTY_PATH);

This needs `bind_anon` to create the socket `O_TMPFILE`-style --- `nlink
== 0` but materializable by `linkat`, reusing the existing
`may_linkat()` carve-out. (I checked: today you *can* `linkat` a bound
socket that still has a name, but not once it has been unlinked to
namelessness --- which is exactly the anonymous case --- so this really
does need the new bind.)

[Note 4]: If you look at [2] (another example of one of my old dreams
perhaps finally coming true, this time better process spawning), you
will see I mention that I would like null namespaces to ratchet down
process privileges even further than we can today, and also have a nicer
default state for a new process creation UAPI. In the case of a null
mount namespace / null root fs, `/proc/self/fd/<N>` would no longer
work, but explicit file-descriptor-relative APIs would.

Finally, I've attached a little test program I used to double-check some
of my points, in case that is useful to anyone.

[1]: https://github.com/NixOS/nix/pull/15867

[2]: https://lore.kernel.org/all/48594f3a-2ae9-4e1c-a575-ae54a6e1536d@app.fastmail.com/

[-- Attachment #2: sockfdtest.c --]
[-- Type: text/x-csrc; name="sockfdtest.c", Size: 3578 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/un.h>

static int make_listener(const char *path)
{
	int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
	if (lfd < 0) { perror("socket lfd"); return -1; }
	struct sockaddr_un a = { .sun_family = AF_UNIX };
	snprintf(a.sun_path, sizeof a.sun_path, "%s", path);
	unlink(path);
	if (bind(lfd, (struct sockaddr *)&a, sizeof a) < 0) { perror("bind"); return -1; }
	if (listen(lfd, 64) < 0) { perror("listen"); return -1; }
	return lfd;
}

/* connect to a unix socket by pathname; returns 0 on success, -1 with errno set */
static int connect_path(const char *path)
{
	int cfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
	if (cfd < 0) return -1;
	struct sockaddr_un c = { .sun_family = AF_UNIX };
	snprintf(c.sun_path, sizeof c.sun_path, "%s", path);
	int r = connect(cfd, (struct sockaddr *)&c, sizeof c);
	int saved = errno;	/* keep connect()'s errno for the caller's strerror() */
	close(cfd);
	errno = saved;
	return r;
}

static void try_connect(const char *label, int targetfd, int lfd)
{
	char p[64];
	snprintf(p, sizeof p, "/proc/self/fd/%d", targetfd);
	if (connect_path(p) < 0) {
		printf("%-30s via %-18s -> FAILED: %s\n", label, p, strerror(errno));
	} else {
		int s = accept(lfd, NULL, NULL);
		printf("%-30s via %-18s -> SUCCEEDED (accept %s)\n",
		       label, p, s >= 0 ? "ok" : "FAILED");
		if (s >= 0) close(s);
	}
}

/* linkat the socket referred to by (olddirfd, oldpath, flags) to newpath,
 * then prove it by connecting through newpath. */
static void try_link(const char *label, int olddirfd, const char *oldpath,
		     int flags, const char *newpath)
{
	unlink(newpath);
	if (linkat(olddirfd, oldpath, AT_FDCWD, newpath, flags) < 0) {
		printf("%-30s -> linkat FAILED: %s\n", label, strerror(errno));
		return;
	}
	if (connect_path(newpath) < 0)
		printf("%-30s -> linked, but connect FAILED: %s\n", label, strerror(errno));
	else
		printf("%-30s -> linked AND connectable\n", label);
	unlink(newpath);
}

int main(void)
{
	char pa[64], pb[64], pc[64], pl[80];
	snprintf(pa, sizeof pa, "/tmp/sockfdtest.a.%d", getpid());
	snprintf(pb, sizeof pb, "/tmp/sockfdtest.b.%d", getpid());
	snprintf(pc, sizeof pc, "/tmp/sockfdtest.c.%d", getpid());
	snprintf(pl, sizeof pl, "/tmp/sockfdtest.link.%d", getpid());

	/* A: pin the bind path's inode with O_PATH, unlink, connect via the pin fd */
	int lfd_a = make_listener(pa);
	int pin = open(pa, O_PATH | O_CLOEXEC);
	if (pin < 0) perror("open O_PATH");
	unlink(pa);
	try_connect("A: O_PATH pin", pin, lfd_a);

	/* B: skip the pin -- connect via the listening socket fd itself */
	int lfd_b = make_listener(pb);
	/* int pin_b = open(pb, O_PATH | O_CLOEXEC); */   /* <-- skipped on purpose */
	try_connect("B: listen fd direct", lfd_b, lfd_b);
	unlink(pb);

	/* C..E: can we *materialize* a bound socket into the fs via link/linkat? */
	int lfd_c = make_listener(pc);            /* pc: bound socket file, nlink=1 */
	int pin_c = open(pc, O_PATH | O_CLOEXEC); /* fd to that inode */
	(void)lfd_c;                              /* kept open to keep the listener alive */

	/* C: plain hardlink of the socket *pathname* (nlink 1 -> 2) */
	try_link("C: link(path) hardlink", AT_FDCWD, pc, 0, pl);

	/* D: linkat the *fd* via AT_EMPTY_PATH, inode still has a name (nlink=1) */
	try_link("D: linkat fd AT_EMPTY_PATH", pin_c, "", AT_EMPTY_PATH, pl);

	/* E: now make it nameless (nlink=0, sock still bound), then linkat the fd
	 *    -- this is the O_TMPFILE-style "name an anonymous inode" move. */
	unlink(pc);
	try_link("E: linkat fd, nlink==0", pin_c, "", AT_EMPTY_PATH, pl);

	return 0;
}

