From: Cong Wang <cwang@multikernel.io>
To: John Ericson <mail@johnericson.me>
Cc: network dev <netdev@vger.kernel.org>, Li Chen <me@linux.beauty>
Subject: Re: [RFC] connectat()/bindat() or an alternative design
Date: Fri, 12 Jun 2026 11:50:37 -0700
Message-ID: <aixU_RccxS_BKhgX@pop-os.localdomain>
In-Reply-To: <455281ec-3ee1-4f27-989b-c239f0690d8b@app.fastmail.com>
On Wed, Jun 10, 2026 at 10:08:57PM -0400, John Ericson wrote:
> Hi Cong,
>
> On Mon, Jun 8, 2026, at 3:45 PM, Cong Wang wrote:
> > Hi John,
> >
> > [...]
> >
> > Thanks for bringing this up.
>
> Sure, thanks for replying to me!
>
> > I have no doubt connectat()/bindat() helps close TOCTOU races for Unix
> > sockets. However, it would be nicer to describe your use case here,
> > especially what the problems are without it. This would help more to
> > justify your proposal here than just getting aligned with openat() or
> > BSD.
> >
> > Hope this helps.
> >
> > Regards,
> > Cong
>
> Yeah, happy to talk about that. Hope this is not too long a reply!
>
> First, for some background context, I am a developer of the Nix package
> manager. And this, plus my own personal taste, always has me thinking
> about ways we can run processes with fewer privileges. The
> no-ambient-authority capsicum/cloudabi/wasi/whatever dream has lived in
> my head rent-free for many years :). Now these days, with LLMs, it feels
> like these nice-to-have yak shaves of mine are finally worth dusting off
> and striking off the bucket list.
>
> Also in recent months, we Nix developers have been putting a bunch of
> work into using more `openat2` and friends, and I have no doubt that we
> will continue down this path (even on Windows!). We aim to be an
> exemplar program for following the "always work relative to a file
> descriptor" discipline. It's good for security, but also makes for code
> that --- I believe --- is just more elegant and nicer to read.
>
> ----
>
> Nearer term use case: slightly less ugly long path socket opening in
> Nix:

"Nix needs it" is a much better justification than "BSD already has it".
:) So please add this to your patch description/cover letter.

>
> If you look at [1] you can see a PR I've asked my coworker to draft to
> improve binding and connecting code to cope with longer file paths,
> something which does come up in practice when we are running multiple
> tests with multiple daemons in parallel.
>
> Now, I think it is safe to say that this code was already quite complex,
> and in this patch only gets *more* complex. The current interfaces make
> supporting longer paths quite annoying. (Though, once we remove the
> `open` and switch to an `*at`-style interface in the wrapper (if macOS
> lets us), it will get less bad.)
>
> So the first use case would be getting something nicer than the
> `/proc/self/fd/<N>` dance the linked code falls back to. It is good that
> `/proc/self/fd/<N>` exists for legacy code, but it is an unergonomic way
> to do file-descriptor-relative paths, and should be a fallback, never
> the first choice. A real fd parameter along with a regular path pointer
> would buy two concrete wins:
>
> 1. A clean, separate file descriptor parameter, the way `openat` has one
> --- rather than assembling a `/proc` path by hand.
>
> 2. Normal `PATH_MAX` room for the real pathname, rather than cramming
> `/proc/self/fd/<N>` (plus any residual path after it) into the small
> `sun_path` field of `struct sockaddr_un`.
>
> ----
>
> Longer term use case: anonymous listening sockets, avoiding advertising
> sockets to potential clients using ambient authority mechanisms
> altogether:
>
> Some more background: I think this whole business of listening
> unix sockets necessarily living in the file system is a bit silly, since
> there is nothing to put on disk --- it's just a mechanism to communicate
> to clients where they should connect. Now ostensibly, Linux agrees ---
> that is why Linux's *abstract* Unix domain sockets were created. But I
> really don't like this because we have just replaced one ambient
> authority contraption (the root filesystem) with another (the abstract
> socket name space in the network namespace). The problems with ambient
> authority remain all the same (and indeed, our experience with Nix has
> been that network namespace unsharing when you do want to do some
> outside world network access is much more work than filesystem namespace
> unsharing).

Indeed, that would be very hard to change, since this naming model has
been baked into the UDS API probably since day one.

Just curious: any reason not to use TCP loopback here?
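For reference, binding and connecting in the abstract namespace (Linux-only) looks roughly like this. This is a minimal sketch, and the helper names are hypothetical. The address is nothing more than a name, so any process in the same network namespace that knows it can connect. That is the ambient-authority concern raised above:

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill `a` with an abstract-namespace address: sun_path[0] is '\0' and the
 * name is the bytes that follow, delimited by the returned length rather
 * than by a terminator. Returns 0 if the name is too long. */
static socklen_t abstract_addr(const char *name, struct sockaddr_un *a)
{
    size_t n = strlen(name);
    if (n > sizeof a->sun_path - 1)
        return 0;
    memset(a, 0, sizeof *a);
    a->sun_family = AF_UNIX;
    memcpy(a->sun_path + 1, name, n);    /* sun_path[0] stays '\0' */
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);
}

/* Server: bind `name` in the abstract namespace and listen. */
static int listen_abstract(const char *name)
{
    struct sockaddr_un a;
    socklen_t len = abstract_addr(name, &a);
    if (len == 0)
        return -1;
    int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *)&a, len) < 0 || listen(fd, 64) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Client: anyone in the same network namespace who knows `name` can do this. */
static int connect_abstract(const char *name)
{
    struct sockaddr_un a;
    socklen_t len = abstract_addr(name, &a);
    if (len == 0)
        return -1;
    int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&a, len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```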
>
> What I would really like to do is go further than what I proposed, and
> separate the binding of a unix socket from the placing in the file
> system.
>
> Today, with only existing UAPIs, the closest you can get is a scratch
> path you pin with `O_PATH` and immediately unlink:
>
> /* server */
> int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> struct sockaddr_un a = { .sun_family = AF_UNIX };
> strcpy(a.sun_path, "/tmp/scratchXXXXXX");
> bind(lfd, (struct sockaddr *)&a, sizeof a);

Any reason not to use an abstract socket? From unix(7):

       abstract
              an abstract socket address is distinguished (from a
              pathname socket) by the fact that sun_path[0] is a null
              byte ('\0').  The socket's address in this namespace is
              given by the additional bytes in sun_path that are covered
              by the specified length of the address structure.  (Null
              bytes in the name have no special significance.)  The name
              has no connection with filesystem pathnames.  When the
              address of an abstract socket is returned, the returned
              addrlen is greater than sizeof(sa_family_t) (i.e., greater
              than 2), and the name of the socket is contained in the
              first (addrlen - sizeof(sa_family_t)) bytes of sun_path.

> int addrfd = open(a.sun_path, O_PATH | O_CLOEXEC); /* pin the socket inode */
> unlink(a.sun_path); /* nameless now */
> listen(lfd, 64);
>
> /* client, handed `addrfd` -- but still has to *name* it, via /proc magic */
> int cfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> struct sockaddr_un c = { .sun_family = AF_UNIX };
> snprintf(c.sun_path, sizeof c.sun_path, "/proc/self/fd/%d", addrfd);
> connect(cfd, (struct sockaddr *)&c, sizeof c);
>
> So even though I hold the socket by descriptor, I still route a pathname
> (`/proc/self/fd/...`) to reach it, and I have to deal with the
> `/tmp/scratchXXXXXX` proper temp file usage.
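Here is a self-contained version of the scratch-path dance above, as a sketch for comparison. It makes two assumptions. First, it targets Linux, because it relies on O_PATH and /proc. Second, it swaps in mkdtemp() to create a private scratch directory, since the `/tmp/scratchXXXXXX` template in the quoted snippet is never actually filled in. `listen_unnamed()` and `connect_via_fd()` are hypothetical names:

```c
#define _GNU_SOURCE            /* O_PATH */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Server: bind a listening socket under a private scratch directory, pin
 * the socket inode with an O_PATH descriptor, then remove every name for
 * it. Returns the listening fd and stores the pinning fd in *addrfd_out. */
static int listen_unnamed(int *addrfd_out)
{
    char dir[] = "/tmp/scratchXXXXXX";
    if (mkdtemp(dir) == NULL)          /* unique name, mode 0700 */
        return -1;

    struct sockaddr_un a = { .sun_family = AF_UNIX };
    snprintf(a.sun_path, sizeof a.sun_path, "%s/sock", dir);

    int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    int addrfd = -1;
    if (lfd >= 0 && bind(lfd, (struct sockaddr *)&a, sizeof a) == 0) {
        addrfd = open(a.sun_path, O_PATH | O_CLOEXEC);  /* pin the socket inode */
        unlink(a.sun_path);                             /* nameless now */
    }
    rmdir(dir);

    if (addrfd < 0 || listen(lfd, 64) < 0) {
        if (addrfd >= 0)
            close(addrfd);
        if (lfd >= 0)
            close(lfd);
        return -1;
    }
    *addrfd_out = addrfd;
    return lfd;
}

/* Client: holds `addrfd`, yet still has to spell a pathname through the
 * /proc/self/fd magic link to reach the socket. Returns the connected fd. */
static int connect_via_fd(int addrfd)
{
    struct sockaddr_un c = { .sun_family = AF_UNIX };
    snprintf(c.sun_path, sizeof c.sun_path, "/proc/self/fd/%d", addrfd);

    int cfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    if (cfd < 0)
        return -1;
    if (connect(cfd, (struct sockaddr *)&c, sizeof c) < 0) {
        close(cfd);
        return -1;
    }
    return cfd;
}
```

Once `listen_unnamed()` returns, nothing in the filesystem names the socket, and only a holder of `addrfd` can reach it. Even so, `connect_via_fd()` still has to build a `/proc/self/fd/%d` pathname to connect. That is the step a `connectat()` with `AT_EMPTY_PATH`, as proposed, would remove.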
>
> What I'd actually want is to sidestep all those nuisances entirely.
>
> The important piece is a `bind` variation: like binding an abstract unix
> socket, except that it publishes no abstract socket name, so the *only*
> way to connect to the socket is to be given an fd referring to it.
>
> A matching `connect` variation is more of a nice-to-have: it lets a
> client connect straight through that fd, rather than having to name it
> via `/proc/self/fd` as above.
>
> Put together:
>
> /* server */
> int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> int addrfd = bind_anon(lfd, /*flags, for the future*/0); /* proposed: no filesystem or abstract name */
> listen(lfd, 64);
>
> /* client, handed `addrfd` -- connect straight to the descriptor */
> int cfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> connectat(addrfd, cfd, NULL, 0, AT_EMPTY_PATH); /* proposed */
>
> I would use this *a lot*! First of all, in our testing code, I would use
> this, and not even bother (on Linux at least) putting the test daemon
> socket on a (probably quite long) path; I would just rig up the test
> harness to pass the fd to the client process with an environment
> variable (local not global naming!) indicating to the process which file
> descriptor it should connect to.
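The harness side of that could look roughly like this. Before exec, clear close-on-exec on the descriptor and record its number in an environment variable. After exec, the client parses and validates that number. This is a minimal sketch. `export_fd()` and `import_fd()` are hypothetical helpers, and so is any variable name such as `DAEMON_SOCKET_FD`; none of them is an existing convention:

```c
#define _POSIX_C_SOURCE 200809L   /* setenv() */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Harness side, before fork+exec: let `fd` survive exec and publish its
 * number under the (hypothetical) environment variable `var`. */
static int export_fd(const char *var, int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0 || fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0)
        return -1;
    char buf[16];
    snprintf(buf, sizeof buf, "%d", fd);
    return setenv(var, buf, 1);
}

/* Client side, after exec: look up the descriptor named by `var`, check
 * that it is actually open, and set close-on-exec again so it does not
 * leak into our own children. Returns the fd, or -1. */
static int import_fd(const char *var)
{
    const char *s = getenv(var);
    if (s == NULL || *s == '\0')
        return -1;
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < 0 || v > INT_MAX)
        return -1;
    int fd = (int)v;
    int flags = fcntl(fd, F_GETFD);           /* fails with EBADF if not open */
    if (flags < 0 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0)
        return -1;
    return fd;
}
```

With today's APIs, the descriptor handed over would be the O_PATH `addrfd`, and the client would still connect through `/proc/self/fd/<N>`. With the proposed `connectat()`, it could connect to that descriptor directly.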
>
> If that sounds vaguely like systemd socket activation, yes it should.
> Socket activating *servers*, as we do today, is great, but I would also
> modify my init system to pass these listening sockets to *client*
> services. At that point, servers should ditch any sort of `getsockopt`
> authentication (which they are likely to implement incorrectly or in an
> ad-hoc manner), and instead rely on the init system to make sure only
> services/users which are authorized to connect to a given server have
> been given its listening socket file descriptor.
>
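For comparison, the kind of `getsockopt` authentication mentioned above usually amounts to something like the following, using `SO_PEERCRED` on Linux. This is a minimal sketch. `peer_uid_allowed()` and its single-uid allow-list are hypothetical simplifications:

```c
#define _GNU_SOURCE            /* struct ucred */
#include <sys/socket.h>
#include <sys/types.h>

/* Ad-hoc server-side authorization: ask the kernel for the credentials the
 * peer had when the connection was set up (connect() or socketpair()),
 * and compare its uid against a single allowed uid. Returns 1 if allowed,
 * 0 if not, and -1 if the credentials could not be read. */
static int peer_uid_allowed(int connfd, uid_t allowed_uid)
{
    struct ucred cred;
    socklen_t len = sizeof cred;
    if (getsockopt(connfd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return -1;
    return cred.uid == allowed_uid;
}
```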
Thanks,
Cong

Thread overview: 4+ messages
2026-05-18 19:09 [RFC] connectat()/bindat() or an alternative design John Ericson
2026-06-08 19:45 ` Cong Wang
2026-06-11 2:08 ` John Ericson
2026-06-12 18:50 ` Cong Wang [this message]