Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH 00/12] Add kdbus implementation
From: Eric W. Biederman @ 2014-10-30  4:04 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, arnd-r2nGTMty4D4,
	tj-DgEjT+Ai2ygdnm+yROfE0A, marcel-kz+m5ild9QBg9hUCZPvPmw,
	desrt-0xnayjDhYQY, hadess-0MeiytkfxGOsTnJN9+BGXg,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, teg-B22kvLQNl6c,
	Andy Lutomirski
In-Reply-To: <20141029221505.GA7812-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

Greg KH <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes:

> On Wed, Oct 29, 2014 at 03:00:44PM -0700, Greg Kroah-Hartman wrote:
>> kdbus is a kernel-level IPC implementation that aims for resemblance to
>> the the protocol layer with the existing userspace D-Bus daemon while
>> enabling some features that couldn't be implemented before in userspace.
>
> {sigh}
>
> I'll blame it on the jet-lag for the lack of [XX/12] markings on the
> patches.  I'll give it a day for review before resending if people
> really want to know the ordering.  It doesn't matter except for the
> final patch that adds the code to the build file.
>
> sorry about that,

For what it is worth these patches are also poorly split up.  Every
patch I looked at in detail had functions that were being introduced
that did not have callers.

That poor split up of the patches makes it difficult to see how
the functionality that is being introduced is being used.

Eric

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Eric W. Biederman @ 2014-10-30  4:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Djalal Harouni, Arnd Bergmann, Ryan Lortie, Greg Kroah-Hartman,
	Marcel Holtmann, David Herrmann,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Tom Gundersen, simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ, John Stultz,
	Bastien Nocera, Linux API, Tejun Heo, Linux Containers,
	Linus Torvalds, javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A
In-Reply-To: <CALCETrVxvF2ie=vVgpjeqikn+nci_9jyKfU4s3t=4cjyNZNaNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

The userspace API breaks userspace in an unfixable way.

Nacked-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Problem the first.
- Using global names for containers makes it impossible to create
  unprivileged containers.

  This is a back to the drawing board problem, and makes device
  nodes fundamentally unsuited to what you are doing.

  There is no way that I can see to make it safe for an unprivileged
  user to create arbitrary named busses.  Especially in the presence
  of allowing unprivileged checkpoint/restart.

  This is particularly bad as kdbus explicitly allows unprivielged
  creation of new kdbus instances.

  This problem is a userspace regression.

Problem the second.
- The security checks in the code are not based on who opens the
  file descriptors but instead based on who is used the file
  descriptors at any give moment.

  That pattern has been shown to be exploitable.

  I expect the policy database makes this poor choice of permission
  checks even worse.  Pass a more privileged user a kdbus file
  descriptor and all of sudden things that were not possible on
  that file descriptor become possible.

Problem the third.
- You are using device numbers for things created by unprivileged
  users.  That breaks checkpoint/restart.  Aka CRIU.

  We can not migrate a container to a new machine and preserve the
  device numbers.  

  We can not migrate a container to a new machine and have any hope
  of preserving the container patsh under /dev/kdbus/...

  Both of which look like fundamental show stoppers for
  checkpoint/restart.

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Wed, Oct 29, 2014 at 3:27 PM, Greg Kroah-Hartman
> <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
>> On Wed, Oct 29, 2014 at 03:15:51PM -0700, Andy Lutomirski wrote:
>>> (reply 1/2 -- I'm replying twice to keep the threading sane)
>>>
>>> On Wed, Oct 29, 2014 at 3:00 PM, Greg Kroah-Hartman
>>> <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
>>> > kdbus is a kernel-level IPC implementation that aims for resemblance to
>>> > the the protocol layer with the existing userspace D-Bus daemon while
>>> > enabling some features that couldn't be implemented before in userspace.
>>> >
>>>
>>> >  * Support for multiple domains, completely separated from each other,
>>> >    allowing multiple virtualized instances to be used at the same time.
>>>
>>> Given that there is no such thing as a device namespace, how does this work?
>>
>> See the document for the details.
>
> They seem insufficient to me, so I tried to dig in to the code.  My
> understanding is:
>
> The parent container has /dev mounted.  It sends an IOCTL (which
> requires global capabilities).  In response, kdbus creates a whole
> bunch of devices that get put (by udev or devtmpfs, I presume) in a
> subdirectory.  Then the parent container mounts that subdirectory in
> the new container.
>
> This is IMO rather problematic.
>
> First, it enforces the existence of a kdbus domain hierarchy where
> none should be needed.
>
> Second, it's incompatible with nested user namespaces.  The middle
> namespace can't issue the ioctl.
>
> Third, it requires a devtmpfs submount in the child container.  This
> scares me, especially since there are no device namespaces.  Also, the
> child container appears to be dependent on the host udev to arbitrate
> everything, which seems totally wrong to me.  (Also, now we're exposed
> to attacks where the child container creates busses or endpoints or
> whatever with malicious names to try to trick the host into screwing
> up.)
>
> ISTM this should be solved either with device namespaces (which is
> well known to be a giant can of worms) or by abandoning the concept of
> kdbus using device nodes entirely.
>
> What if kdbus were kdbusfs?  If you want to use it in a container, you
> mount a brand-new kdbusfs there.  No weird domain hierarchy, no global
> privilege, no need to name containers, obvious migration semantics, no
> dependence on udev/devtmpfs at all, etc.
>
> Eric, any thoughts here?

I think a kdbusfs modeled on devpts with newinstance at
mount time would solve the naming problems.

That would break one of the current kdbus use cases that allows an
unprivileged user to create a bus.

Eric

p.s.  Please excuse my brevity I have am in the middle of packing up my
possessions (including my main machine), as I move this week.

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Daniel Mack @ 2014-10-30  7:12 UTC (permalink / raw)
  To: Eric W. Biederman, Greg KH
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, arnd-r2nGTMty4D4,
	tj-DgEjT+Ai2ygdnm+yROfE0A, marcel-kz+m5ild9QBg9hUCZPvPmw,
	desrt-0xnayjDhYQY, hadess-0MeiytkfxGOsTnJN9+BGXg,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, teg-B22kvLQNl6c,
	Andy Lutomirski
In-Reply-To: <87egtqurrp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On 10/30/2014 05:04 AM, Eric W. Biederman wrote:
> For what it is worth these patches are also poorly split up.  Every
> patch I looked at in detail had functions that were being introduced
> that did not have callers.

Yes, we wanted to keep the reply threading cleaner and the individual
patches short. With a patch set that avoids introducing functions
without callers, each patch would have grown substantially. But I know
that's unusual to do it that way.

> That poor split up of the patches makes it difficult to see how
> the functionality that is being introduced is being used.

Ok, I see. For now, I think it's probably easiest to pull the patches
from here, and then look at the resulting files directly:


https://git.kernel.org/cgit/linux/kernel/git/gregkh/char-misc.git/log/?h=kdbus

Other than that, please give us some time to respond to your longer
reply. Thanks for taking the time to write this up!


Daniel

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Daniel Mack @ 2014-10-30  7:44 UTC (permalink / raw)
  To: Andy Lutomirski, Greg Kroah-Hartman
  Cc: Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	John Stultz, Arnd Bergmann, Tejun Heo, Marcel Holtmann,
	Ryan Lortie, Bastien Nocera, David Herrmann, Djalal Harouni,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, Tom Gundersen
In-Reply-To: <CALCETrX6vf7cKy=XDhDtn9hn1W930MRxBa=pk93RnyuZ-EaNyw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 10/29/2014 11:28 PM, Andy Lutomirski wrote:
> On Wed, Oct 29, 2014 at 3:25 PM, Greg Kroah-Hartman

>> You do have to opt-in for this information at time of capture, so
>> I don't understand the issue here.  This is the same type of thing
>> that dbus does today, and I don't see the information leaks
>> happening there, do you?
> 
> The docs suggest that the *receiver* opts in.

Yes, that's true.

> I don't think that current dbus has severe information leaks because 
> the total scope for information transparently sent to dbus is rather 
> small (struct ucred only, presumably).

Which piece of credential information are you concerned about,
particularly? I might miss something, but AFAICS, all of that
information can be queried by a remote peer anyway, through /proc for
instance. The reason why we (optionally) attach them to messages is that
we want to let the other side know which information was authoritative,
precisely at the time the message was sent. Current implementation can't
do that in a race-free way.

Also note that we currently drop all such metadata whenever a message
crosses a PID or user namespace boundary. This is because we currently
don't know yet which information we would want to transport in such
cases, and how the translation in both directions would look like, from
a semantic perspective. Hence, we decided to leave that for later.

I'll go through your other replies during the day. Thanks for your input
on that RFC, everyone.

Daniel

^ permalink raw reply

* Re: kdbus: add code to gather metadata
From: Daniel Mack @ 2014-10-30  8:09 UTC (permalink / raw)
  To: Andy Lutomirski, Greg Kroah-Hartman
  Cc: Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	John Stultz, Arnd Bergmann, Tejun Heo, Marcel Holtmann,
	Ryan Lortie, Bastien Nocera, David Herrmann, Djalal Harouni,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, Tom Gundersen
In-Reply-To: <CALCETrWqbpxk83L0k0_78JZCO+ntZhx_hHMcRu=vxs6VE2f5JQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 10/29/2014 11:33 PM, Andy Lutomirski wrote:
> On Wed, Oct 29, 2014 at 3:00 PM, Greg Kroah-Hartman

>> +/**
>> + * kdbus_meta_new() - create new metadata object
>> + * @meta:              New metadata object
>> + *
>> + * Return: 0 on success, negative errno on failure.
>> + */
>> +int kdbus_meta_new(struct kdbus_meta **meta)
>> +{
>> +       struct kdbus_meta *m;
>> +
>> +       BUG_ON(*meta);
>> +
>> +       m = kzalloc(sizeof(*m), GFP_KERNEL);
>> +       if (!m)
>> +               return -ENOMEM;
>> +
>> +       /*
>> +        * Remember the PID and user namespaces our credentials belong to;
>> +        * we need to prevent leaking authorization and security-relevant
>> +        * data across different namespaces.
>> +        */
>> +       m->pid_namespace = get_pid_ns(task_active_pid_ns(current));
>> +       m->user_namespace = get_user_ns(current_user_ns());
>> +
> 
> This is unusual, and it could be very expensive (it will serialize
> essentially everyone on an exclusive cacheline).  What attack is it
> protecting against?

As mentioned before, we currently prevent metadata from crossing over
user and pid namespace boundaries. In order to detect such situations,
we need to pin the namespaces of the the task creating such a metadata
object, so we can compare them later, even when the original task is not
alive anymore. But I'm open for cheaper solutions for this, as I'm
admittedly not an expert in these APIs.

>> +static int kdbus_meta_append_cred(struct kdbus_meta *meta)
>> +{
>> +       struct kdbus_creds creds = {
>> +               .uid = from_kuid_munged(current_user_ns(), current_uid()),
>> +               .gid = from_kgid_munged(current_user_ns(), current_gid()),
>> +               .pid = task_pid_vnr(current),
>> +               .tid = task_tgid_vnr(current),
>> +               .starttime = current->start_time,
>> +       };
>> +
>> +       return kdbus_meta_append_data(meta, KDBUS_ITEM_CREDS,
>> +                                     &creds, sizeof(creds));
>> +}
> 
> This seems wrong to me.  Shouldn't this store kuid_t, etc. directly?

The metadata item's memory that is appended here is directly copied into
the final message in the receiver's pool later, so the information has
to be authoritative and translated at this point. This is currently not
a problem as in cases where we cross namespaces, the metadata will not
be added to the final message anyway.

But you're right, if we support translation between namespaces later, we
need to store the kuid_t here, and patch in the the translated version
later, when the message is installed by the receiving peer (which is
when we know which namespace to translate the kuid_t for).

> Also, why pid, tid, and starttime?

Because pid is also part of struct ucred, and starttime seemed to fit in
here as well. After all, an item has some overhead with its header, so
we tried to group information that will most probably be needed
together. Any strong reason not to store it here?

Thanks,
Daniel

^ permalink raw reply

* Re: kdbus: add header file
From: Arnd Bergmann @ 2014-10-30  8:20 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, tj-DgEjT+Ai2ygdnm+yROfE0A,
	marcel-kz+m5ild9QBg9hUCZPvPmw, desrt-0xnayjDhYQY,
	hadess-0MeiytkfxGOsTnJN9+BGXg, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, teg-B22kvLQNl6c
In-Reply-To: <1414620056-6675-3-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

On Wednesday 29 October 2014 15:00:46 Greg Kroah-Hartman wrote:
> +enum kdbus_ioctl_type {
> +       KDBUS_CMD_BUS_MAKE =            _IOW(KDBUS_IOCTL_MAGIC, 0x00,
> +                                            struct kdbus_cmd_make),
> +       KDBUS_CMD_DOMAIN_MAKE =         _IOW(KDBUS_IOCTL_MAGIC, 0x10,
> +                                            struct kdbus_cmd_make),
> +       KDBUS_CMD_ENDPOINT_MAKE =       _IOW(KDBUS_IOCTL_MAGIC, 0x20,
> +                                            struct kdbus_cmd_make),
> +
> +       KDBUS_CMD_HELLO =               _IOWR(KDBUS_IOCTL_MAGIC, 0x30,
> +                                             struct kdbus_cmd_hello),
> +       KDBUS_CMD_BYEBYE =              _IO(KDBUS_IOCTL_MAGIC, 0x31),
> +
> +       KDBUS_CMD_MSG_SEND =            _IOWR(KDBUS_IOCTL_MAGIC, 0x40,
> +                                             struct kdbus_msg),
> +       KDBUS_CMD_MSG_RECV =            _IOWR(KDBUS_IOCTL_MAGIC, 0x41,
> +                                             struct kdbus_cmd_recv),
> +       KDBUS_CMD_MSG_CANCEL =          _IOW(KDBUS_IOCTL_MAGIC, 0x42,
> +                                            struct kdbus_cmd_cancel),
> +       KDBUS_CMD_FREE =                _IOW(KDBUS_IOCTL_MAGIC, 0x43,
> +                                            struct kdbus_cmd_free),
> 

I think in general, using enum is great, but for ioctl command numbers,
we probably want to have defines so the user space implementation can
use #ifdef to see if the kernel version that it is being built for
knows a particular command.

You could do that using 

#define KDBUS_CMD_BUS_MAKE KDBUS_CMD_BUS_MAKE

while keeping the enum, or do it like everybody else using

#define KDBUS_CMD_BUS_MAKE _IOW(KDBUS_IOCTL_MAGIC, 0x00, struct kdbus_cmd_make)

which might in fact help some tools that try to do automated parsing
of header files to find ioctl commands.

	Arnd

^ permalink raw reply

* Re: kdbus: add selftests
From: Arnd Bergmann @ 2014-10-30  8:31 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, tj-DgEjT+Ai2ygdnm+yROfE0A,
	marcel-kz+m5ild9QBg9hUCZPvPmw, desrt-0xnayjDhYQY,
	hadess-0MeiytkfxGOsTnJN9+BGXg, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, teg-B22kvLQNl6c
In-Reply-To: <1414620056-6675-13-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

On Wednesday 29 October 2014 15:00:56 Greg Kroah-Hartman wrote:
> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> 
> This patch adds a quite extensive test suite for kdbus that checks
> the most important code pathes in the driver. The idea is to extend
> the test suite over time.
> 
> Also, this code can serve as an example implementation to show how to
> use the kernel API from userspace.
> 
> Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

Ah, new kernel code that comes with selftests, I'm impressed!

	Arnd

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Arnd Bergmann @ 2014-10-30  8:33 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, tj-DgEjT+Ai2ygdnm+yROfE0A,
	marcel-kz+m5ild9QBg9hUCZPvPmw, desrt-0xnayjDhYQY,
	hadess-0MeiytkfxGOsTnJN9+BGXg, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, teg-B22kvLQNl6c
In-Reply-To: <1414620056-6675-1-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

On Wednesday 29 October 2014 15:00:44 Greg Kroah-Hartman wrote:
>  drivers/misc/Kconfig                             |    1 +
>  drivers/misc/Makefile                            |    1 +
>  drivers/misc/kdbus/Kconfig                       |   11 +
>  drivers/misc/kdbus/Makefile                      |   19 +
>  drivers/misc/kdbus/bus.c                         |  450 ++++++
>  drivers/misc/kdbus/bus.h                         |  107 ++
>  drivers/misc/kdbus/connection.c                  | 1751 +++++++++++++++++++++
>  drivers/misc/kdbus/connection.h                  |  177 +++
>  drivers/misc/kdbus/domain.c                      |  477 ++++++
> 

One very high-level common:

Since this is going to be a very commonly used IPC mechanism, I don't
like the idea of stuffing it into drivers/misc.

How about putting it into drivers/kdbus or ipc/kdbus instead?

	Arnd

^ permalink raw reply

* Re: kdbus: add code to gather metadata
From: Daniel Mack @ 2014-10-30  8:45 UTC (permalink / raw)
  To: Andy Lutomirski, Greg Kroah-Hartman
  Cc: Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	John Stultz, Arnd Bergmann, Tejun Heo, Marcel Holtmann,
	Ryan Lortie, Bastien Nocera, David Herrmann, Djalal Harouni,
	Simon McVittie, alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	Javier Martinez Canillas, Tom Gundersen
In-Reply-To: <CALCETrVkuKxMMEw3HBEOZoFUuw8PndXtB13+bLWmcp_E34SaFw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 10/30/2014 01:13 AM, Andy Lutomirski wrote:
> On Wed, Oct 29, 2014 at 3:33 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Wed, Oct 29, 2014 at 3:00 PM, Greg Kroah-Hartman
>> <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
>>> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>>>
>>> A connection chooses which metadata it wants to have attached to each
>>> message it receives with kdbus_cmd_hello.attach_flags. The metadata
>>> will be attached as items to the messages. All metadata refers to
>>> information about the sending task at sending time, unless otherwise
>>> stated. Also, the metadata is copied, not referenced, so even if the
>>> sending task doesn't exist anymore at the time the message is received,
>>> the information is still preserved.
>>>
> 
> Also, in general, the comments seem to talk about capturing metadata
> at the time that a connection is opened, but the actual code seems to
> capture metadata all over the place.  I think it needs to be very
> clear, both in the code and the interface, when metadata is captured.

Ok, so we should make that cleaner in the comments then.

To clarify, we currently take metadata at the following occasions:

1. At open() time, So we can tell peers (through KDBUS_CMD_CONN_INFO)
about the credentials a connection had when it was created with
KDBUS_CMD_HELLO.

2. When a new bus is created through KDBUS_CMD_BUS_MAKE, so peers can
later query the credentials of the owner of the bus they're connected to.

3. When we dispatch a KDBUS_CMD_MSG_SEND ioctl(), because we want to
attach the metadata that was authoritative when the message was sent.
IOW: We want metadata that actually matches the message payload.

4. We create faked metadata to pass around in messages in case the
connection was created 'on behalf' of another task. This case we need to
cover so we can implement a daemon in userspace that translates between
existing D-Bus clients and kdbus. In such cases, we want the receiving
peers to see the creds of the proxied task, rather than the proxy, so we
pass the small amount of reliably credential information that we can get
with SO_PEERCRED into the KDBUS_CMD_HELLO ioctl. In the kernel, we
create a metadata object out of that, so we can reuse when a message is
sent. This case, however, is an considered an exception and limited to
privileged clients.

In all such cases, we share some implementation in metadata.c, and we
operate on the same kdbus_metadata object, even though the origin of the
data varies in the individual cases. I agree that this should be better
documented, so I've put that on my TODO list.

> And the ns_eq stuff is too far buried (and not even contained in this
> patch!) to be easily verified as being correct, whatever correct means
> in that context.

I see that. As I explained earlier in my reply to Eric, we've chosen to
submit the patch set this way to keep the reply threading clean, so it
was some sort of a trade-off. Still, I think the best way to review it
is to pull in Greg's patches and look at the actual files.

Thanks,
Daniel

^ permalink raw reply

* Re: kdbus: add connection, queue handling and message validation code
From: Djalal Harouni @ 2014-10-30  9:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Linux API,
	linux-kernel@vger.kernel.org, John Stultz, Arnd Bergmann,
	Tejun Heo, Marcel Holtmann, Ryan Lortie, Bastien Nocera,
	David Herrmann, Simon McVittie, daniel, alban.crequy,
	Javier Martinez Canillas, Tom Gundersen
In-Reply-To: <CALCETrXm116+eRvYY7QNbHcrOpZYOCqvC_WPguPZm-G+UEHeGw@mail.gmail.com>

On Wed, Oct 29, 2014 at 08:55:58PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 29, 2014 at 8:47 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> > Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:
> >
> >> From: Daniel Mack <daniel@zonque.org>
> >>
> >> This patch adds code to create and destroy connections, to validate
> >> incoming messages and to maintain the queue of messages that are
> >> associated with a connection.
> >>
> >> Note that connection and queue have a 1:1 relation, the code is only
> >> split in two parts for cleaner separation and better readability.
> >
> > You are not performing capability checks at open time.
> >
> > As such this API is suceptible to a host of file descriptor passing attacks.
> 
> To be fair, write(2) doesn't work on these fds, so the usual attacks
> don't work.  But who knows what absurd things kdbus clients will do
> with fd passing?
Yes, we use ioctl() so we are safe here! if there is a a suid process
that does perform arbitrary ioctl() on intrusted passed fds,
then we are already in truble given all the already available ioctl()
(not only kdbus, all available ioctl()... we blame the client), so yes
usual write()/read() do not work here.

But we do perform the creds check against the cred of connection
creation time, if you open the fd you do not have the connection,
you still need a KDBUS_CMD_HELLO ioctl() on the fd, and during that time
we store the creds, and we perform all the TALK, SEE and OWN against
those creds (uid/gid). It is like a second connect() call, unless you
perform the KDBUS_CMD_HELLO you are not connected, and after turning
your fd to a connection, a service can restrict its access (TALK, OWN
and SEE) policies, not all connected peers can TALK (send messages) to
a service.


-- 
Djalal Harouni
http://opendz.org

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Karol Lewandowski @ 2014-10-30  9:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-api
In-Reply-To: <20141029234001.GB16520@kroah.com>

On 2014-10-30 00:40, Greg Kroah-Hartman wrote:

> There is a 1815 line documentation file in this series, so we aren't
> trying to not provide this type of information here at all.  But yes,
> more background, about why this can't be done in userspace (zero copy,
> less context switches, proper credential passing, timestamping, availble
> at early-boot, LSM hooks for security models to tie into

While you're at it... I have worked on proof-of-concept LSM patches for
kdbus some time ago, see [1][2].  Currently, these are completely of date.

 [1] https://github.com/lmctl/linux/commits/kdbus-lsm-v4.for-systemd-v212
 [2] https://github.com/lmctl/kdbus/commit/aa0885489d19be92fa41c6f0a71df28763228a40

May I ask if you guys have your own plan for LSM or maybe it would be
worth to resurrect [1]?

Cheers,
-- 
Karol Lewandowski, Samsung R&D Institute Poland

^ permalink raw reply

* Re: kdbus: add code for buses, domains and endpoints
From: Djalal Harouni @ 2014-10-30  9:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, linux-api, linux-kernel, john.stultz, arnd,
	tj, marcel, desrt, hadess, dh.herrmann, simon.mcvittie, daniel,
	alban.crequy, javier.martinez, teg, Andy Lutomirski
In-Reply-To: <8738a6w6kv.fsf@x220.int.ebiederm.org>

On Wed, Oct 29, 2014 at 08:59:44PM -0700, Eric W. Biederman wrote:
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:
> 
> The way capabilities are checked in this patch make me very nervous.
> 
> We are not checking permissions at open time.  Every other location
> of calling capable on file like objects has been show to be suceptible
> to file descriptor pass attacks.
Yes, I do understand the concern, this is valid for some cases! but we
can't apply it on the ioctl API ?! please see below:

All (perhaps not all) the current ioctl do not check for fd passing
attacks! if a privileged do arbitrary ioctl on untrusted fds we are
already owned... the dumb privileged process is the one to blame, right?


Example:
1) fs/ext4/ioctl.c:ext4_ioctl()
   they have:
   inode_owner_or_capable() + capable() checks

   for all the restricted ioctl()

2) fs/xfs/xfs_ioctl.c:xfs_file_ioctl()
   they have:
   capable() checks

3) fs/btrfs/ioctl.c:btrfs_ioctl()
   they have capable() + inode_owner_or_capable()

... long list

These are sensible API and they do not care at all about fd passing,
so I don't think we should care either ?! or perhaps I'm missing
something ?


The capable() is done as it is, and for the inode_owner_or_capable() you
will notice that we followed the same logic and did use it in our
kdbus_bus_uid_is_privileged() to stay safe and follow what other API are
doing.

Thank you for the comments!


> > See Documentation/kdbus.txt for more details.
> >
> > Signed-off-by: Daniel Mack <daniel@zonque.org>
> > Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > ---
> 
> > diff --git a/drivers/misc/kdbus/bus.c b/drivers/misc/kdbus/bus.c
> > new file mode 100644
> > index 000000000000..6dcaf22f5d59
> > --- /dev/null
> > +++ b/drivers/misc/kdbus/bus.c
> > @@ -0,0 +1,450 @@
> 
> > +/**
> > + * kdbus_bus_cred_is_privileged() - check whether the given credentials in
> > + *				    combination with the capabilities of the
> > + *				    current thead are privileged on the bus
> > + * @bus:		The bus to check
> > + * @cred:		The credentials to match
> > + *
> > + * Return: true if the credentials are privileged, otherwise false.
> > + */
> > +bool kdbus_bus_cred_is_privileged(const struct kdbus_bus *bus,
> > +				  const struct cred *cred)
> > +{
> > +	/* Capabilities are *ALWAYS* tested against the current thread, they're
> > +	 * never remembered from conn-credentials. */
> > +	if (ns_capable(&init_user_ns, CAP_IPC_OWNER))
> > +		return true;
> > +
> > +	return uid_eq(bus->uid_owner, cred->fsuid);
> > +}
> > +
> > +/**
> > + * kdbus_bus_uid_is_privileged() - check whether the current user is a
> > + *				   priviledged bus user
> > + * @bus:		The bus to check
> > + *
> > + * Return: true if the current user has CAP_IPC_OWNER capabilities, or
> > + * if it has the same UID as the user that created the bus. Otherwise,
> > + * false is returned.
> > + */
> > +bool kdbus_bus_uid_is_privileged(const struct kdbus_bus *bus)
> > +{
> > +	return kdbus_bus_cred_is_privileged(bus, current_cred());
> > +}
> 
> 
> > +/**
> > + * kdbus_bus_new() - create a new bus
> > + * @domain:		The domain to work on
> > + * @make:		Pointer to a struct kdbus_cmd_make containing the
> > + *			details for the bus creation
> > + * @name:		Name of the bus
> > + * @bloom:		Bloom parameters for this bus
> > + * @mode:		The access mode for the device node
> > + * @uid:		The uid of the device node
> > + * @gid:		The gid of the device node
> > + * @bus:		Pointer to a reference where the new bus is stored
> > + *
> > + * This function will allocate a new kdbus_bus and link it to the given
> > + * domain.
> > + *
> > + * Return: 0 on success, negative errno on failure.
> > + */
> > +int kdbus_bus_new(struct kdbus_domain *domain,
> > +		  const struct kdbus_cmd_make *make,
> > +		  const char *name,
> > +		  const struct kdbus_bloom_parameter *bloom,
> > +		  umode_t mode, kuid_t uid, kgid_t gid,
> > +		  struct kdbus_bus **bus)
> > +{
> [snip]
> > +
> > +	if (!capable(CAP_IPC_OWNER) &&
> > +	    atomic_inc_return(&b->user->buses) > KDBUS_USER_MAX_BUSES) {
> > +		atomic_dec(&b->user->buses);
> > +		ret = -EMFILE;
> > +		goto exit_unref_user_unlock;
> > +	}
> > +

-- 
Djalal Harouni
http://opendz.org

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Tom Gundersen @ 2014-10-30 10:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Djalal Harouni, Arnd Bergmann, Ryan Lortie, Greg Kroah-Hartman,
	Marcel Holtmann, David Herrmann,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Simon McVittie, John Stultz, Bastien Nocera,
	Linux API, Tejun Heo, Linux Containers, Linus Torvalds,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, Daniel Mack
In-Reply-To: <87bnourxx4.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Hi Eric,

On Thu, Oct 30, 2014 at 5:20 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> The userspace API breaks userspace in an unfixable way.
>
> Nacked-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>
> Problem the first.
> - Using global names for containers makes it impossible to create
>   unprivileged containers.

I don't follow.

Just so we are on the same page:
  - creating a domain per container is only a convention, and has to
be done manually. I.e., the worst case scenario is that you are able
to create some container which cannot get a corresponding kdbus
domain.
  - domain names are only unique per parent-domain, and domains are
fully recursive. We explicitly tested recursive domains by running
kdbus-enabled containers within kdbus-enabled containers, a number of
iterations deep.

Could you explain the problem you see in more detail? This might just
be a documenation issue, after all.

>   This is a back to the drawing board problem, and makes device
>   nodes fundamentally unsuited to what you are doing.
>
>   There is no way that I can see to make it safe for an unprivileged
>   user to create arbitrary named busses.  Especially in the presence
>   of allowing unprivileged checkpoint/restart.

Note that unprivileged users cannot create arbitrary named busses, the
names must have the format $PID-<arbitrary name>. Do you see a problem
with this?

>   This is particularly bad as kdbus explicitly allows unprivielged
>   creation of new kdbus instances.

What do you mean by kdbus instance? A new domain? This is not allowed
by unprivileged processes. Or do you mean a new bus, in which case see
above.

>   This problem is a userspace regression.

This is all new functionality, how does it affect current code?

> Problem the second.
> - The security checks in the code are not based on who opens the
>   file descriptors but instead based on who is used the file
>   descriptors at any give moment.
>
>   That pattern has been shown to be exploitable.
>
>   I expect the policy database makes this poor choice of permission
>   checks even worse.  Pass a more privileged user a kdbus file
>   descriptor and all of sudden things that were not possible on
>   that file descriptor become possible.

Djalal already commented on this point in another thread. But just to
recap: Please note that we do not do read()/write() at all, only
ioctl's, so the most common exploits do not apply. Moreover, we are
following the same API pattern as used by other similar APIs in the
kernel. With that in mind, could you give some more specific
information about what kind of exploits you imagine?

> Problem the third.
> - You are using device numbers for things created by unprivileged
>   users.  That breaks checkpoint/restart.  Aka CRIU.
>
>   We can not migrate a container to a new machine and preserve the
>   device numbers.

I must admit to not being too familiar with checkpoint/restart. What
precisely is the problem with unprivileged users?

>   We can not migrate a container to a new machine and have any hope
>   of preserving the container patsh under /dev/kdbus/...

You may not be able to preserve the full path, no, but the container
should not know/care about the parent paths anyway.  Note that the
containers only see their own domain subtree mounted to /dev/kdbus,
they see nothing from the parent. Hence when you migrate containers
you can change the naming of the parent freely, but the processes
inside the containers won't see that, they'll have stable paths.   I'm
not seeing the problem here, care to elaborate?

> I think a kdbusfs modeled on devpts with newinstance at
> mount time would solve the naming problems.

Effectively, what we have in place in the current patch set delivers
similar semantics, however without introducing a new file system. You
just create a new domain and get a new subdir in /dev/kdbus/ for it,
and then inside the container you mount that subdir of /dev/kdbus onto
/dev/kdbus itself.

Do I understand you correctly that what you want is unnamed/anonymous
domains? Considering that domain creation is anyway privileged, why is
this necessary?

> That would break one of the current kdbus use cases that allows an
> unprivileged user to create a bus.

That is a fundamental usecase, so I don't think it makes much sense to
do anything that precludes that.

Cheers,

Tom

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Karol Lewandowski @ 2014-10-30 10:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Kosina, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	John Stultz, Arnd Bergmann, Tejun Heo, Ryan Lortie,
	Simon McVittie, daniel-cYrQPVfZoowdnm+yROfE0A, David Herrmann,
	Paul Moore,
	casey.schaufler-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
	marcel-kz+m5ild9QBg9hUCZPvPmw, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ
In-Reply-To: <54520A21.20404-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>

[ Sorry for breaking thread and resend - gmane rejected my original message
  due to too long list of recipients... ]

On 2014-10-30 00:40, Greg Kroah-Hartman wrote:

> There is a 1815 line documentation file in this series, so we aren't
> trying to not provide this type of information here at all.  But yes,
> more background, about why this can't be done in userspace (zero copy,
> less context switches, proper credential passing, timestamping, availble
> at early-boot, LSM hooks for security models to tie into

While you're at it... I did some work on proof-of-concept LSM patches for
kdbus some time ago, see [1][2].  Currently, these are completely of date.

 [1] https://github.com/lmctl/linux/commits/kdbus-lsm-v4.for-systemd-v212
 [2] https://github.com/lmctl/kdbus/commit/aa0885489d19be92fa41c6f0a71df28763228a40

May I ask if you guys have your own plan for LSM or maybe it would be
worth to resurrect [1]?

Cheers,
-- 
Karol Lewandowski, Samsung R&D Institute Poland

^ permalink raw reply

* Re: kdbus: add header file
From: Tom Gundersen @ 2014-10-30 11:02 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Greg Kroah-Hartman, Linux API, LKML, John Stultz, Tejun Heo,
	Marcel Holtmann, Ryan Lortie, Bastien Nocera, David Herrmann,
	Djalal Harouni, Simon McVittie, Daniel Mack, alban.crequy,
	javier.martinez
In-Reply-To: <3546486.lOZcZMmXYe@wuerfel>

On Thu, Oct 30, 2014 at 9:20 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> I think in general, using enum is great, but for ioctl command numbers,
> we probably want to have defines so the user space implementation can
> use #ifdef to see if the kernel version that it is being built for
> knows a particular command.

Does that make sense for the first version? I agree that we should use
 #define to allow #ifdef for when we add more ioctls in the future,
but these ioctls will always exist...

The nice thing about enums is of course that it helps with debugging
as gdb can show the string representation rather than the number,
because in contrast to #defines, an enum is something the compliler
knows about.

Cheers,

Tom

^ permalink raw reply

* Re: kdbus: add header file
From: Arnd Bergmann @ 2014-10-30 11:26 UTC (permalink / raw)
  To: Tom Gundersen
  Cc: Greg Kroah-Hartman, Linux API, LKML, John Stultz, Tejun Heo,
	Marcel Holtmann, Ryan Lortie, Bastien Nocera, David Herrmann,
	Djalal Harouni, Simon McVittie, Daniel Mack, alban.crequy,
	javier.martinez
In-Reply-To: <CAG-2HqV_mXyARMs=9GOpbCBHPU+2XMkDcm=sGekYnnujNPAYqQ@mail.gmail.com>

On Thursday 30 October 2014 12:02:39 Tom Gundersen wrote:
> On Thu, Oct 30, 2014 at 9:20 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > I think in general, using enum is great, but for ioctl command numbers,
> > we probably want to have defines so the user space implementation can
> > use #ifdef to see if the kernel version that it is being built for
> > knows a particular command.
> 
> Does that make sense for the first version? I agree that we should use
>  #define to allow #ifdef for when we add more ioctls in the future,
> but these ioctls will always exist...

It's mainly for consistency really.

> The nice thing about enums is of course that it helps with debugging
> as gdb can show the string representation rather than the number,
> because in contrast to #defines, an enum is something the compliler
> knows about.

This doesn't get passed as an enum in user space though, and when debugging
the kernel it only helps within one function.

	Arnd

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Tom Gundersen @ 2014-10-30 11:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Greg Kroah-Hartman, Jiri Kosina, Linux API,
	linux-kernel@vger.kernel.org, John Stultz, Arnd Bergmann,
	Tejun Heo, Marcel Holtmann, Ryan Lortie, Bastien Nocera,
	David Herrmann, Djalal Harouni, Simon McVittie, Daniel Mack,
	alban.crequy, Javier Martinez Canillas
In-Reply-To: <CALCETrXY90YAXA9-GLc0EowmRCtWZzh8br8seiqrzOkmNn8_Hw@mail.gmail.com>

On 10/30/2014 12:55 AM, Andy Lutomirski wrote:> It's worth noting that:
>
>  - Proper credential passing could be added to UNIX sockets, and we
> may want to do that anyway.  Also, the current kdbus semantics seem to
> be "spew lots of credentials and other miscellaneous
> potentially-sensitive and sometime spoofable information all over the
> place", which isn't obviously an improvement.  (This is fixable, but
> it will almost certainly not be compatible with current systemd kdbus
> code if fixed.)

Care to elaborate on what you think is spoofable, and what needs to be fixed?

Anyway, the idea is that by simply connecting to the bus and sending a
message to some service, you implicitly agree to passing some metadata
along to the service (and to a lesser extent to the bus). It's not
that this information is leaked, or that the peer could actively
access any of the sender's private memory. Also note that this kind of
metadata information is also available via /proc/$PID, and via
SCM_CREDENTIALS/SO_PEERCRED and the socket seclabel APIs. What the
kdbus API allows users to do is to get a lot more of this information
in a race-free way. For example, if you want to get the audit identity
bits, you can now get this attached securely by the kernel, at the
time the message is sent, rather than having to firest get the peer's
$PID from SCM_CREDENTIALS and then read the audit identity bits racily
from /proc/$PID/loginuid and /proc/$PID/sessionid.

>  - The current kdbus patches seem to be worse than UNIX sockets from a
> namespace perspective, but maybe I'm misunderstanding how it's
> supposed to work.  UNIX sockets work quite nicely in containers.

kdbus is recusively stackable for containers. You can run
kdbus-enabled containers within kdbus-enabled containers within
kdbus-enabled containers, with the full functionality available for
each container, and each container isolated from each other.

When credential information is passed between processes of different
(PID) namespaces most of the attached metadata is suppressed. This
isn't too different from how SCM_CREDENTIALS works, which will zero
out the bits it cannot translate as well.

>  - There's an obvious interface to add timestamping to UNIX sockets
> (it could work exactly the way it does for UDP / PTP).

Timestamping on AF_UNIX/SOCK_DGRAM already exists, but that's not
enough for the use-cases we want to support.

>  - I'm unconvinced by this performance argument without numbers.  The
> kdbus credential code, at least, looks to be quite heavy on allocation
> and atomics.  This isn't to say that the current userspace D-Bus
> daemon doesn't also serialize everything, but it could be made
> multithreaded.

There are some major benefits regarding performance:

* fewer userspace context switches. For a full-duplex method call it's
down from five to two: instead of sender -> dbus daemon -> service ->
dbus daemon -> sender it's just sender -> service -> sender.
* fewer message copies in userspace. For a full-duplex method call
it's down from eight to two: instead of copying the method call data
into a socket, out of a socket, into a socket, out of a socket, and
the same for the method reply, we just copy one message directly to
the receiver, and the reply back.
* generally fewer syscalls involved. A synchronous method call is now
doable in a single ioctl on the sender side.
* memfds can be used for transport purposes of larger payload. This
way, we can cover substantial payload sizes instead of just small
control messages, with no extra copies. kdbus, in its transport layer,
makes sure only sealed memfds are passed in as payload, so the sender
cannot modify the contents while the receiver is already parsing it.

>  - Race-free?  What are the races that are inherent to UNIX sockets?

Does the above explain what we have in mind?

Note that the aim is not necessarily that kdbus should be better than
UNIX sockets in every way, nor that it should be favoured in all
cases. What we are trying to address is a common case in environments
where peers don't necessarily trust each other.

Cheers,

Tom

^ permalink raw reply

* Re: kdbus: add header file
From: Daniel Mack @ 2014-10-30 11:52 UTC (permalink / raw)
  To: Arnd Bergmann, Tom Gundersen
  Cc: Greg Kroah-Hartman, Linux API, LKML, John Stultz, Tejun Heo,
	Marcel Holtmann, Ryan Lortie, Bastien Nocera, David Herrmann,
	Djalal Harouni, Simon McVittie, alban.crequy, javier.martinez
In-Reply-To: <6078917.F7Y7rNpK9C@wuerfel>

On 10/30/2014 12:26 PM, Arnd Bergmann wrote:
> On Thursday 30 October 2014 12:02:39 Tom Gundersen wrote:

>> The nice thing about enums is of course that it helps with debugging
>> as gdb can show the string representation rather than the number,
>> because in contrast to #defines, an enum is something the compliler
>> knows about.
> 
> This doesn't get passed as an enum in user space though, and when debugging
> the kernel it only helps within one function.

Hmm, this is the header exported to userspace, so having enums in would
make our lives easier, right?

Hence, for now, I'd propose we keep it the way it is, and add new ioctls
with defines once they are implemented. Are you okay with this? I'll add
a comment to the file to give a heads-up.


Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Eric W. Biederman @ 2014-10-30 12:02 UTC (permalink / raw)
  To: Tom Gundersen
  Cc: Djalal Harouni, Arnd Bergmann, Ryan Lortie, Greg Kroah-Hartman,
	Marcel Holtmann, David Herrmann,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Simon McVittie, John Stultz, Bastien Nocera,
	Linux API, Tejun Heo, Linux Containers, Linus Torvalds,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, Daniel Mack
In-Reply-To: <CAG-2HqUChohNrRSdXzckSiv8ZUYwFLMvRTc41Uo7-b-qmkSFMQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Tom Gundersen <teg-B22kvLQNl6c@public.gmane.org> writes:

> Hi Eric,
>
> On Thu, Oct 30, 2014 at 5:20 AM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> The userspace API breaks userspace in an unfixable way.
>>
>> Nacked-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>
>> Problem the first.
>> - Using global names for containers makes it impossible to create
>>   unprivileged containers.
>
> I don't follow.
>
> Just so we are on the same page:
>   - creating a domain per container is only a convention, and has to
> be done manually. I.e., the worst case scenario is that you are able
> to create some container which cannot get a corresponding kdbus
> domain.

Which is the classic definition of failure to restore a checkpoint.  You
can't get the name you needed.

>   - domain names are only unique per parent-domain, and domains are
> fully recursive. We explicitly tested recursive domains by running
> kdbus-enabled containers within kdbus-enabled containers, a number of
> iterations deep.
>
> Could you explain the problem you see in more detail? This might just
> be a documenation issue, after all.

Partly there is just a ridiculous amount of complexity in having
hiearchical names when there is fundamentally no hierarchy.

The problem I see is that creating a kdbus requires someone to grant you
privilege to do it.  You have to ask permission from the system
administrator.  For unprivileged containers you don't have to ask
permission to create one, you just need the appropriate support in your
kernel.

Given the fact you smash all of the names together in a hierarchy I
can't see how you can avoid requiring privilege for part of the
hierarchy creation.

>>   This is a back to the drawing board problem, and makes device
>>   nodes fundamentally unsuited to what you are doing.
>>
>>   There is no way that I can see to make it safe for an unprivileged
>>   user to create arbitrary named busses.  Especially in the presence
>>   of allowing unprivileged checkpoint/restart.
>
> Note that unprivileged users cannot create arbitrary named busses, the
> names must have the format $PID-<arbitrary name>. Do you see a problem
> with this?

Yes.  What pid namespace is that in?

How do I restore a checkpoint?

>>   This is particularly bad as kdbus explicitly allows unprivielged
>>   creation of new kdbus instances.
>
> What do you mean by kdbus instance? A new domain? This is not allowed
> by unprivileged processes. Or do you mean a new bus, in which case see
> above.

Oh great two concepts domains and busses.  The bottom line if I can't
create both unprivileged it is a regression in the functionality of
unprivileged containers.

>>   This problem is a userspace regression.
>
> This is all new functionality, how does it affect current code?

If you simply change the existing dbus users to use kdbus you get a
regression in containers.  Furthermore you get a regression in what
kinds of userspace a container can contain.

>> Problem the second.
>> - The security checks in the code are not based on who opens the
>>   file descriptors but instead based on who is used the file
>>   descriptors at any give moment.
>>
>>   That pattern has been shown to be exploitable.
>>
>>   I expect the policy database makes this poor choice of permission
>>   checks even worse.  Pass a more privileged user a kdbus file
>>   descriptor and all of sudden things that were not possible on
>>   that file descriptor become possible.
>
> Djalal already commented on this point in another thread. But just to
> recap: Please note that we do not do read()/write() at all, only
> ioctl's, so the most common exploits do not apply. Moreover, we are
> following the same API pattern as used by other similar APIs in the
> kernel. 

A pattern that has led to an exploitable kernel, because it breaks the
principle of least surprise.

> With that in mind, could you give some more specific
> information about what kind of exploits you imagine?

I don't know if it is exploitable or simply a maintenance disaster.  But
the behavior of file descriptors changing based on who is performing
operations on it is wrong.  It breaks the common unix expectations.

It means I can not pass a file descriptor into a strongly sandboxed
application and be able to predict what can be done with the file
descriptor in the sand box.

I suspect what you really want are system calls.  As system calls are
both less overhead and easier to understand what is going on.
Especially for something as commonly used as kdbus is aiming to be
ioctls seem like code obfuscation.

The easiest problem to trigger that I can imagine is an application that
calls setresuid will have unpredicatable behavior if the change their
effective uid happens between one call and the next of your ioctl.
Which can create subtle and difficult to find bugs.

There are also all kinds of issues with respect to namespaces that if
you care about the namespace you are referring to has to be pinned at
open time.

>> Problem the third.
>> - You are using device numbers for things created by unprivileged
>>   users.  That breaks checkpoint/restart.  Aka CRIU.
>>
>>   We can not migrate a container to a new machine and preserve the
>>   device numbers.
>
> I must admit to not being too familiar with checkpoint/restart. What
> precisely is the problem with unprivileged users?

>>   We can not migrate a container to a new machine and have any hope
>>   of preserving the container patsh under /dev/kdbus/...
>
> You may not be able to preserve the full path, no, but the container
> should not know/care about the parent paths anyway.  Note that the
> containers only see their own domain subtree mounted to /dev/kdbus,
> they see nothing from the parent. Hence when you migrate containers
> you can change the naming of the parent freely, but the processes
> inside the containers won't see that, they'll have stable paths.   I'm
> not seeing the problem here, care to elaborate?

Domain creation.
Random path conflicts for no reason except we have two machines.

>> I think a kdbusfs modeled on devpts with newinstance at
>> mount time would solve the naming problems.
>
> Effectively, what we have in place in the current patch set delivers
> similar semantics, however without introducing a new file system. You
> just create a new domain and get a new subdir in /dev/kdbus/ for it,
> and then inside the container you mount that subdir of /dev/kdbus onto
> /dev/kdbus itself.
>
> Do I understand you correctly that what you want is unnamed/anonymous
> domains? Considering that domain creation is anyway privileged, why is
> this necessary?

When an unprivileged user needs a new domain?  If domains are unnamed
it is possible that their creation not require privilege.

Anything that requires stopping and asking the system administrator
for something so that I can do today with an unprivileged container
winds up being a regression, a design bug, and a showstopper.

Unless there is a massive miscommunication you have those kinds of
issues with the kbus design.

I would love to hear different but it sounds like domains are a weird
partial solution for the fact you have crammed everything into a
hierarchy for no good reason.

>> That would break one of the current kdbus use cases that allows an
>> unprivileged user to create a bus.
>
> That is a fundamental usecase, so I don't think it makes much sense to
> do anything that precludes that.

Eric

^ permalink raw reply

* Re: kdbus: add header file
From: Arnd Bergmann @ 2014-10-30 12:03 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Tom Gundersen, Greg Kroah-Hartman, Linux API, LKML, John Stultz,
	Tejun Heo, Marcel Holtmann, Ryan Lortie, Bastien Nocera,
	David Herrmann, Djalal Harouni, Simon McVittie, alban.crequy,
	javier.martinez
In-Reply-To: <5452269A.9050003@zonque.org>

On Thursday 30 October 2014 12:52:58 Daniel Mack wrote:
> On 10/30/2014 12:26 PM, Arnd Bergmann wrote:
> > On Thursday 30 October 2014 12:02:39 Tom Gundersen wrote:
> 
> >> The nice thing about enums is of course that it helps with debugging
> >> as gdb can show the string representation rather than the number,
> >> because in contrast to #defines, an enum is something the compliler
> >> knows about.
> > 
> > This doesn't get passed as an enum in user space though, and when debugging
> > the kernel it only helps within one function.
> 
> Hmm, this is the header exported to userspace, so having enums in would
> make our lives easier, right?

My point was that you never use the enum by type and the only place in
user space where it's referenced would be something like

	ret = ioctl(fd, KDBUS_CMD_BUS_MAKE, &make);

In the debugger, you will see the source line here. If you trace into the
glibc ioctl function, you no longer know the type because that just
has an 'int'.

> Hence, for now, I'd propose we keep it the way it is, and add new ioctls
> with defines once they are implemented. Are you okay with this? I'll add
> a comment to the file to give a heads-up.

It's certainly not a show-stopped, but I have yet to see a good reason
why it would help anyone.

	Arnd

^ permalink raw reply

* Re: kdbus: add code for buses, domains and endpoints
From: Eric W. Biederman @ 2014-10-30 12:15 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Greg Kroah-Hartman, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, arnd-r2nGTMty4D4,
	tj-DgEjT+Ai2ygdnm+yROfE0A, marcel-kz+m5ild9QBg9hUCZPvPmw,
	desrt-0xnayjDhYQY, hadess-0MeiytkfxGOsTnJN9+BGXg,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, teg-B22kvLQNl6c,
	Andy Lutomirski
In-Reply-To: <20141030095854.GA4716@dztty>

Djalal Harouni <tixxdz-Umm1ozX2/EEdnm+yROfE0A@public.gmane.org> writes:

> On Wed, Oct 29, 2014 at 08:59:44PM -0700, Eric W. Biederman wrote:
>> Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes:
>> 
>> The way capabilities are checked in this patch make me very nervous.
>> 
>> We are not checking permissions at open time.  Every other location
>> of calling capable on file like objects has been show to be suceptible
>> to file descriptor pass attacks.
> Yes, I do understand the concern, this is valid for some cases! but we
> can't apply it on the ioctl API ?! please see below:
>
> All (perhaps not all) the current ioctl do not check for fd passing
> attacks! if a privileged do arbitrary ioctl on untrusted fds we are
> already owned... the dumb privileged process is the one to blame, right?
>
>
> Example:
> 1) fs/ext4/ioctl.c:ext4_ioctl()
>    they have:
>    inode_owner_or_capable() + capable() checks
>
>    for all the restricted ioctl()
>
> 2) fs/xfs/xfs_ioctl.c:xfs_file_ioctl()
>    they have:
>    capable() checks
>
> 3) fs/btrfs/ioctl.c:btrfs_ioctl()
>    they have capable() + inode_owner_or_capable()
>
> ... long list
>
> These are sensible API and they do not care at all about fd passing,
> so I don't think we should care either ?! or perhaps I'm missing
> something ?

- It is an easy mistake to make.
- We have not performed extensive audits of the capable calls at this
  time to veryify that fd passing is safe.
- Unless it is egregious we are likely to grandfather the existing usage
  in to avoid breaking userspace.

None of that is an excuse for kdbus to get it wrong once it has been
pointed out in review.
 
> The capable() is done as it is, and for the inode_owner_or_capable() you
> will notice that we followed the same logic and did use it in our
> kdbus_bus_uid_is_privileged() to stay safe and follow what other API are
> doing.

What others are doing makes it very hard to safely use allow those
ioctls in a tightly sandboxed application, as it is unpredictable
what the sandboxed ioctl can do with the file descriptor.

Further an application that calls setresuid at different times during
it's application will behave differently.  Which makes ioctls that do
not have consistent behavior after open time inappropriate for use in
userspace libraries.

Eric


> Thank you for the comments!
>
>
>> > See Documentation/kdbus.txt for more details.
>> >
>> > Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>> > Signed-off-by: Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>
>> > ---
>> 
>> > diff --git a/drivers/misc/kdbus/bus.c b/drivers/misc/kdbus/bus.c
>> > new file mode 100644
>> > index 000000000000..6dcaf22f5d59
>> > --- /dev/null
>> > +++ b/drivers/misc/kdbus/bus.c
>> > @@ -0,0 +1,450 @@
>> 
>> > +/**
>> > + * kdbus_bus_cred_is_privileged() - check whether the given credentials in
>> > + *				    combination with the capabilities of the
>> > + *				    current thead are privileged on the bus
>> > + * @bus:		The bus to check
>> > + * @cred:		The credentials to match
>> > + *
>> > + * Return: true if the credentials are privileged, otherwise false.
>> > + */
>> > +bool kdbus_bus_cred_is_privileged(const struct kdbus_bus *bus,
>> > +				  const struct cred *cred)
>> > +{
>> > +	/* Capabilities are *ALWAYS* tested against the current thread, they're
>> > +	 * never remembered from conn-credentials. */
>> > +	if (ns_capable(&init_user_ns, CAP_IPC_OWNER))
>> > +		return true;
>> > +
>> > +	return uid_eq(bus->uid_owner, cred->fsuid);
>> > +}
>> > +
>> > +/**
>> > + * kdbus_bus_uid_is_privileged() - check whether the current user is a
>> > + *				   priviledged bus user
>> > + * @bus:		The bus to check
>> > + *
>> > + * Return: true if the current user has CAP_IPC_OWNER capabilities, or
>> > + * if it has the same UID as the user that created the bus. Otherwise,
>> > + * false is returned.
>> > + */
>> > +bool kdbus_bus_uid_is_privileged(const struct kdbus_bus *bus)
>> > +{
>> > +	return kdbus_bus_cred_is_privileged(bus, current_cred());
>> > +}
>> 
>> 
>> > +/**
>> > + * kdbus_bus_new() - create a new bus
>> > + * @domain:		The domain to work on
>> > + * @make:		Pointer to a struct kdbus_cmd_make containing the
>> > + *			details for the bus creation
>> > + * @name:		Name of the bus
>> > + * @bloom:		Bloom parameters for this bus
>> > + * @mode:		The access mode for the device node
>> > + * @uid:		The uid of the device node
>> > + * @gid:		The gid of the device node
>> > + * @bus:		Pointer to a reference where the new bus is stored
>> > + *
>> > + * This function will allocate a new kdbus_bus and link it to the given
>> > + * domain.
>> > + *
>> > + * Return: 0 on success, negative errno on failure.
>> > + */
>> > +int kdbus_bus_new(struct kdbus_domain *domain,
>> > +		  const struct kdbus_cmd_make *make,
>> > +		  const char *name,
>> > +		  const struct kdbus_bloom_parameter *bloom,
>> > +		  umode_t mode, kuid_t uid, kgid_t gid,
>> > +		  struct kdbus_bus **bus)
>> > +{
>> [snip]
>> > +
>> > +	if (!capable(CAP_IPC_OWNER) &&
>> > +	    atomic_inc_return(&b->user->buses) > KDBUS_USER_MAX_BUSES) {
>> > +		atomic_dec(&b->user->buses);
>> > +		ret = -EMFILE;
>> > +		goto exit_unref_user_unlock;
>> > +	}
>> > +

^ permalink raw reply

* Re: kdbus: add documentation
From: Peter Meerwald @ 2014-10-30 12:20 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: linux-api, linux-kernel
In-Reply-To: <1414620056-6675-2-git-send-email-gregkh@linuxfoundation.org>


> kdbus is a system for low-latency, low-overhead, easy to use
> interprocess communication (IPC).
> 
> The interface to all functions in this driver is implemented through ioctls
> on /dev nodes.  This patch adds detailed documentation about the kernel
> level API design.

just some typos below

> Signed-off-by: Daniel Mack <daniel@zonque.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  Documentation/kdbus.txt | 1815 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 1815 insertions(+)
>  create mode 100644 Documentation/kdbus.txt
> 
> diff --git a/Documentation/kdbus.txt b/Documentation/kdbus.txt
> new file mode 100644
> index 000000000000..ac1a18908976
> --- /dev/null
> +++ b/Documentation/kdbus.txt
> @@ -0,0 +1,1815 @@
> +D-Bus is a system for powerful, easy to use interprocess communication (IPC).
> +
> +The focus of this document is an overview of the low-level, native kernel D-Bus
> +transport called kdbus. Kdbus in the kernel acts similar to a device driver,
> +all communication between processes take place over special character device

takes

> +nodes in /dev/kdbus/.
> +
> +For the general D-Bus protocol specification, the payload format, the
> +marshaling, and the communication semantics, please refer to:
> +  http://dbus.freedesktop.org/doc/dbus-specification.html
> +
> +For a kdbus specific userspace library implementation please refer to:
> +  http://cgit.freedesktop.org/systemd/systemd/tree/src/systemd/sd-bus.h
> +
> +Articles about D-Bus and kdbus:
> +  http://lwn.net/Articles/580194/
> +
> +
> +1. Terminology
> +===============================================================================
> +
> +  Domain:
> +    A domain is a named object containing a number of buses. A system
> +    container that contains its own init system and users usually also
> +    runs in its own kdbus domain. The /dev/kdbus/domain/<container-name>/
> +    directory shows up inside the domain as /dev/kdbus/. Every domain offers
> +    its own "control" device node to create new buses or new sub-domains.
> +    Domains have no connection to each other and cannot see nor talk to
> +    each other. See section 5 for more details.
> +
> +  Bus:
> +    A bus is a named object inside a domain. Clients exchange messages
> +    over a bus. Multiple buses themselves have no connection to each other;
> +    messages can only be exchanged on the same bus. The default entry point to
> +    a bus, where clients establish the connection to, is the "bus" device node
> +    /dev/kdbus/<bus name>/bus.
> +    Common operating system setups create one "system bus" per system, and one
> +    "user bus" for every logged-in user. Applications or services may create
> +    their own private named buses. See section 5 for more details.
> +
> +  Endpoint:
> +    An endpoint provides the device node to talk to a bus. Opening an
> +    endpoint creates a new connection to the bus to which the endpoint belongs.
> +    Every bus has a default endpoint called "bus".
> +    A bus can optionally offer additional endpoints with custom names to
> +    provide a restricted access to the same bus. Custom endpoints carry
> +    additional policy which can be used to give sandboxed processes only
> +    a locked-down, limited, filtered access to the same bus.
> +    See section 5 for more details.
> +
> +  Connection:
> +    A connection to a bus is created by opening an endpoint device node of
> +    a bus and becoming an active client with the HELLO exchange. Every
> +    connected client connection has a unique identifier on the bus and can
> +    address messages to every other connection on the same bus by using
> +    the peer's connection id as the destination.
> +    See section 6 for more details.
> +
> +  Pool:
> +    Each connection allocates a piece of shmem-backed memory that is used
> +    to receive messages and answers to ioctl command from the kernel. It is
> +    never used to send anything to the kernel. In order to access that memory,
> +    userspace must mmap() it into its task.
> +    See section 12 for more details.
> +
> +  Well-known Name:
> +    A connection can, in addition to its implicit unique connection id, request
> +    the ownership of a textual well-known name. Well-known names are noted in
> +    reverse-domain notation, such as com.example.service1. Connections offering
> +    a service on a bus are usually reached by its well-known name. The analogy
> +    of connection id and well-known name is an IP address and a DNS name
> +    associated with that address.
> +
> +  Message:
> +    Connections can exchange messages with other connections by addressing
> +    the peers with their connection id or well-known name. A message consists
> +    of a message header with kernel-specific information on how to route the
> +    message, and the message payload, which is a logical byte stream of
> +    arbitrary size. Messages can carry additional file descriptors to be passed
> +    from one connection to another. Every connection can specify which set of
> +    metadata the kernel should attach to the message when it is delivered
> +    to the receiving connection. Metadata contains information like: system
> +    timestamps, uid, gid, tid, proc-starttime, well-known-names, process comm,
> +    process exe, process argv, cgroup, capabilities, seclabel, audit session,
> +    loginuid and the connection's human-readable name.
> +    See section 7 and 13 for more details.
> +
> +  Item:
> +    The API of kdbus implements a notion of items, submitted through and
> +    returned by most ioctls, and stored inside data structures in the
> +    connection's pool. See section 4 for more details.
> +
> +  Broadcast and Match:
> +    Broadcast messages are potentially sent to all connections of a bus. By
> +    default, the connections will not actually receive any of the sent
> +    broadcast messages; only after installing a match for specific message
> +    properties, a broadcast message passes this filter.
> +    See section 10 for more details.
> +
> +  Policy:
> +    A policy is a set of rules that define which connections can see, talk to,
> +    or register a well-know name on the bus. A policy is attached to buses and
> +    custom endpoints, and modified by policy holder connection or owners of
> +    custom endpoints. See section 11 for more details.
> +
> +    Access rules to allow who can see a name on the bus are only checked on
> +    custom endpoints. Policies may be defined with names that end with '.*'.
> +    When matching a well-known name against such a wildcard entry, the last
> +    part of the name is ignored and checked against the wildcard name without
> +    the trailing '.*'. See section 11 for more details.
> +
> +  Privileged bus users:
> +    A user connecting to the bus is considered privileged if it is either the
> +    creator of the bus, or if it has the CAP_IPC_OWNER capability flag set.
> +
> +
> +2. Device Node Layout
> +===============================================================================
> +
> +The kdbus interface is exposed through device nodes in /dev.
> +
> +  /sys/bus/kdbus
> +  `-- devices
> +    |-- kdbus!0-system!bus -> ../../../devices/virtual/kdbus/kdbus!0-system!bus
> +    |-- kdbus!2702-user!bus -> ../../../devices/virtual/kdbus/kdbus!2702-user!bus
> +    |-- kdbus!2702-user!ep.app -> ../../../devices/virtual/kdbus/kdbus!2702-user!ep.app
> +    `-- kdbus!control -> ../../../devices/kdbus!control
> +
> +  /dev/kdbus
> +  |-- control
> +  |-- 0-system
> +  |   |-- bus
> +  |   `-- ep.apache
> +  |-- 1000-user
> +  |   `-- bus
> +  |-- 2702-user
> +  |   |-- bus
> +  |   `-- ep.app
> +  `-- domain
> +      |-- fedoracontainer
> +      |   |-- control
> +      |   |-- 0-system
> +      |   |   `-- bus
> +      |   `-- 1000-user
> +      |       `-- bus
> +      `-- mydebiancontainer
> +          |-- control
> +          `-- 0-system
> +              `-- bus
> +
> +Note:
> +  The device node subdirectory layout is arranged that a future version of
> +  kdbus could be implemented as a file system with a separate instance mounted
> +  for each domain. For any future changes, this always needs to be kept
> +  in mind. Also the dependency on udev's userspace hookups or sysfs attribute
> +  use should be limited to the absolute minimum for the same reason.
> +
> +
> +3. Data Structures and flags
> +===============================================================================
> +
> +3.1 Data structures and interconnections
> +----------------------------------------
> +
> +  +-------------------------------------------------------------------------+
> +  | Domain (Init Domain)                                                    |
> +  | /dev/kdbus/control                                                      |
> +  | +---------------------------------------------------------------------+ |
> +  | | Bus (System Bus)                                                    | |
> +  | | /dev/kdbus/0-system/                                                | |
> +  | | +-------------------------------+ +-------------------------------+ | |
> +  | | | Endpoint                      | | Endpoint                      | | |
> +  | | | /dev/kdbus/0-system/bus       | | /dev/kdbus/0-system/ep.app    | | |
> +  | | +-------------------------------+ +-------------------------------+ | |
> +  | | +--------------+ +--------------+ +--------------+ +--------------+ | |
> +  | | | Connection   | | Connection   | | Connection   | | Connection   | | |
> +  | | | :1.22        | | :1.25        | | :1.55        | | :1.81        | | |
> +  | | +--------------+ +--------------+ +--------------+ +--------------+ | |
> +  | +---------------------------------------------------------------------+ |
> +  |                                                                         |
> +  | +---------------------------------------------------------------------+ |
> +  | | Bus (User Bus for UID 2702)                                         | |
> +  | | /dev/kdbus/2702-user/                                               | |
> +  | | +-------------------------------+ +-------------------------------+ | |
> +  | | | Endpoint                      | | Endpoint                      | | |
> +  | | | /dev/kdbus/2702-user/bus      | | /dev/kdbus/2702-user/ep.app   | | |
> +  | | +-------------------------------+ +-------------------------------+ | |
> +  | | +--------------+ +--------------+ +--------------+ +--------------+ | |
> +  | | | Connection   | | Connection   | | Connection   | | Connection   | | |
> +  | | | :1.22        | | :1.25        | | :1.55        | | :1.81        | | |
> +  | | +--------------+ +--------------+ +-------------------------------+ | |
> +  | +---------------------------------------------------------------------+ |
> +  |                                                                         |
> +  | +---------------------------------------------------------------------+ |
> +  | | Domain (Container; inside it, fedoracontainer/ becomes /dev/kdbus/) | |
> +  | | /dev/kdbus/domain/fedoracontainer/control                           | |
> +  | | +-----------------------------------------------------------------+ | |
> +  | | | Bus (System Bus of "fedoracontainer")                           | | |
> +  | | | /dev/kdbus/domain/fedoracontainer/0-system/                     | | |
> +  | | | +-----------------------------+                                 | | |
> +  | | | | Endpoint                    |                                 | | |
> +  | | | | /dev/.../0-system/bus       |                                 | | |
> +  | | | +-----------------------------+                                 | | |
> +  | | | +-------------+ +-------------+                                 | | |
> +  | | | | Connection  | | Connection  |                                 | | |
> +  | | | | :1.22       | | :1.25       |                                 | | |
> +  | | | +-------------+ +-------------+                                 | | |
> +  | | +-----------------------------------------------------------------+ | |
> +  | |                                                                     | |
> +  | | +-----------------------------------------------------------------+ | |
> +  | | | Bus (User Bus for UID 270 of "fedoracontainer")                 | | |
> +  | | | /dev/kdbus/domain/fedoracontainer/2702-user/                    | | |
> +  | | | +-----------------------------+                                 | | |
> +  | | | | Endpoint                    |                                 | | |
> +  | | | | /dev/.../2702-user/bus      |                                 | | |
> +  | | | +-----------------------------+                                 | | |
> +  | | | +-------------+ +-------------+                                 | | |
> +  | | | | Connection  | | Connection  |                                 | | |
> +  | | | | :1.22       | | :1.25       |                                 | | |
> +  | | | +-------------+ +-------------+                                 | | |
> +  | | +-----------------------------------------------------------------+ | |
> +  | +---------------------------------------------------------------------+ |
> +  +-------------------------------------------------------------------------+
> +
> +The above description uses the D-Bus notation of unique connection names that
> +adds a ":1." prefix to the connection's unique ID. kbus itself doesn't
> +use that notation, neither internally nor externally. However, libraries and
> +other usespace code that aims for compatibility to D-Bus might.
> +
> +3.2 Flags
> +---------
> +
> +All ioctls used in the communication with the driver contain two 64-bit fields,
> +'flags' and 'kernel_flags'. In 'flags', the behavior of the command can be
> +tweaked, whereas in 'kernel_flags', the kernel driver writes back the mask of
> +supported bits upon each call, and sets the KDBUS_FLAGS_KERNEL bit. This is a
> +way to probe possible kernel features and make code forward and backward
> +compatible.
> +
> +All bits that are not recognized by the kernel in 'flags' are rejected, and the
> +ioctl fails with -EINVAL.
> +
> +
> +4. Items
> +===============================================================================
> +
> +To flexibly augment transport structures used by kdbus, data blobs of type
> +struct kdbus_item are used. An item has a fixed-sized header that only stores
> +the type of the item and the overall size. The total size is variable and is
> +in some cases defined by the item type, in other cases, they can be of
> +arbitrary length (for instance, a string).
> +
> +In the external kernel API, items are used for many ioctls to transport
> +optional information from userspace to kernelspace. They are also used for
> +information stored in a connection's pool, such as messages, name lists or
> +requested connection information.
> +
> +In all such occasions where items are used as part of the kdbus kernel API,
> +they are embedded in structs that have an overall size of their own, so there
> +can be many of them.
> +
> +The kernel expects all items to be aligned to 8-byte boundaries.
> +
> +A simple iterator in userspace would iterate over the items until the items
> +have reached the embedding structure's overall size. An example implementation
> +of such an iterator can be found in tools/testing/selftests/kdbus/kdbus-util.h.
> +
> +
> +5. Creation of new domains, buses and endpoints
> +===============================================================================
> +
> +The initial kdbus domain is unconditionally created by the kernel module. A
> +domain contains a "control" device node which allows to create a new bus or
> +domain. New domains do not have any buses created by default.
> +
> +
> +5.1 Domains and buses
> +---------------------
> +
> +Opening the control device node returns a file descriptor, it accepts the
> +ioctls KDBUS_CMD_BUS_MAKE and KDBUS_CMD_DOMAIN_MAKE which specify the name of
> +the new bus or domain to create. The control file descriptor needs to be kept
> +open for the entire life-time of the created bus or domain, closing it will
> +immediately cleanup the entire bus or domain and all its associated
> +resources and connections. Every control file descriptor can only be used once
> +to create a new bus or domain; from that point, it is not used for any
> +further communication until the final close().
> +
> +Each bus will generate a random, 128-bit UUID upon creation. It will be
> +returned to the creators of connections through kdbus_cmd_hello.id128 and can
> +be used by userspace to uniquely identify buses, even across different machines
> +or containers. The UUID will have its its variant bits set to 'DCE', and denote

its its

> +version 4 (random).
> +
> +When a new domain is created, its structure in /dev/kdbus/<name>/ is a
> +replication of what's initially created in /dev/kdbus. In fact, internally,
> +a dummy default domain is set up when the driver is loaded. This allows
> +userspace to bind-mount domain subtrees of /dev/kdbus into a container's
> +filesystem view, and hence achieve complete isolation from the host's domain
> +and those of other containers.
> +
> +
> +5.2 Endpoints
> +-------------
> +
> +Endpoints are entry points to a bus. By default, each bus has a default
> +endpoint called 'bus'. The bus owner has the ability to create custom
> +endpoints with specific names, permissions, and policy databases (see below).
> +
> +To create a custom endpoint, use the KDBUS_CMD_ENDPOINT_MAKE ioctl with struct
> +kdbus_cmd_make. Custom endpoints always have a policy db that, by default,

db -> database

> +does not allow anything. Everything that users of this new endpoint should be
> +able to do has to be explicitly specified through KDBUS_ITEM_NAME and
> +KDBUS_ITEM_POLICY_ACCESS items.
> +
> +5.3 Creating domains, buses and endpoints
> +-----------------------------------------
> +
> +KDBUS_CMD_BUS_MAKE, KDBUS_CMD_DOMAIN_MAKE and KDBUS_CMD_ENDPOINT_MAKE take a
> +struct kdbus_cmd_make argument.
> +
> +struct kdbus_cmd_make {
> +  __u64 size;
> +    The overall size of the struct, including its items.
> +
> +  __u64 flags;
> +    The flags for creation.
> +
> +    KDBUS_MAKE_ACCESS_GROUP
> +      Make the device node group-accessible
> +
> +    KDBUS_MAKE_ACCESS_WORLD
> +      Make the device node world-accessible
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  struct kdbus_item items[0];
> +    A list of items, only used for creating custom endpoints. Ignored for
> +    buses and domains.
> +};
> +
> +
> +6. Connections
> +===============================================================================
> +
> +
> +6.1 Connection IDs and well-known connection names
> +--------------------------------------------------
> +
> +Connections are identified by their connection id, internally implemented as a
> +uint64_t counter. The IDs of every newly created bus start at 1, and every new
> +connection will increment the counter by 1. The ids are not reused.
> +
> +In higher level tools, the user visible representation of a connection is
> +defined by the D-Bus protocol specification as ":1.<id>".
> +
> +Messages with a specific uint64_t destination id are directly delivered to
> +the connection with the corresponding id. Messages with the special destination
> +id KDBUS_DST_ID_BROADCAST are broadcast messages and are potentially delivered
> +to all known connections on the bus; clients interested in broadcast messages
> +need to subscribe to the specific messages they are interested though, before

comma before though

> +any broadcast message reaches them.
> +
> +Messages synthesized and sent directly by the kernel will carry the special
> +source id KDBUS_SRC_ID_KERNEL (0).
> +
> +In addition to the unique uint64_t connection id, established connections can
> +request the ownership of well-known names, under which they can be found and
> +addressed by other bus clients. A well-known name is associated with one and
> +only one connection at a time. See section 8 on name acquisition and the
> +name registry, and the validity of names.
> +
> +Messages can specify the special destination id 0 and carry a well-known name
> +in the message data. Such a message is delivered to the destination connection
> +which owns that well-known name.
> +
> +  +-------------------------------------------------------------------------+
> +  | +---------------+     +---------------------------+                     |
> +  | | Connection    |     | Message                   | -----------------+  |
> +  | | :1.22         | --> | src: 22                   |                  |  |
> +  | |               |     | dst: 25                   |                  |  |
> +  | |               |     |                           |                  |  |
> +  | |               |     |                           |                  |  |
> +  | |               |     +---------------------------+                  |  |
> +  | |               |                                                    |  |
> +  | |               | <--------------------------------------+           |  |
> +  | +---------------+                                        |           |  |
> +  |                                                          |           |  |
> +  | +---------------+     +---------------------------+      |           |  |
> +  | | Connection    |     | Message                   | -----+           |  |
> +  | | :1.25         | --> | src: 25                   |                  |  |
> +  | |               |     | dst: 0xffffffffffffffff   | -------------+   |  |
> +  | |               |     |  (KDBUS_DST_ID_BROADCAST) |              |   |  |
> +  | |               |     |                           | ---------+   |   |  |
> +  | |               |     +---------------------------+          |   |   |  |
> +  | |               |                                            |   |   |  |
> +  | |               | <--------------------------------------------------+  |
> +  | +---------------+                                            |   |      |
> +  |                                                              |   |      |
> +  | +---------------+     +---------------------------+          |   |      |
> +  | | Connection    |     | Message                   | --+      |   |      |
> +  | | :1.55         | --> | src: 55                   |   |      |   |      |
> +  | |               |     | dst: 0 / org.foo.bar      |   |      |   |      |
> +  | |               |     |                           |   |      |   |      |
> +  | |               |     |                           |   |      |   |      |
> +  | |               |     +---------------------------+   |      |   |      |
> +  | |               |                                     |      |   |      |
> +  | |               | <------------------------------------------+   |      |
> +  | +---------------+                                     |          |      |
> +  |                                                       |          |      |
> +  | +---------------+                                     |          |      |
> +  | | Connection    |                                     |          |      |
> +  | | :1.81         |                                     |          |      |
> +  | | org.foo.bar   |                                     |          |      |
> +  | |               |                                     |          |      |
> +  | |               |                                     |          |      |
> +  | |               | <-----------------------------------+          |      |
> +  | |               |                                                |      |
> +  | |               | <----------------------------------------------+      |
> +  | +---------------+                                                       |
> +  +-------------------------------------------------------------------------+
> +
> +
> +6.2 Creating connections
> +------------------------
> +
> +A connection to a bus is created by opening an endpoint device node of
> +a bus and becoming an active client with the KDBUS_CMD_HELLO ioctl. Every
> +connected client connection has a unique identifier on the bus and can
> +address messages to every other connection on the same bus by using
> +the peer's connection id as the destination.
> +
> +The KDBUS_CMD_HELLO ioctl takes the following struct as argument.
> +
> +struct kdbus_cmd_hello {
> +  __u64 size;
> +    The overall size of the struct, including all attached items.
> +
> +  __u64 conn_flags;
> +    Flags to apply to this connection:
> +
> +    KDBUS_HELLO_ACCEPT_FD
> +      When this flag is set, the connection can be sent file descriptors
> +      as message payload. If it's not set, any attempt of doing so will
> +      result in -ECOMM on the sender's side.
> +
> +    KDBUS_HELLO_ACTIVATOR
> +      Make this connection an activator (see below). With this bit set,
> +      an item of type KDBUS_ITEM_NAME has to be attached which describes
> +      the well-known name this connection should be an activator for.
> +
> +    KDBUS_HELLO_POLICY_HOLDER
> +      Make this connection a policy holder (see below). With this bit set,
> +      an item of type KDBUS_ITEM_NAME has to be attached which describes
> +      the well-known name this connection should hold a policy for.
> +
> +    KDBUS_HELLO_MONITOR
> +      Make this connection an eaves-dropping connection that receives all
> +      unicast messages sent on the bus. To also receive broadcast messages,
> +      the connection has to upload appropriate matches as well.
> +      This flag is only valid for privileged bus connections.
> +
> +  __u64 attach_flags;
> +      Request the attachment of metadata for each message received by this
> +      connection. The metadata actually attached may actually augment the list
> +      of requested items. See section 13 for more details.
> +
> +  __u64 bus_flags;
> +      Upon successful completion of the ioctl, this member will contain the
> +      flags of the bus it connected to.
> +
> +  __u64 id;
> +      Upon successful completion of the ioctl, this member will contain the
> +      id of the new connection.
> +
> +  __u64 pool_size;
> +      The size of the communication pool, in bytes. The pool can be accessed
> +      by calling mmap() on the file descriptor that was used to issue the
> +      KDBUS_CMD_HELLO ioctl.
> +
> +  struct kdbus_bloom_parameter bloom;
> +      Bloom filter parameter (see below).
> +
> +  __u8 id128[16];
> +      Upon successful completion of the ioctl, this member will contain the
> +      128 bit wide UUID of the connected bus.
> +
> +  struct kdbus_item items[0];
> +      Variable list of items to add optional additional information. The
> +      following items are currently expected/valid:
> +
> +      KDBUS_ITEM_CONN_NAME
> +        Contains a string to describes this connection's name, so it can be
> +        identified later.
> +
> +      KDBUS_ITEM_NAME
> +      KDBUS_ITEM_POLICY_ACCESS
> +        For activators and policy holders only, combinations of these two
> +        items describe policy access entries (see section about policy db).

the section is titled 'Policy', not policy db

> +
> +      KDBUS_ITEM_CREDS
> +      KDBUS_ITEM_SECLABEL
> +        Privileged bus users may submit these types in order to create
> +        connections with faked credentials. The only real use case for this
> +        is a proxy service which acts on behalf of some other tasks. For a
> +        connection that runs in that mode, the message's metadata items will
> +        be limited to what's specified here. See section 13 for more
> +        information.
> +
> +      Items of other types are silently ignored.
> +};
> +
> +
> +6.3 Activator and policy holder connection
> +------------------------------------------
> +
> +An activator connection is a placeholder for a well-known name. Messages sent
> +to such a connection can be used by userspace to start an implementor
> +connection, which will then get all the messages from the activator copied
> +over. An activator connection cannot be used to send any message.
> +
> +A policy holder connection only installs a policy for one or more names.
> +These policy entries are kept active as long as the connection is alive, and
> +are removed once it terminates. Such a policy connection type can be used to
> +deploy restrictions for names that are not yet active on the bus. A policy
> +holder connection cannot be used to send any message.
> +
> +The creation of activator, policy holder or monitor connections is an operation
> +restricted to privileged users on the bus (see section "Terminology").
> +
> +
> +6.4 Retrieving information on a connection
> +------------------------------------------
> +
> +The KDBUS_CMD_CONN_INFO ioctl can be used to retrieve credentials and
> +properties of the initial creator of a connection. This ioctl uses the
> +following struct:
> +
> +struct kdbus_cmd_info {
> +  __u64 size;
> +    The overall size of the struct, including the name with its 0-byte string
> +    terminator.
> +
> +  __u64 flags;
> +    Specify which items should be attached to the answer.
> +    The following flags can be used:
> +
> +    KDBUS_ATTACH_NAMES
> +      Add an item to the answer containing all the names the connection
> +      currently owns.
> +
> +    KDBUS_ATTACH_CONN_NAME
> +      Add an item to the answer containing the connection's name.
> +
> +    After the ioctl returns, this field will contain the current metadata
> +    attach flags of the connection.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 id;
> +    The connection's numerical ID to retrieve information for. If set to
> +    non-zero value, the 'name' field is ignored.
> +
> +  __u64 offset;
> +    When the ioctl returns, this value will yield the offset of the connection
> +    information inside the caller's pool.
> +
> +  struct kdbus_item items[0];
> +    The optional item list, containing the well-known name to look up as
> +    a KDBUS_ITEM_NAME. Only required if the 'id' field is set to 0.
> +    All other items are currently ignored.
> +};
> +
> +After the ioctl returns, the following struct  will be stored in the caller's

extra space after struct

> +pool at 'offset'.
> +
> +struct kdbus_info {
> +  __u64 size;
> +    The overall size of the struct, including all its items.
> +
> +  __u64 id;
> +    The connection's unique ID.
> +
> +  __u64 flags;
> +    The connection's flags as specified when it was created.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  struct kdbus_item items[0];
> +    Depending on the 'flags' field in struct kdbus_cmd_info, items of
> +    types KDBUS_ITEM_NAME and KDBUS_ITEM_CONN_NAME are followed here.
> +};
> +
> +Once the caller is finished with parsing the return buffer, it needs to call
> +KDBUS_CMD_FREE for the offset.
> +
> +
> +6.5 Getting information about a connection's bus creator
> +--------------------------------------------------------
> +
> +The KDBUS_CMD_BUS_CREATOR_INFO ioctl takes the same struct as
> +KDBUS_CMD_CONN_INFO but is used to retrieve information about the creator of
> +the bus the connection is attached to. The metadata returned by this call is
> +collected during the creation of the bus and is never altered afterwards, so
> +it provides pristine information on the task that created the bus, at the
> +moment when it did so.
> +
> +In response to this call, a slice in the connection's pool is allocated and
> +filled with an object of type struct kdbus_info, pointed to by the ioctl's
> +'offset' field.
> +
> +struct kdbus_info {
> +  __u64 size;
> +    The overall size of the struct, including all its items.
> +
> +  __u64 id;
> +    The bus' ID
> +
> +  __u64 flags;
> +    The bus' flags as specified when it was created.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  struct kdbus_item items[0];
> +    Metadata information is stored in items here.
> +};
> +
> +Once the caller is finished with parsing the return buffer, it needs to call
> +KDBUS_CMD_FREE for the offset.
> +
> +
> +6.6 Updating connection details
> +-------------------------------
> +
> +Some of a connection's details can be updated with the KDBUS_CMD_CONN_UPDATE
> +ioctl, using the file descriptor that was used to create the connection.
> +The update command uses the following struct.
> +
> +struct kdbus_cmd_update {
> +  __u64 size;
> +    The overall size of the struct, including all its items.
> +
> +  struct kdbus_item items[0];
> +    Items to describe the connection details to be updated. The following item
> +    types are supported:
> +
> +    KDBUS_ITEM_ATTACH_FLAGS
> +      Supply a new set of items to be attached to each message.
> +
> +    KDBUS_ITEM_NAME
> +    KDBUS_ITEM_POLICY_ACCESS
> +      Policy holder connections may supply a new set of policy information
> +      with these items. For other connection types, -EOPNOTSUPP is returned.
> +};
> +
> +
> +6.6 Termination
> +---------------
> +
> +A connection can be terminated by simply closing the file descriptor that was
> +used to start the connection. All pending incoming messages will be discarded,
> +and the memory in the pool will be freed.
> +
> +An alternative way of way of closing down a connection is calling the

way of way

> +KDBUS_CMD_BYEBYE ioctl on it, which will only succeed if the message queue
> +of the connection is empty at the time of closing, otherwise, -EBUSY is
> +returned.
> +
> +When this ioctl returns successfully, the connection has been terminated and
> +won't accept any new messages from remote peers. This way, a connection can
> +be terminated race-free, without losing any messages.
> +
> +
> +7. Messages
> +===============================================================================
> +
> +Messages consist of a fixed-size header followed directly by a list of
> +variable-sized data 'items'. The overall message size is specified in the
> +header of the message. The chain of data items can contain well-defined
> +message metadata fields, raw data, references to data, or file descriptors.
> +
> +
> +7.1 Sending messages
> +--------------------
> +
> +Messages are passed to the kernel with the KDBUS_CMD_MSG_SEND ioctl. Depending
> +on the the destination address of the message, the kernel delivers the message

the the

> +to the specific destination connection or to all connections on the same bus.
> +Sending messages across buses is not possible. Messages are always queued in
> +the memory pool of the destination connection (see below).
> +
> +The KDBUS_CMD_MSG_SEND ioctl uses struct kdbus_msg to describe the message to
> +be sent.
> +
> +struct kdbus_msg {
> +  __u64 size;
> +    The over all size of the struct, including the attached items.

overall

> +
> +  __u64 flags;
> +    Flags for message delivery:
> +
> +    KDBUS_MSG_FLAGS_EXPECT_REPLY
> +      Expect a reply from the remote peer to this message. With this bit set,
> +      the timeout_ns field must be set to a non-zero number of nanoseconds in
> +      which the receiving peer is expected to reply. If such a reply is not
> +      received in time, the sender will be notified with a timeout message
> +      (see below). The value must be an absolute value, in nanoseconds and
> +      based on CLOCK_MONOTONIC.
> +
> +      For a message to be accepted as reply, it must be a direct message to
> +      the original sender (not a broadcast), and its kdbus_msg.reply_cookie
> +      must match the previous message's kdbus_msg.cookie.
> +
> +      Expected replies also temporarily open the policy of the sending
> +      connection, so the other peer is allowed to respond within the given
> +      time window.
> +
> +    KDBUS_MSG_FLAGS_SYNC_REPLY
> +      By default, all calls to kdbus are considered asynchronous,
> +      non-blocking. However, as there are many use cases that need to wait
> +      for a remote peer to answer a method call, there's a way to send a
> +      message and wait for a reply in a synchronous fashion. This is what
> +      the KDBUS_MSG_FLAGS_SYNC_REPLY controls. The KDBUS_CMD_MSG_SEND ioctl
> +      will block until the reply has arrived, the timeout limit is reached,
> +      in case the remote connection was shut down, or if interrupted by
> +      a signal before any reply; see signal(7).
> +
> +      The offset of the reply message in the sender's pool is stored in
> +      in 'offset_reply' when the ioctl has returned without error. Hence,
> +      there is no need for another KDBUS_CMD_MSG_RECV ioctl or anything else
> +      to receive the reply.
> +
> +    KDBUS_MSG_FLAGS_NO_AUTO_START
> +      By default, when a message is sent to an activator connection, the
> +      activator notified and will start an implementor. This flag inhibits
> +      that behavior. With this bit set, and the remote being an activator,
> +      -EADDRNOTAVAIL is returned from the ioctl.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call of
> +    KDBUS_MSG_SEND.
> +
> +  __s64 priority;
> +    The priority of this message. Receiving messages (see below) may
> +    optionally be constrained to messages of a minimal priority. This
> +    allows for use cases where timing critical data is interleaved with
> +    control data on the same connection. If unused, the priority should be
> +    set to zero.
> +
> +  __u64 dst_id;
> +    The numeric ID of the destination connection, or KDBUS_DST_ID_BROADCAST
> +    (~0ULL) to address every peer on the bus, or KDBUS_DST_ID_NAME (0) to look
> +    it up dynamically from the bus' name registry. In the latter case, an item
> +    of type KDBUS_ITEM_DST_NAME is mandatory.
> +
> +  __u64 src_id;
> +    Upon return of the ioctl, this member will contain the sending
> +    connection's numerical ID. Should be 0 at send time.
> +
> +  __u64 payload_type;
> +    Type of the payload in the actual data records. Currently, only
> +    KDBUS_PAYLOAD_DBUS is accepted as input value of this field. When
> +    receiving messages that are generated by the kernel (notifications),
> +    this field will yield KDBUS_PAYLOAD_KERNEL.
> +
> +  __u64 cookie;
> +    Cookie of this message, for later recognition. Also, when replying
> +    to a message (see above), the cookie_reply field must match this value.
> +
> +  __u64 timeout_ns;
> +    If the message sent requires a reply from the remote peer (see above),
> +    this field contains the timeout in absolute nanoseconds based on
> +    CLOCK_MONOTONIC.
> +
> +  __u64 cookie_reply;
> +    If the message sent is a reply to another message, this field must
> +    match the cookie of the formerly received message.
> +
> +  __u64 offset_reply;
> +    If the message successfully got a synchronous reply (see above), this
> +    field will yield the offset of the reply message in the sender's pool.
> +    Is is what KDBUS_CMD_MSG_RECV usually does for asynchronous messages.
> +
> +  struct kdbus_item items[0];
> +    A dynamically sized list of items to contain additional information.
> +    The following items are expected/valid:
> +
> +    KDBUS_ITEM_PAYLOAD_VEC
> +    KDBUS_ITEM_PAYLOAD_MEMFD
> +    KDBUS_ITEM_FDS
> +      Actual data records containing the payload. See section "Passing of
> +      Payload Data".
> +
> +    KDBUS_ITEM_BLOOM_FILTER
> +      Bloom filter for matches (see below).
> +
> +    KDBUS_ITEM_DST_NAME
> +      Well-known name to send this message to. Required if dst_id is set
> +      to KDBUS_DST_ID_NAME. If a connection holding the given name can't
> +      be found, -ESRCH is returned.
> +      For messages to a unique name (ID), this item is optional. If present,
> +      the kernel will make sure the name owner matches the given unique name.
> +      This allows userspace tie the message sending to the condition that a
> +      name is currently owned by a certain unique name.
> +};
> +
> +The message will be augmented by the requested metadata items when queued into
> +the receiver's pool. See also section 13.1 ("Metadata and namespaces").
> +
> +
> +7.2 Message layout
> +------------------
> +
> +The layout of a message is shown below.
> +
> +  +-------------------------------------------------------------------------+
> +  | Message                                                                 |
> +  | +---------------------------------------------------------------------+ |
> +  | | Header                                                              | |
> +  | | size: overall message size, including the data records              | |
> +  | | destination: connection id of the receiver                          | |
> +  | | source: connection id of the sender (set by kernel)                 | |
> +  | | payload_type: "DBusDBus" textual identifier stored as uint64_t      | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | Data Record                                                         | |
> +  | | size: overall record size (without padding)                         | |
> +  | | type: type of data                                                  | |
> +  | | data: reference to data (address or file descriptor)                | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | padding bytes to the next 8 byte alignment                          | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | Data Record                                                         | |
> +  | | size: overall record size (without padding)                         | |
> +  | | ...                                                                 | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | padding bytes to the next 8 byte alignment                          | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | Data Record                                                         | |
> +  | | size: overall record size                                           | |
> +  | | ...                                                                 | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | padding bytes to the next 8 byte alignment                          | |
> +  | +---------------------------------------------------------------------+ |
> +  +-------------------------------------------------------------------------+
> +
> +
> +7.3 Passing of Payload Data
> +---------------------------
> +
> +When connecting to the bus, receivers request a memory pool of a given size,
> +large enough to carry all backlog of data enqueued for the connection. The
> +pool is internally backed by a shared memory file which can be mmap()ed by
> +the receiver.
> +
> +KDBUS_MSG_PAYLOAD_VEC:
> +  Messages are directly copied by the sending process into the receiver's pool,
> +  that way two peers can exchange data by effectively doing a single-copy from
> +  one process to another, the kernel will not buffer the data anywhere else.
> +
> +KDBUS_MSG_PAYLOAD_MEMFD:
> +  Messages can reference memfd files which contain the data.
> +  memfd files are tmpfs-backed files that allow sealing of the content of the
> +  file, which prevents all writable access to the file content.
> +  Only sealed memfd files are accepted as payload data, which enforces
> +  reliable passing of data; the receiver can assume that neither the sender nor
> +  anyone else can alter the content after the message is sent.
> +
> +Apart from the sender filling-in the content into memfd files, the data will
> +be passed as zero-copy from one process to another, read-only, shared between
> +the peers.
> +
> +
> +7.4 Receiving messages
> +----------------------
> +
> +Messages are received by the client with the KDBUS_CMD_MSG_RECV ioctl. The
> +endpoint device node of the bus supports poll() to wake up the receiving
> +process when new messages are queued up to be received.
> +
> +With the KDBUS_CMD_MSG_RECV ioctl, a struct kdbus_cmd_recv is used.
> +
> +struct kdbus_cmd_recv {
> +  __u64 flags;
> +    Flags to control the receive command.
> +
> +    KDBUS_RECV_PEEK
> +      Just return the location of the next message. Do not install file
> +      descriptors or anything else. This is usually used to determine the
> +      sender of the next queued message.
> +
> +    KDBUS_RECV_DROP
> +      Drop the next message without doing anything else with it, and free the
> +      pool slice. This a short-cut for KDBUS_RECV_PEEK and KDBUS_CMD_FREE.
> +
> +    KDBUS_RECV_USE_PRIORITY
> +      Use the priority field (see below).
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __s64 priority;
> +      With KDBUS_RECV_USE_PRIORITY set in flags, receive the next message in
> +      the queue with at least the given priority. If no such message is waiting
> +      in the queue, -ENOMSG is returned.
> +
> +  __u64 offset;
> +      Upon return of the ioctl, this field contains the offset in the
> +      receiver's memory pool.
> +};
> +
> +Unless KDBUS_RECV_DROP was passed, and given that the ioctl succeeded, the
> +offset field contains the location of the new message inside the receiver's
> +pool. The message is stored as struct kdbus_msg at this offset, and can be
> +interpreted with the semantics described above.
> +
> +Also, if the connection allowed for file descriptor to be passed
> +(KDBUS_HELLO_ACCEPT_FD), and if the message contained any, they will be
> +installed into the receiving process after the KDBUS_CMD_MSG_RECV ioctl
> +returns. The receiving task is obliged to close all of them appropriately.
> +
> +The caller is obliged to call KDBUS_CMD_FREE with the returned offset when
> +the memory is no longer needed.
> +
> +
> +7.5 Canceling messages synchronously waiting for replies
> +--------------------------------------------------------
> +
> +When a connection sends a message with KDBUS_MSG_FLAGS_SYNC_REPLY and
> +blocks while waiting for the reply, the KDBUS_CMD_MSG_CANCEL ioctl can be
> +used on the same file descriptor to cancel the message, based on its cookie.
> +If there are multiple messages with the same cookie that are all synchronously
> +waiting for a reply, all of them will be canceled. Obviously, this is only
> +possible in multi-threaded applications.
> +
> +
> +8. Name registry
> +===============================================================================
> +
> +Each bus instantiates a name registry to resolve well-known names into unique
> +connection IDs for message delivery. The registry will be queried when a
> +message is sent with kdbus_msg.dst_id set to KDBUS_DST_ID_NAME, or when a
> +registry dump is requested.
> +
> +All of the below is subject to policy rules for SEE and OWN permissions.
> +
> +
> +8.1 Name validity
> +-----------------
> +
> +A name has to comply to the following rules to be considered valid:
> +
> + - The name has two or more elements separated by a period ('.') character
> + - All elements must contain at least one character
> + - Each element must only contain the ASCII characters "[A-Z][a-z][0-9]_"
> +   and must not begin with a digit
> + - The name must contain at least one '.' (period) character
> +   (and thus at least two elements)
> + - The name must not begin with a '.' (period) character
> + - The name must not exceed KDBUS_NAME_MAX_LEN (255)
> +
> +
> +8.2 Acquiring a name
> +--------------------
> +
> +To acquire a name, a client uses the KDBUS_CMD_NAME_ACQUIRE ioctl with the
> +following data structure.
> +
> +struct kdbus_cmd_name {
> +  __u64 size;
> +    The overall size of this struct, including the name with its 0-byte string
> +    terminator.
> +
> +  __u64 flags;
> +    Flags to control details in the name acquisition.
> +
> +    KDBUS_NAME_REPLACE_EXISTING
> +      Acquiring a name that is already present usually fails, unless this flag
> +      is set in the call, and KDBUS_NAME_ALLOW_REPLACEMENT or (see below) was
> +      set when the current owner of the name acquired it, or if the current
> +      owner is an activator connection (see below).
> +
> +    KDBUS_NAME_ALLOW_REPLACEMENT
> +      Allow other connections to take over this name. When this happens, the
> +      former owner of the connection will be notified of the name loss.
> +
> +    KDBUS_NAME_QUEUE (acquire)
> +      A name that is already acquired by a connection, and which wasn't
> +      requested with the KDBUS_NAME_ALLOW_REPLACEMENT flag set can not be
> +      acquired again. However, a connection can put itself in a queue of
> +      connections waiting for the name to be released. Once that happens, the
> +      first connection in that queue becomes the new owner and is notified
> +      accordingly.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  struct kdbus_item items[0];
> +    Items to submit the name. Currently, one one item of type KDBUS_ITEM_NAME

one one

> +    is expected and allowed, and the contained string must be a valid bus name.
> +};
> +
> +
> +8.3 Releasing a name
> +--------------------
> +
> +A connection may release a name explicitly with the KDBUS_CMD_NAME_RELEASE
> +ioctl. If the connection was an implementor of an activatable name, its
> +pending messages are moved back to the activator. If there are any connections
> +queued up as waiters for the name, the oldest one of them will become the new
> +owner. The same happens implicitly for all names once a connection terminates.
> +
> +The KDBUS_CMD_NAME_RELEASE ioctl uses the same data structure as the
> +acquisition call, but with slightly different field usage.
> +
> +struct kdbus_cmd_name {
> +  __u64 size;
> +    The overall size of this struct, including the name with its 0-byte string
> +    terminator.
> +
> +  __u64 flags;
> +
> +  struct kdbus_item items[0];
> +    Items to submit the name. Currently, one one item of type KDBUS_ITEM_NAME

one one

> +    is expected and allowed, and the contained string must be a valid bus name.
> +};
> +
> +
> +8.4 Dumping the name registry
> +-----------------------------
> +
> +A connection may request a complete or filtered dump of currently active bus
> +names with the KDBUS_CMD_NAME_LIST ioctl, which takes a struct
> +kdbus_cmd_name_list as argument.
> +
> +struct kdbus_cmd_name_list {
> +  __u64 flags;
> +    Any combination of flags to specify which names should be dumped.
> +
> +    KDBUS_NAME_LIST_UNIQUE
> +      List the unique (numeric) IDs of the connection, whether it owns a name
> +      or not.
> +
> +    KDBUS_NAME_LIST_NAMES
> +      List well-known names stored in the database which are actively owned by
> +      a real connection (not an activator).
> +
> +    KDBUS_NAME_LIST_ACTIVATORS
> +      List names that are owned by an activator.
> +
> +    KDBUS_NAME_LIST_QUEUED
> +      List connections that are not yet owning a name but are waiting for it
> +      to become available.
> +
> +  __u64 offset;
> +    When the ioctl returns successfully, the offset to the name registry dump
> +    inside the connection's pool will be stored in this field.
> +};
> +
> +The returned list of names is stored in a struct kdbus_name_list that in turn
> +contains a dynamic number of struct kdbus_cmd_name that carry the actual
> +information. The fields inside that struct kdbus_cmd_name is described next.
> +
> +struct kdbus_name_info {
> +  __u64 size;
> +    The overall size of this struct, including the name with its 0-byte string
> +    terminator.
> +
> +  __u64 flags;
> +    The current flags for this name. Can be any combination of
> +
> +    KDBUS_NAME_ALLOW_REPLACEMENT
> +
> +    KDBUS_NAME_IN_QUEUE (list)
> +      When retrieving a list of currently acquired name in the registry, this
> +      flag indicates whether the connection actually owns the name or is
> +      currently waiting for it to become available.
> +
> +    KDBUS_NAME_ACTIVATOR (list)
> +      An activator connection owns a name as a placeholder for an implementor,
> +      which is started on demand as soon as the first message arrives. There's
> +      some more information on this topic below. In contrast to
> +      KDBUS_NAME_REPLACE_EXISTING, when a name is taken over from an activator
> +      connection, all the messages that have been queued in the activator
> +      connection will be moved over to the new owner. The activator connection
> +      will still be tracked for the name and will take control again if the
> +      implementor connection terminates.
> +      This flag can not be used when acquiring a name, but is implicitly set
> +      through KDBUS_CMD_HELLO with KDBUS_HELLO_ACTIVATOR set in
> +      kdbus_cmd_hello.conn_flags.
> +
> +  __u64 owner_id;
> +    The owning connection's unique ID.
> +
> +  __u64 conn_flags;
> +    The flags of the owning connection.
> +
> +  struct kdbus_item items[0];
> +    Items containing the actual name. Currently, one one item of type

one one

> +    KDBUS_ITEM_NAME will be attached.
> +};
> +
> +The returned buffer must be freed with the KDBUS_CMD_FREE ioctl when the user
> +is finished with it.
> +
> +
> +9. Notifications
> +===============================================================================
> +
> +The kernel will notify its users of the following events.
> +
> +  * When connection A is terminated while connection B is waiting for a reply
> +    from it, connection B is notified with a message with an item of type
> +    KDBUS_ITEM_REPLY_DEAD.
> +
> +  * When connection A does not receive a reply from connection B within the
> +    specified timeout window, connection A will receive a message with an item
> +    of type KDBUS_ITEM_REPLY_TIMEOUT.
> +
> +  * When a connection is created on or removed from a bus, messages with an
> +    item of type KDBUS_ITEM_ID_ADD or KDBUS_ITEM_ID_REMOVE, respectively, are
> +    sent to all bus members that match these messages through their match
> +    database.
> +
> +  * When a connection owns or loses a name, or a name is moved from one
> +    connection to another, messages with an item of type KDBUS_ITEM_NAME_ADD,
> +    KDBUS_ITEM_NAME_REMOVE or KDBUS_ITEM_NAME_CHANGE are sent to all bus
> +    members that match these messages through their match database.
> +
> +A kernel notification is a regular kdbus message with the following details.
> +
> +  * kdbus_msg.src_id == KDBUS_SRC_ID_KERNEL
> +  * kdbus_msg.dst_id == KDBUS_DST_ID_BROADCAST
> +  * kdbus_msg.payload_type == KDBUS_PAYLOAD_KERNEL
> +  * Has exactly one of the aforementioned items attached
> +
> +
> +10. Message Matching, Bloom filters
> +===============================================================================
> +
> +10.1 Matches for broadcast messages from other connections
> +----------------------------------------------------------
> +
> +A message addressed at the connection ID KDBUS_DST_ID_BROADCAST (~0ULL) is a
> +broadcast message, delivered to all connected peers which installed a rule to
> +match certain properties of the message. Without any rules installed in the
> +connection, no broadcast message or kernel-side notifications will be delivered
> +to the connection. Broadcast messages are subject to policy rules and TALK
> +access checks.
> +
> +See section 11 for details on policies, and section 11.5 for more
> +details on implicit policies.
> +
> +Matches for messages from other connections (not kernel notifications) are
> +implemented as bloom filters. The sender adds certain properties of the message
> +as elements to a bloom filter bit field, and sends that along with the
> +broadcast message.
> +
> +The connection adds the message properties it is interested as elements to a
> +bloom mask bit field, and uploads the mask to the match rules of the
> +connection.
> +
> +The kernel will match the broadcast message's bloom filter against the
> +connections bloom mask (simply by &-ing it), and decide whether the message
> +should be delivered to the connection.
> +
> +The kernel has no notion of any specific properties of the message, all it
> +sees are the bit fields of the bloom filter and mask to match against. The
> +use of bloom filters allows simple and efficient matching, without exposing
> +any message properties or internals to the kernel side. Clients need to deal
> +with the fact that they might receive broadcasts which they did not subscribe
> +to, as the bloom filter might allow false-positives to pass the filter.
> +
> +To allow the future extension of the set of elements in the bloom filter, the
> +filter specifies a "generation" number. A later generation must always contain
> +all elements of the set of the previous generation, but can add new elements
> +to the set. The match rules mask can carry an array with all previous
> +generations of masks individually stored. When the filter and mask are matched
> +by the kernel, the mask with the closest matching "generation" is selected
> +as the index into the mask array.
> +
> +
> +10.2 Matches for kernel notifications
> +------------------------------------
> +
> +To receive kernel generated notifications (see section 9), a connection must
> +install special match rules that are different from the bloom filter matches
> +described in the section above. They can be filtered by a sender connection's
> +ID, by one of the name the sender connection owns at the time of sending the
> +message, or by type of the notification (id/name add/remove/change).
> +
> +10.3 Adding a match
> +-------------------
> +
> +To add a match, the KDBUS_CMD_MATCH_ADD ioctl is used, which takes a struct
> +of the struct described below.
> +
> +Note that each of the items attached to this command will internally create
> +one match 'rule', and the collection of them, which is submitted as one block
> +via the ioctl is called a 'match'. To allow a message to pass, all rules of a
> +match have to be satisfied. Hence, adding more items to the command will only
> +narrow the possibility of a match to effectively let the message pass, and will
> +cause the connection's user space process to wake up less likely.
> +
> +Multiple matches can be installed per connection. As long as one of it has a
> +set of rules which allows the message to pass, this one will be decisive.
> +
> +struct kdbus_cmd_match {
> +  __u64 size;
> +    The overall size of the struct, including its items.
> +
> +  __u64 cookie;
> +    A cookie which identifies the match, so it can be referred to at removal
> +    time.
> +
> +  __u64 flags;
> +    Flags to control the behavior of the ioctl.
> +
> +    KDBUS_MATCH_REPLACE:
> +      Remove all entries with the given cookie before installing the new one.
> +      This allows for race-free replacement of matches.
> +
> +  struct kdbus_item items[0];
> +    Items to define the actual rules of the matches. The following item types
> +    are expected. Each item will cause one new match rule to be created.
> +
> +    KDBUS_ITEM_BLOOM_MASK
> +      An item that carries the bloom filter mask to match against in its
> +      data field. The payload size must match the bloom filter size that
> +      was specified when the bus was created.
> +      See section 10.4 for more information.
> +
> +    KDBUS_ITEM_NAME
> +      Specify a name that a sending connection must own at a time of sending
> +      a broadcast message in order to match this rule.
> +
> +    KDBUS_ITEM_ID
> +      Specify a sender connection's ID that will match this rule.
> +
> +    KDBUS_ITEM_NAME_ADD
> +    KDBUS_ITEM_NAME_REMOVE
> +    KDBUS_ITEM_NAME_CHANGE
> +      These items request delivery of broadcast messages that describe a name
> +      acquisition, loss, or change. The details are stored in the item's
> +      kdbus_notify_name_change member. All information specified must be
> +      matched in order to make the message pass. Use KDBUS_MATCH_ID_ANY to
> +      match against any unique connection ID.
> +
> +    KDBUS_ITEM_ID_ADD
> +    KDBUS_ITEM_ID_REMOVE
> +      These items request delivery of broadcast messages that are generated
> +      when a connection is created or terminated. struct kdbus_notify_id_change
> +      is used to store the actual match information. This item can be used to
> +      monitor one particular connection ID, or, when the id field is set to
> +      KDBUS_MATCH_ID_ANY, all of them.
> +
> +    Other item types are ignored.
> +};
> +
> +
> +10.4 Bloom filters
> +------------------
> +
> +Bloom filters allow checking whether a given word is present in a dictionary.
> +This allows connections to set up a mask for information it is interested in,
> +and will be delivered broadcast messages that have a matching filter.
> +
> +For general information on bloom filters, see
> +
> +  https://en.wikipedia.org/wiki/Bloom_filter
> +
> +The size of the bloom filter is defined per bus when it is created, in
> +kdbus_bloom_parameter.size. All bloom filters attached to broadcast messages
> +on the bus must match this size, and all bloom filter matches uploaded by
> +connections must also match the size, or a multiple thereof (see below).
> +
> +The calculation of the mask has to be done on the userspace side. The kernel
> +just checks the bitmasks to decide whether or not to let the message pass. All
> +bits in the mask must match the filter in and bit-wise AND logic, but the
> +mask may have more bits set than the filter. Consequently, false positive
> +matches are expected to happen, and userspace must deal with that fact.
> +
> +Masks are entities that are always passed to the kernel as part of a match
> +(with an item of type KDBUS_ITEM_BLOOM_MASK), and filters can be attached to
> +broadcast messages (with an item of type KDBUS_ITEM_BLOOM_FILTER).
> +
> +For a broadcast to match, all set bits in the filter have to be set in the
> +installed match mask as well. For example, consider a bus has a bloom size
> +of 8 bytes, and the following mask/filter combinations:
> +
> +    filter  0x0101010101010101
> +    mask    0x0101010101010101
> +            -> matches
> +
> +    filter  0x0303030303030303
> +    mask    0x0101010101010101
> +            -> doesn't match
> +
> +    filter  0x0101010101010101
> +    mask    0x0303030303030303
> +            -> matches
> +
> +Hence, in order to catch all messages, a mask filled with 0xff bytes can be
> +installed as a wildcard match rule.
> +
> +Uploaded matches may contain multiple masks, each of which in the size of the
> +bloom size defined by the bus. Each block of a mask is called a 'generation',
> +starting at index 0.
> +
> +At match time, when a broadcast message is about to be delivered, a bloom
> +mask generation is passed, which denotes which of the bloom masks the filter
> +should be matched against. This allows userspace to provide backward compatible
> +masks at upload time, while older clients can still match against older
> +versions of filters.
> +
> +
> +10.5 Removing a match
> +--------------------
> +
> +Matches can be removed through the KDBUS_CMD_MATCH_REMOVE ioctl, which again
> +takes struct kdbus_cmd_match as argument, but its fields are used slightly
> +differently.
> +
> +struct kdbus_cmd_match {
> +  __u64 size;
> +    The overall size of the struct. As it has no items in this use case, the
> +    value should yield 16.
> +
> +  __u64 cookie;
> +    The cookie of the match, as it was passed when the match was added.
> +    All matches that have this cookie will be removed.
> +
> +  __u64 flags;
> +    Unused for this use case,
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  struct kdbus_item items[0];
> +    Unused for this use case.
> +};
> +
> +
> +11. Policy
> +===============================================================================
> +
> +A policy databases restrict the possibilities of connections to own, see and
> +talk to well-known names. It can be associated with a bus (through a policy
> +holder connection) or a custom endpoint.
> +
> +See section 8.1 for more details on the validity of well-known names.
> +
> +Default endpoints of buses always have a policy database. The default
> +policy is to deny all operations except for operations that are covered by
> +implicit policies. Custom endpoints always have a policy, and by default,
> +a policy database is empty. Therefore, unless policy rules are added, all
> +operations will also be denied by default.
> +
> +See section 11.5 for more details on implicit policies.
> +
> +A set of policy rules is described by a name and multiple access rules, defined
> +by the following struct.
> +
> +struct kdbus_policy_access {
> +  __u64 type;	/* USER, GROUP, WORLD */
> +    One of the following.
> +
> +    KDBUS_POLICY_ACCESS_USER
> +      Grant access to a user with the uid stored in the 'id' field.
> +
> +    KDBUS_POLICY_ACCESS_GROUP
> +      Grant access to a user with the gid stored in the 'id' field.
> +
> +    KDBUS_POLICY_ACCESS_WORLD
> +      Grant access to everyone. The 'id' field is ignored.
> +
> +  __u64 access;	/* OWN, TALK, SEE */
> +    The access to grant.
> +
> +    KDBUS_POLICY_SEE
> +      Allow the name to be seen.
> +
> +    KDBUS_POLICY_TALK
> +      Allow the name to be talked to.
> +
> +    KDBUS_POLICY_OWN
> +      Allow the name to be owned.
> +
> +  __u64 id;
> +    For KDBUS_POLICY_ACCESS_USER, stores the uid.
> +    For KDBUS_POLICY_ACCESS_GROUP, stores the gid.
> +};
> +
> +Policies are set through KDBUS_CMD_HELLO (when creating a policy holder
> +connection), KDBUS_CMD_CONN_UPDATE (when updating a policy holder connection),
> +KDBUS_CMD_ENDPOINT_MAKE (creating a custom endpoint) or
> +KDBUS_CMD_ENDPOINT_UPDATE (updating a custom endpoint). In all cases, the name
> +and policy access information is stored in items of type KDBUS_ITEM_NAME and
> +KDBUS_ITEM_POLICY_ACCESS. For this transport, the following rules apply.
> +
> +  * An item of type KDBUS_ITEM_NAME must be followed by at least one
> +    KDBUS_ITEM_POLICY_ACCESS item
> +  * An item of type KDBUS_ITEM_NAME can be followed by an arbitrary number of
> +    KDBUS_ITEM_POLICY_ACCESS items
> +  * An arbitrary number of groups of names and access levels can be passed
> +
> +uids and gids are internally always stored in the kernel's view of global ids,
> +and are translated back and forth on the ioctl level accordingly.
> +
> +
> +11.2 Wildcard names
> +-------------------
> +
> +Policy holder connections may upload names that contain the wildcard suffix
> +(".*"). That way, a policy can be uploaded that is effective for every
> +well-kwown name that extends the provided name by exactly one more level.
> +
> +For example, if an item of a set up uploaded policy rules contains the name
> +"foo.bar.*", both "foo.bar.baz" and "foo.bar.bazbaz" are valid, but
> +"foo.bar.baz.baz" is not.
> +
> +This allows connections to take control over multiple names that the policy
> +holder doesn't need to know about when uploading the policy.
> +
> +Such wildcard entries are not allowed for custom endpoints.
> +
> +
> +11.3 Policy example
> +-------------------
> +
> +For example, a set of policy rules may look like this:
> +
> +  KDBUS_ITEM_NAME: str='org.foo.bar'
> +  KDBUS_ITEM_POLICY_ACCESS: type=USER, access=OWN, id=1000
> +  KDBUS_ITEM_POLICY_ACCESS: type=USER, access=TALK, id=1001
> +  KDBUS_ITEM_POLICY_ACCESS: type=WORLD, access=SEE
> +  KDBUS_ITEM_NAME: str='org.blah.baz'
> +  KDBUS_ITEM_POLICY_ACCESS: type=USER, access=OWN, id=0
> +  KDBUS_ITEM_POLICY_ACCESS: type=WORLD, access=TALK
> +
> +That means that 'org.foo.bar' may only be owned by uid 1000, but every user on
> +the bus is allowed to see the name. However, only uid 1001 may actually send
> +a message to the connection and receive a reply from it.
> +
> +The second rule allows 'org.blah.baz' to be owned by uid 0 only, but every user
> +may talk to it.
> +
> +
> +11.4 TALK access and multiple well-known names per connection
> +-------------------------------------------------------------
> +
> +Note that TALK access is checked against all names of a connection.
> +For example, if a connection owns both 'org.foo.bar' and 'org.blah.baz', and
> +the policy database allows 'org.blah.baz' to be talked to by WORLD, then this
> +permission is also granted to 'org.foo.bar'. That might sound illogical, but
> +after all, we allow messages to be directed to either the name or a well-known
> +name, and policy is applied to the connection, not the name. In other words,
> +the effective TALK policy for a connection is the most permissive of all names
> +the connection owns.
> +
> +If a policy database exists for a bus (because a policy holder created one on
> +demand) or for a custom endpoint (which always has one), each one is consulted
> +during name registry listing, name owning or message delivery. If either one
> +fails, the operation is failed with -EPERM.
> +
> +For best practices, connections that own names with a restricted TALK
> +access should not install matches. This avoids cases where the sent
> +message may pass the bloom filter due to false-positives and may also
> +satisfy the policy rules.
> +
> +11.5 Implicit policies
> +----------------------
> +
> +Depending on the type of the endpoint, a set of implicit rules might be
> +enforced. On default endpoints, the following set is enforced:
> +
> +  * Privileged connections always override any installed policy. Those
> +    connections could easily install their own policies, so there is no
> +    reason to enforce installed policies.
> +  * Connections can always talk to connections of the same user. This
> +    includes broadcast messages.
> +  * Connections that own names might send broadcast messages to other
> +    connections that belong to a different user, but only if that
> +    destination connection does not own any name.
> +
> +Custom endpoints have stricter policies. The following rules apply:
> +
> +  * Policy rules are always enforced, even if the connection is a privileged
> +    connection.
> +  * Policy rules are always enforced for TALK access, even if both ends are
> +    running under the same user. This includes broadcast messages.
> +  * To restrict the set of names that can be seen, endpoint policies can
> +    install "SEE" policies.
> +
> +
> +12. Pool
> +===============================================================================
> +
> +A pool for data received from the kernel is installed for every connection of
> +the bus, and is sized according to kdbus_cmd_hello.pool_size. It is accessed
> +when one of the following ioctls is issued:
> +
> +  * KDBUS_CMD_MSG_RECV, to receive a message
> +  * KDBUS_CMD_NAME_LIST, to dump the name registry
> +  * KDBUS_CMD_CONN_INFO, to retrieve information on a connection
> +
> +Internally, the pool is organized in slices, stored in an rb-tree. The offsets
> +returned by either one of the aforementioned ioctls describe offsets inside the
> +pool. In order to make the slice available for subsequent calls, KDBUS_CMD_FREE
> +has to be called on the offset.
> +
> +To access the memory, the caller is expected to mmap() it to its task, like
> +this:
> +
> +  /*
> +   * POOL_SIZE has to be a multiple of PAGE_SIZE, and it must match the
> +   * value that was previously passed in the .pool_size field of struct
> +   * kdbus_cmd_hello.
> +   */
> +
> +  buf = mmap(NULL, POOL_SIZE, PROT_READ, MAP_PRIVATE, conn_fd, 0);
> +
> +
> +13. Metadata
> +===============================================================================
> +
> +When a message is delivered to a receiver connection, it is augmented by
> +metadata items in accordance to the destination's current attach flags. The
> +information stored in those metadata items refer to the sender task at the
> +time of sending the message, so even if any detail of the sender task has
> +already changed upon message reception (or if the sender task does not exist
> +anymore), the information is still preserved and won't be modfied until the
> +message is freed.
> +
> +Note that there are two exceptions to the above rules:
> +
> +  a) Kernel generated messages don't have a source connection, so they won't be
> +     augmented.
> +
> +  b) If a connection was created with faked credentials (see section 6.2),
> +     the only attached metadata items are the ones provided by the connection
> +     itself. The destination's attach_flags won't be looked at in such cases.
> +
> +Also, there are two things to be considered by userspace programs regarding
> +those metadata items:
> +
> +  a) Userspace must cope with the fact that it might get more metadata than
> +     they requested. That happens, for example, when a broadcast message is
> +     sent and receivers have different attach flags. Items that haven't been
> +     requested should hence be silently ignored.
> +
> +  b) Userspace might not always get all requested metadata items that it
> +     requested. That is because some of those items are only added if a
> +     corresponding kernel feature has been enabled. Also, the two exceptions
> +     described above will as well lead to less items be attached than
> +     requested.
> +
> +
> +13.1 Known item types
> +---------------------
> +
> +The following attach flags are currently supported.
> +
> +  KDBUS_ATTACH_TIMESTAMP
> +    Attaches an item of type KDBUS_ITEM_TIMESTAMP which contains both the
> +    monotonic and the realtime timestamp, taken when the message was
> +    processed on the kernel side.
> +
> +  KDBUS_ATTACH_CREDS
> +    Attaches an item of type KDBUS_ITEM_CREDS, containing credentials as
> +    described in kdbus_creds: the uid, gid, pid, tid and starttime of the task.
> +
> +  KDBUS_ATTACH_AUXGROUPS
> +    Attaches an item of type KDBUS_ITEM_AUXGROUPS, containing a dynamic
> +    number of auxiliary groups the sending task was a member of.
> +
> +  KDBUS_ATTACH_NAMES
> +    Attaches items of type KDBUS_ITEM_NAME, one for each name the sending
> +    connection currently owns. The name is stored in kdbus_item.str for each
> +    of them.
> +
> +  KDBUS_ATTACH_COMM
> +    Attaches an items of type KDBUS_ITEM_PID_COMM and KDBUS_ITEM_TID_COMM,
> +    both transporting the sending task's 'comm', for both the pid and the tid.
> +    The strings are stored in kdbus_item.str.
> +
> +  KDBUS_ATTACH_EXE
> +    Attaches an item of type KDBUS_ITEM_EXE, containing the path to the
> +    executable of the sending task, stored in kdbus_item.str.
> +
> +  KDBUS_ATTACH_CMDLINE
> +    Attaches an item of type KDBUS_ITEM_CMDLINE, containing the command line
> +    arguments of the sending task, as an array of strings, stored in
> +    kdbus_item.str.
> +
> +  KDBUS_ATTACH_CGROUP
> +    Attaches an item of type KDBUS_ITEM_CGROUP with the task's cgroup path.
> +
> +  KDBUS_ATTACH_CAPS
> +    Attaches an item of type KDBUS_ITEM_CAPS, carrying sets of capabilities
> +    that should be accessed via kdbus_item.caps.caps. Also, userspace should
> +    be written in a way that it takes kdbus_item.caps.last_cap into account,
> +    and derive the number of sets and rows from the item size and the reported
> +    number of valid capability bits.
> +
> +  KDBUS_ATTACH_SECLABEL
> +    Attaches an item of type KDBUS_ITEM_SECLABEL, which contains the SELinux
> +    security label of the sending task. Access via kdbus_item->str.
> +
> +  KDBUS_ATTACH_AUDIT
> +    Attaches an item of type KDBUS_ITEM_AUDIT, which contains the audio label
> +    of the sending taskj. Access via kdbus_item->str.
> +
> +  KDBUS_ATTACH_CONN_NAME
> +    Attaches an item of type KDBUS_ITEM_CONN_NAME that contain's the
> +    sending's connection current name in kdbus_item.str.
> +
> +
> +13.1 Metadata and namespaces
> +----------------------------
> +Note that if the user or PID namespaces of a connection at the time of sending
> +differ from those that were active then the connection was created
> +(KDBUS_CMD_HELLO), data structures such as messages will not have any metadata
> +attached to prevent leaking security-relevant information.
> +
> +
> +14. Error codes
> +===============================================================================
> +
> +Below is a list of error codes that might be returned by the individual
> +ioctl commands. The list focuses on the return values from kdbus code itself,
> +and might not cover those of all kernel internal functions.
> +
> +For all ioctls:
> +
> +  -ENOMEM	The kernel memory is exhausted
> +  -ENOTTY	Illegal ioctl command issued for the file descriptor
> +  -ENOSYS	The requested functionality is not available
> +
> +For all ioctls that carry a struct as payload:
> +
> +  -EFAULT	The supplied data pointer was not 64-bit aligned, or was
> +		inaccessible from the kernel side.
> +  -EINVAL	The size inside the supplied struct was smaller than expected
> +  -EMSGSIZE	The size inside the supplied struct was bigger than expected
> +  -ENAMETOOLONG	A supplied name is larger than the allowed maximum size
> +
> +For KDBUS_CMD_BUS_MAKE:
> +
> +  -EINVAL	The flags supplied in the kdbus_cmd_make struct are invalid or
> +		the supplied name does not start with the current uid and a '-'
> +  -EEXIST	A bus of that name already exists
> +  -ESHUTDOWN	The domain for the bus is already shut down
> +  -EMFILE	The maximum number of buses for the current user is exhausted
> +
> +For KDBUS_CMD_DOMAIN_MAKE:
> +
> +  -EPERM	The calling user does not have CAP_IPC_OWNER set, or
> +  -EINVAL	The flags supplied in the kdbus_cmd_make struct are invalid, or
> +		no name supplied for top-level domain
> +  -EEXIST	A domain of that name already exists
> +
> +For KDBUS_CMD_ENDPOINT_MAKE:
> +
> +  -EPERM	The calling user is not privileged (see Terminology)
> +  -EINVAL	The flags supplied in the kdbus_cmd_make struct are invalid
> +  -EEXIST	An endpoint of that name already exists
> +
> +For KDBUS_CMD_HELLO:
> +
> +  -EFAULT	The supplied pool size was 0 or not a multiple of the page size
> +  -EINVAL	The flags supplied in the kdbus_cmd_make struct are invalid, or
> +		an illegal combination of KDBUS_HELLO_MONITOR,
> +		KDBUS_HELLO_ACTIVATOR and KDBUS_HELLO_POLICY_HOLDER was passed
> +		in the flags, or an invalid set of items was supplied
> +  -EPERM	An KDBUS_ITEM_CREDS items was supplied, but the current user is
> +		not privileged
> +  -ESHUTDOWN	The bus has already been shut down
> +  -EMFILE	The maximum number of connection on the bus has been reached
> +
> +For KDBUS_CMD_BYEBYE:
> +
> +  -EALREADY	The connection has already been shut down
> +  -EBUSY	There are still messages queued up in the connection's pool
> +
> +For KDBUS_CMD_MSG_SEND:
> +
> +  -EOPNOTSUPP	The connection is unconnected, or a fd was passed that is
> +		either a kdbus handle itself or a unix domain socket. Both is
> +		currently unsupported.
> +  -EINVAL	The submitted payload type is KDBUS_PAYLOAD_KERNEL,
> +		KDBUS_MSG_FLAGS_EXPECT_REPLY was set without a timeout value,
> +		KDBUS_MSG_FLAGS_SYNC_REPLY was set without
> +		KDBUS_MSG_FLAGS_EXPECT_REPLY, an invalid item was supplied,
> +		src_id was != 0 and different from the current connection's ID,
> +		a supplied memfd had a size of 0, a string was not properly
> +		nul-terminated
> +  -ENOTUNIQ	KDBUS_MSG_FLAGS_EXPECT_REPLY was set, but the dst_id is set
> +		to KDBUS_DST_ID_BROADCAST
> +  -E2BIG	Too many items
> +  -EMSGSIZE	A payload vector was too big, and the current user is
> +		unprivileged.
> +  -ENOTUNIQ	A fd or memfd payload was passed in a broadcast message, or
> +		a timeout was given for a broadcast message
> +  -EEXIST	Multiple KDBUS_ITEM_FDS or KDBUS_ITEM_BLOOM_FILTER,
> +		KDBUS_ITEM_DST_NAME were supplied
> +  -EBADF	A memfd item contained an illegal fd
> +  -EMEDIUMTYPE	A file descriptor which is not a kdbus memfd was
> +		refused to send as KDBUS_MSG_PAYLOAD_MEMFD.
> +  -EMFILE	Too many file descriptors inside a KDBUS_ITEM_FDS
> +  -EBADMSG	An item had illegal size, both a dst_id and a
> +		KDBUS_ITEM_DST_NAME was given, or both a name and a bloom
> +		filter was given
> +  -ETXTBSY	A kdbus memfd file cannot be sealed or the seal removed,
> +		because it is shared with other processes or still mmap()ed
> +  -ECOMM	A peer does not accept the file descriptors addressed to it
> +  -EFAULT	The supplied bloom filter size was not 64-bit aligned
> +  -EDOM		The supplied bloom filter size did not match the bloom filter
> +		size of the bus
> +  -EDESTADDRREQ	dst_id was set to KDBUS_DST_ID_NAME, but no KDBUS_ITEM_DST_NAME
> +		was attached
> +  -ESRCH	The name to look up was not found in the name registry
> +  -EADDRNOTAVAIL KDBUS_MSG_FLAGS_NO_AUTO_START was given but the destination
> +		 connection is an activator.
> +  -ENXIO	The passed numeric destination connection ID couldn't be found,
> +		or is not connected
> +  -ECONNRESET	The destination connection is no longer active
> +  -ETIMEDOUT	Timeout while synchronously waiting for a reply
> +  -EINTR	System call interrupted while synchronously waiting for a reply
> +  -EPIPE	When sending a message, a synchronous reply from the receiving
> +		connection was expected but the connection died before
> +		answering
> +  -ECANCELED	A synchronous message sending was cancelled
> +  -ENOBUFS	Too many pending messages on the receiver side
> +  -EREMCHG	Both a well-known name and a unique name (ID) was given, but
> +		the name is not currently owned by that connection.
> +
> +For KDBUS_CMD_MSG_RECV:
> +
> +  -EINVAL	Invalid flags or offset
> +  -EAGAIN	No message found in the queue
> +  -ENOMSG	No message of the requested priority found
> +
> +For KDBUS_CMD_MSG_CANCEL:
> +
> +  -EINVAL	Invalid flags
> +  -ENOENT	Pending message with the supplied cookie not found
> +
> +For KDBUS_CMD_FREE:
> +
> +  -ENXIO	No pool slice found at given offset
> +  -EINVAL	Invalid flags provided, the offset is valid, but the user is
> +		not allowed to free the slice. This happens, for example, if
> +		the offset was retrieved with KDBUS_RECV_PEEK.
> +
> +For KDBUS_CMD_NAME_ACQUIRE:
> +
> +  -EINVAL	Illegal command flags, illegal name provided, or an activator
> +		tried to acquire a second name
> +  -EPERM	Policy prohibited name ownership
> +  -EALREADY	Connection already owns that name
> +  -EEXIST	The name already exists and can not be taken over
> +  -ECONNRESET	The connection was reset during the call
> +
> +For KDBUS_CMD_NAME_RELEASE:
> +
> +  -EINVAL	Invalid command flags, or invalid name provided
> +  -ESRCH	Name is not found found in the registry
> +  -EADDRINUSE	Name is owned by a different connection and can't be released
> +
> +For KDBUS_CMD_NAME_LIST:
> +
> +  -EINVAL	Invalid flags
> +  -ENOBUFS	No available memory in the connection's pool.
> +
> +For KDBUS_CMD_CONN_INFO:
> +
> +  -EINVAL	Invalid flags, or neither an ID nor a name was provided,
> +		or the name is invalid.
> +  -ESRCH	Connection lookup by name failed
> +  -ENXIO	No connection with the provided number connection ID found
> +
> +For KDBUS_CMD_CONN_UPDATE:
> +
> +  -EINVAL	Illegal flags or items
> +  -EOPNOTSUPP	Operation not supported by connection.
> +  -E2BIG	Too many policy items attached
> +  -EINVAL	Wildcards submitted in policy entries, or illegal sequence
> +		of policy items
> +
> +For KDBUS_CMD_ENDPOINT_UPDATE:
> +
> +  -E2BIG	Too many policy items attached
> +  -EINVAL	Invalid flags, or wildcards submitted in policy entries,
> +		or illegal sequence of policy items
> +
> +For KDBUS_CMD_MATCH_ADD:
> +
> +  -EINVAL	Illegal flags or items
> +  -EDOM		Illegal bloom filter size
> +  -EMFILE	Too many matches for this connection
> +
> +For KDBUS_CMD_MATCH_REMOVE:
> +
> +  -EINVAL	Illegal flags
> +  -ENOENT	A match entry with the given cookie could not be found.
> 

-- 

Peter Meerwald
+43-664-2444418 (mobile)

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Simon McVittie @ 2014-10-30 12:28 UTC (permalink / raw)
  To: Tom Gundersen, Andy Lutomirski
  Cc: Greg Kroah-Hartman, Jiri Kosina, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, John Stultz,
	Arnd Bergmann, Tejun Heo, Marcel Holtmann, Ryan Lortie,
	Bastien Nocera, David Herrmann, Djalal Harouni, Daniel Mack,
	alban.crequy, Javier Martinez Canillas
In-Reply-To: <CAG-2HqX9RUQHiF1U_CXiDVVLS-7aUOQdYn7EVNSMZNdbe38cTA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 30/10/14 11:52, Tom Gundersen wrote:
> For example, if you want to get the audit identity
> bits, you can now get this attached securely by the kernel, at the
> time the message is sent, rather than having to firest get the peer's
> $PID from SCM_CREDENTIALS and then read the audit identity bits racily
> from /proc/$PID/loginuid and /proc/$PID/sessionid

... which dbus-daemon (traditional D-Bus) deliberately doesn't offer as
a feature, because we are not aware of any way to do that over Unix
sockets without a race condition; and if we can't have it securely, we
don't want to have it at all.
<https://bugs.freedesktop.org/show_bug.cgi?id=83499>
It would be great if kdbus can fix that omission.

Capabilities are in the same boat, and as a result, systemd can't
currently have D-Bus methods that can only be called with CAP_WHATEVER.

> * fewer userspace context switches
[...]
> * fewer message copies in userspace

Readers are probably already aware of this, but note that D-Bus is
designed to be usable between mutually distrusting processes, which is
why we use Unix sockets and a lot of copies, rather than mmap or something.

    S

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Andy Lutomirski @ 2014-10-30 13:48 UTC (permalink / raw)
  To: Tom Gundersen
  Cc: Djalal Harouni, Arnd Bergmann, Ryan Lortie, Greg Kroah-Hartman,
	Marcel Holtmann, David Herrmann,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Simon McVittie, John Stultz, Eric W. Biederman, Bastien Nocera,
	Linux API, Tejun Heo, Linux Containers, Linus Torvalds,
	Javier Martinez Canillas, Daniel Mack
In-Reply-To: <CAG-2HqUChohNrRSdXzckSiv8ZUYwFLMvRTc41Uo7-b-qmkSFMQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Oct 30, 2014 at 3:15 AM, Tom Gundersen <teg-B22kvLQNl6c@public.gmane.org> wrote:
> Do I understand you correctly that what you want is unnamed/anonymous
> domains? Considering that domain creation is anyway privileged, why is
> this necessary?

As an executive summary, this is the *problem*, not a mitigation.
Domain creation *should not require privilege*.  You should be able to
do it in a user namespace in which you have appropriate capabilities
without needing systemd's (or whatever other daemon's) help from
outside.

Once you fix that (which may not have broken whatever you tested with
but will absolutely break anyone who tries to use this in LXC, Docker,
Sandstorm, etc. without awful hacks) then you will have all of the
problems that you've currently mitigated.

--Andy

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Andy Lutomirski @ 2014-10-30 13:59 UTC (permalink / raw)
  To: Tom Gundersen
  Cc: Greg Kroah-Hartman, Jiri Kosina, Linux API,
	linux-kernel@vger.kernel.org, John Stultz, Arnd Bergmann,
	Tejun Heo, Marcel Holtmann, Ryan Lortie, Bastien Nocera,
	David Herrmann, Djalal Harouni, Simon McVittie, Daniel Mack,
	alban.crequy, Javier Martinez Canillas
In-Reply-To: <CAG-2HqX9RUQHiF1U_CXiDVVLS-7aUOQdYn7EVNSMZNdbe38cTA@mail.gmail.com>

On Thu, Oct 30, 2014 at 4:52 AM, Tom Gundersen <teg@jklm.no> wrote:
> On 10/30/2014 12:55 AM, Andy Lutomirski wrote:> It's worth noting that:
>>
>>  - Proper credential passing could be added to UNIX sockets, and we
>> may want to do that anyway.  Also, the current kdbus semantics seem to
>> be "spew lots of credentials and other miscellaneous
>> potentially-sensitive and sometime spoofable information all over the
>> place", which isn't obviously an improvement.  (This is fixable, but
>> it will almost certainly not be compatible with current systemd kdbus
>> code if fixed.)
>
> Care to elaborate on what you think is spoofable, and what needs to be fixed?

cmd and comm are trivially replaceable by any sender.

>
> Anyway, the idea is that by simply connecting to the bus and sending a
> message to some service, you implicitly agree to passing some metadata
> along to the service (and to a lesser extent to the bus). It's not
> that this information is leaked, or that the peer could actively
> access any of the sender's private memory.

To me, this smells like bad design.  By using kdbus, I implicitly
agree to send everyone my command line?!?  If I'm in a cgroup that
policy decrees should be privileged, then I should invoke that
privilege by specifically asking, *at the time of capture*, to send
that cgroup.  Otherwise it becomes unclear what things convey
privilege when, and that will lead immediately to incomprehensible
security models, and that will lead to exploits.

<snark>Sorry, but "implicitly agree" sounds a lot like using my
esteemed cellphone carrier.  When I use it, some argue that I
implicitly agree to have my identity prepended to all outgoing HTTP
requests.  This is *not* a good thing.</snark>

> Also note that this kind of
> metadata information is also available via /proc/$PID, and via
> SCM_CREDENTIALS/SO_PEERCRED and the socket seclabel APIs.

Not if you have a sensible LSM policy or if you use hidepid.  And,
once you've fixed the namespacing issues, not if the sender and
receiver are in different PID namespaces or if they don't have /proc
mounted at all.

>
> When credential information is passed between processes of different
> (PID) namespaces most of the attached metadata is suppressed.

This is a bug.  It prevents users from usefully sandboxing themselves
in a kdbus world.  If you create and enter a user namespace, then your
outside identity (which should be unchanged) is suppressed.  (Note
that anything that captures credentials other than at open time is
also an issue for sandboxes in the other direction: it may interfere
with selective privilege dropping.)

> This
> isn't too different from how SCM_CREDENTIALS works, which will zero
> out the bits it cannot translate as well.

SCM_CREDENTIALS translates the translatable parts.

> There are some major benefits regarding performance:
>
> * fewer userspace context switches. For a full-duplex method call it's
> down from five to two: instead of sender -> dbus daemon -> service ->
> dbus daemon -> sender it's just sender -> service -> sender.
> * fewer message copies in userspace. For a full-duplex method call
> it's down from eight to two: instead of copying the method call data
> into a socket, out of a socket, into a socket, out of a socket, and
> the same for the method reply, we just copy one message directly to
> the receiver, and the reply back.
> * generally fewer syscalls involved. A synchronous method call is now
> doable in a single ioctl on the sender side.
> * memfds can be used for transport purposes of larger payload. This
> way, we can cover substantial payload sizes instead of just small
> control messages, with no extra copies. kdbus, in its transport layer,
> makes sure only sealed memfds are passed in as payload, so the sender
> cannot modify the contents while the receiver is already parsing it.

There should be a number measured in, say, nanoseconds in here
somewhere.  The actual extent of the speedup is unmeasurable here.
Also, it's worth reading at least one of Linus' many rants about
zero-copy.  It's not an automatic win.

--Andy

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox