* Re: [PATCH 01/13] kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-20 13:53 UTC (permalink / raw)
To: Greg Kroah-Hartman, arnd-r2nGTMty4D4,
ebiederm-aS9lmoZGLiVWk0Htik3J/w,
gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
daniel-cYrQPVfZooxQFI55V6+gNQ, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
tixxdz-Umm1ozX2/EEdnm+yROfE0A, Daniel Mack, Johannes Stezenbach
In-Reply-To: <1421435777-25306-2-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>
On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>
> kdbus is a system for low-latency, low-overhead, easy to use
> interprocess communication (IPC).
>
> The interface to all functions in this driver is implemented via ioctls
> on files exposed through a filesystem called 'kdbusfs'. The default
> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
> documentation about the kernel level API design.
I have some details feedback on the contents of this file, and some
bigger questions. I'll split them out into separate mails.
So here, the bigger, general questions to start with. I've arrived late
to this, so sorry if they've already been discussed, but the answers to
some of the questions should actually be in this file, I would have
expected.
This is an enormous and complex API. Why is the API ioctl() based,
rather than system-call-based? Have we learned nothing from the hydra
that the futex() multiplexing syscall became? (And kdbus is an order
of magnitude more complex, by the look of things.) At the very least,
a *good* justification of why the API is ioctl()-based should be part
of this documentation file.
An observation: The documentation below is substantial, but this API is
enormous, so the documentation still feels rather thin. What would
really help would be some example code in the doc.
And on the subject of code examples... Are there any (prototype)
working user-space applications that exercise the current kdbus
implementation? That is, can I install these kdbus patches, and
then find a simple example application somewhere that does
something to exercise kdbus?
And then: is there any substantial real-world application (e.g., a
full D-Bus port) that is being developed in tandem with this kernel
side patch? (I don't mean a user-space library; I mean a seriously
large application.) This is an incredibly complex API whose
failings are only going to become evident through real-world use.
Solidifying an API in the kernel and then discovering the API
problems later when writing real-world applications would make for
a sad story. A story something like that of inotify, an API which
is an order of magnitude less complex than kdbus. (I can't help but
feel that many of inotify problems that I discuss at
https://lwn.net/Articles/605128/ might have been fixed or mitigated
if a few real-world applications had been implemented before the
API was set in stone.)
> +For a kdbus specific userspace library implementation please refer to:
> + http://cgit.freedesktop.org/systemd/systemd/tree/src/systemd/sd-bus.h
Is this library intended just for systemd? More generally, is there an
intention to provide a general purpose library API for kdbus? Or is the
intention that each application will roll a library suitable to its
needs? I think an answer to that question would be useful in this
Documentation file.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: [PATCH 01/13] kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-20 13:58 UTC (permalink / raw)
To: Greg Kroah-Hartman, arnd-r2nGTMty4D4,
ebiederm-aS9lmoZGLiVWk0Htik3J/w,
gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
daniel-cYrQPVfZooxQFI55V6+gNQ, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
tixxdz-Umm1ozX2/EEdnm+yROfE0A, Daniel Mack
In-Reply-To: <1421435777-25306-2-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>
On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>
> kdbus is a system for low-latency, low-overhead, easy to use
> interprocess communication (IPC).
>
> The interface to all functions in this driver is implemented via ioctls
> on files exposed through a filesystem called 'kdbusfs'. The default
> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
> documentation about the kernel level API design.
And now the details feedback.
Please note that for the various points I raise below, even in
cases where I don't suggest/request a fix, the fact that I've
needed to answer a question probably suggests a deficiency in
the documentation that probably needs to be remedied.
Many of my comments below are wording and typo fixes. While
these may sem trivial, the existence of various wording problems
and typos is a significant distraction, especially while trying
to grok a document of this size.
I've marked one or two notable questions about the API with "###".
> Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> Signed-off-by: David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Djalal Harouni <tixxdz-Umm1ozX2/EEdnm+yROfE0A@public.gmane.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>
> ---
> Documentation/kdbus.txt | 2107 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 2107 insertions(+)
> create mode 100644 Documentation/kdbus.txt
>
> diff --git a/Documentation/kdbus.txt b/Documentation/kdbus.txt
> new file mode 100644
> index 000000000000..2592a7e37079
> --- /dev/null
> +++ b/Documentation/kdbus.txt
> @@ -0,0 +1,2107 @@
> +D-Bus is a system for powerful, easy to use interprocess communication (IPC).
> +
> +The focus of this document is an overview of the low-level, native kernel D-Bus
> +transport called kdbus. Kdbus exposes its functionality via files in a
> +filesystem called 'kdbusfs'. All communication between processes takes place
> +via ioctls on files exposed through the mount point of a kdbusfs. The default
> +mount point of kdbusfs is /sys/fs/kdbus.
> +
> +Please note that kdbus was designed as transport layer for D-Bus, but is in no
> +way limited, nor controlled by the D-Bus protocol specification. The D-Bus
> +protocol is one possible application layer on top of kdbus.
> +
> +For the general D-Bus protocol specification, the payload format, the
> +marshaling, and the communication semantics, please refer to:
> + http://dbus.freedesktop.org/doc/dbus-specification.html
> +
> +For a kdbus specific userspace library implementation please refer to:
> + http://cgit.freedesktop.org/systemd/systemd/tree/src/systemd/sd-bus.h
> +
> +Articles about D-Bus and kdbus:
> + http://lwn.net/Articles/580194/
> +
> +
> +1. Terminology
> +===============================================================================
> +
> + Domain:
> + A domain is created each time a kdbusfs is mounted. Each process that is
> + capable to mount a new instance of a kdbusfs will have its own kdbus
s/is capable to mount/mounts/
> + hierarchy. Each domain (ie, each mount point) offers its own "control"
> + file to create new buses. Domains have no connection to each other and
> + cannot see nor talk to each other. See section 5 for more details.
Smoother would be:
s/cannot see nor talk/can neither see nor talk/
> +
> + Bus:
> + A bus is a named object inside a domain. Clients exchange messages
> + over a bus. Multiple buses themselves have no connection to each other;
> + messages can only be exchanged on the same bus. The default endpoint of
> + a bus, where clients establish the connection to, is the "bus" file
Maybe:
where clients establish the connection to
==>
to which clients establish connections
?
> + /sys/fs/kdbus/<bus name>/bus.
> + Common operating system setups create one "system bus" per system, and one
> + "user bus" for every logged-in user. Applications or services may create
At the kdbus level is there any difference between such system and user buses?
If not, it would perhaps be good to insert a parenthetical aside to say
that.
> + their own private buses. See section 5 for more details.
> +
> + Endpoint:
> + An endpoint provides a file to talk to a bus. Opening an endpoint
> + creates a new connection to the bus to which the endpoint belongs. All
> + endpoints have unique names and are accessible as files underneath the
> + directory of a bus, e.g., /sys/fs/kdbus/<bus>/<endpoint>
> + Every bus has a default endpoint called "bus". A bus can optionally offer
> + additional endpoints with custom names to provide restricted access to the
> + bus. Custom endpoints carry additional policy which can be used to create
> + sandboxes with locked-down, limited, filtered access to a bus. See
> + section 5 for more details.
> +
> + Connection:
> + A connection to a bus is created by opening an endpoint file of a bus and
> + becoming an active client with the HELLO exchange. Every ordinary client
> + connection has a unique identifier on the bus and can address messages to
> + every other connection on the same bus by using the peer's connection id
> + as the destination. See section 6 for more details.
> +
> + Pool:
> + Each connection allocates a piece of shmem-backed memory that is used
> + to receive messages and answers to ioctl commands from the kernel. It is
> + never used to send anything to the kernel. In order to access that memory,
> + userspace must mmap() it into its address space.
s/userspace/a userspace application/
> + See section 12 for more details.
> +
> + Well-known Name:
> + A connection can, in addition to its implicit unique connection id, request
> + the ownership of a textual well-known name. Well-known names are noted in
> + reverse-domain notation, such as com.example.service1. Connections offering
> + a service on a bus are usually reached by its well-known name. The analogy
Noun/pronoun number disagreement at "Connections... its".
==>
A connection that offers a service on a bus is usually reached by its
well-known name.
> + of connection id and well-known name is an IP address and a DNS name
Doing s/id/ID/ throughout the doc would help readability. (The doc
already uses "ID" in some places, so consistency is a further argument
for this change.)
> + associated with that address.
> +
> + Message:
> + Connections can exchange messages with other connections by addressing
> + the peers with their connection id or well-known name. A message consists
> + of a message header with kernel-specific information on how to route the
What does "kernel-specific" mean here? Something needs explaining (or removing).
> + message, and the message payload, which is a logical byte stream of
> + arbitrary size. Messages can carry additional file descriptors to be passed
So, this is FD passing like UNIX domain sockets? If yes, it would be helpful
here to mention that analogy.
> + from one connection to another. Every connection can specify which set of
> + metadata the kernel should attach to the message when it is delivered
> + to the receiving connection. Metadata contains information like: system
> + timestamps, uid, gid, tid, proc-starttime, well-known-names, process comm,
s/well-known-names/well-known names/
See the note on "ID" above. I think UID, GID, TID throughout would help
readability.
> + process exe, process argv, cgroup, capabilities, seclabel, audit session,
> + loginuid and the connection's human-readable name.
> + See section 7 and 13 for more details.
> +
> + Item:
> + The API of kdbus implements a notion of items, submitted through and
s/a notion/the notion/
> + returned by most ioctls, and stored inside data structures in the
> + connection's pool. See section 4 for more details.
> +
> + Broadcast and Match:
> + Broadcast messages are potentially sent to all connections of a bus. By
> + default, the connections will not actually receive any of the sent
> + broadcast messages; only after installing a match for specific message
> + properties, a broadcast message passes this filter.
"Filter" suddenly gets mentioned here without previously being defined. I suspect
the last piece should read more like:
a connection will receive a broadcast message only after it installs a filter
that matches the specific message properties of the broadcast message.
> + See section 10 for more details.
> +
> + Policy:
> + A policy is a set of rules that define which connections can see, talk to,
> + or register a well-know name on the bus. A policy is attached to buses and
s/know/know/
> + custom endpoints, and modified by policy holder connections or owners of
> + custom endpoints. See section 11 for more details.
> + See section 11 for more details.
Repeated last sentence.
> + Privileged bus users:
> + A user connecting to the bus is considered privileged if it is either the
> + creator of the bus, or if it has the CAP_IPC_OWNER capability flag set.
> +
> +
> +2. Control Files Layout
> +===============================================================================
> +
> +The kdbus interface is exposed through files in its kdbusfs mount point
> +(defaults to /sys/fs/kdbus):
> +
> + /sys/fs/kdbus (mount point of kdbusfs)
> + |-- control (domain control-file)
> + |-- 0-system (bus of user uid=0)
> + | |-- bus (default endpoint of bus '0-system')
> + | `-- ep.apache (custom endpoint of bus '0-system')
> + |-- 1000-user (bus of user uid=1000)
> + | `-- bus (default endpoint of bus '1000-user')
> + `-- 2702-user (bus of user uid=2702)
> + |-- bus (default endpoint of bus '2702-user')
> + `-- ep.app (custom endpoint of bus '2702-user')
> +
> +
> +3. Data Structures and flags
> +===============================================================================
> +
> +3.1 Data structures and interconnections
> +----------------------------------------
> +
> + +--------------------------------------------------------------------------+
> + | Domain (Mount Point) |
> + | /sys/fs/kdbus/control |
> + | +----------------------------------------------------------------------+ |
> + | | Bus (System Bus) | |
> + | | /sys/fs/kdbus/0-system/ | |
> + | | +-------------------------------+ +--------------------------------+ | |
> + | | | Endpoint | | Endpoint | | |
> + | | | /sys/fs/kdbus/0-system/bus | | /sys/fs/kdbus/0-system/ep.app | | |
> + | | +-------------------------------+ +--------------------------------+ | |
> + | | +--------------+ +--------------+ +--------------+ +---------------+ | |
> + | | | Connection | | Connection | | Connection | | Connection | | |
> + | | | :1.22 | | :1.25 | | :1.55 | | :1.81 | | |
> + | | +--------------+ +--------------+ +--------------+ +---------------+ | |
> + | +----------------------------------------------------------------------+ |
> + | |
> + | +----------------------------------------------------------------------+ |
> + | | Bus (User Bus for UID 2702) | |
> + | | /sys/fs/kdbus/2702-user/ | |
> + | | +-------------------------------+ +--------------------------------+ | |
> + | | | Endpoint | | Endpoint | | |
> + | | | /sys/fs/kdbus/2702-user/bus | | /sys/fs/kdbus/2702-user/ep.app | | |
> + | | +-------------------------------+ +--------------------------------+ | |
> + | | +--------------+ +--------------+ +--------------+ +---------------+ | |
> + | | | Connection | | Connection | | Connection | | Connection | | |
> + | | | :1.22 | | :1.25 | | :1.55 | | :1.81 | | |
> + | | +--------------+ +--------------+ +--------------------------------+ | |
> + | +----------------------------------------------------------------------+ |
> + +--------------------------------------------------------------------------+
> +
> +The above description uses the D-Bus notation of unique connection names that
> +adds a ":1." prefix to the connection's unique ID. kdbus itself doesn't
> +use that notation, neither internally nor externally. However, libraries and
> +other userspace code that aims for compatibility to D-Bus might.
s/compatibility to/compatibility with/
> +
> +3.2 Flags
> +---------
> +
> +All ioctls used in the communication with the driver contain three 64-bit
> +fields: 'flags', 'kernel_flags' and 'return_flags'. All of them are specific
> +to the ioctl used.
> +
> +In 'flags', the behavior of the command can be tweaked. All bits that are not
> +recognized by the kernel in this field are rejected, and the ioctl fails with
> +-EINVAL.
> +
> +In 'kernel_flags', the kernel driver writes back the mask of supported bits
> +upon each call, and sets the KDBUS_FLAGS_KERNEL bit. This is a way to probe
> +possible kernel features and make userspace code forward and backward
> +compatible.
So, is "kernel_flags' a bounding superset of what the caller may specify
in 'flags'? If yes, please make that clearer.
> +
> +In 'return_flags', the kernel can return results of the command, in addition
> +to the actual return value. This is mostly to inform userspace about non-fatal
> +conditions that occurred during the execution of the command.
> +
> +
> +4. Items
> +===============================================================================
> +
> +To flexibly augment transport structures, data blobs of type struct kdbus_item
> +can be attached to the structs passed into the ioctls. Some ioctls make items
> +of certain types mandatory, others are optional. Unsupported items will cause
> +the ioctl to fail -EINVAL.
> +
> +The total size of an item is variable and is in some cases defined by the item
> +type. In other cases, they can be of arbitrary length (for instance, a string).
> +
> +Items are also used for information stored in a connection's pool, such as
> +received messages, name lists or requested connection or bus owner information.
> +
> +Whenever items are used as part of the kdbus kernel API, they are embedded in
> +structs that have an overall size of their own, so there can be multiple items
"have an overall size of their own" as hard to grok. I think you mean
... are embedded insides structs that themselves include a size field
containing the overall size of the structure. This allows multiple
items per ioctl.
> +per ioctl.
> +
> +The kernel expects all items to be aligned to 8-byte boundaries. Unaligned
> +items or such that are unsupported by the ioctl are rejected.
s/such/items/?
(Or otherwise replace "such" with whatever it actually means.)
> +A simple iterator in userspace would iterate over the items until the items
> +have reached the embedding structure's overall size. An example implementation
> +of such an iterator can be found in tools/testing/selftests/kdbus/kdbus-util.h.
> +
> +
> +5. Creation of new domains, buses and endpoints
> +===============================================================================
> +
> +
> +5.1 Domains
> +-----------
> +
> +A domain is a container of buses. Domains themselves do not provide any IPC
> +functionality. Their sole purpose is to manage buses allocated in their
> +domain. Each time kdbusfs is mounted, a new kdbus domain is created, with its
> +own 'control' file. The lifetime of the domain ends once the user has unmounted
> +the kdbusfs. If you mount kdbusfs multiple times, each will have its own kdbus
> +domain internally.
What does that last sentence mean? Somehow it needs to be reworded to better
convey whatever sense it is trying to convey.
> Operations performed on one domain do not affect any
> +other domain.
> +
> +The full kdbusfs hierarchy, any sub-directory, or file can be bind-mounted to
> +an external mount point and will remain fully functional. The kdbus domain and
> +any linked resources stay available until the original mount and all subsequent
> +bind-mounts have been unmounted.
> +
> +During creation, domains pin the user-namespace of the creator and use
> +it as controlling user-namespace for this domain. Any user accounting is done
s/as/as the/
> +relative to that user-namespace.
> +
> +Newly created kdbus domains do not have any bus pre-created. The only resource
> +available is a 'control' file, which is used to manage kdbus domains.
> +Currently, 'control' files are exclusively used to create buses via
s/exclusively used/used exclusively/
> +KDBUS_CMD_BUS_MAKE, but further ioctls might be added in the future.
> +
> +
> +5.2 Buses
> +---------
> +
> +A bus is a shared resource between connections to transmit messages. Each bus
==> A bus is a resource that is shared between connections in order to
transmit messages
> +is independent and operations on the bus will not have any effect on other
> +buses. A bus is a management entity, that controls the addresses of its
s/,//
> +connections, their policies and message transactions performed via this bus.
> +
> +Each bus is bound to the domain it was created on. It has a custom name that is
> +unique across all buses of a domain. In kdbusfs, a bus is presented as a
> +directory. No operations can be performed on the bus itself, instead you need
s/,/;/
> +to perform those on an endpoint associated with the bus. Endpoints are
s/those/operations/
> +accessible as files underneath the bus directory. A default endpoint called
> +"bus" is provided on each bus.
> +
> +Bus names may be chosen freely except for one restriction: the name
> +must be prefixed with the numeric UID of the creator and a dash. This
s/UID/effective UID/
(I assume it's the effective UID...)
> +is required to avoid namespace clashes between different users. When
> +creating a bus the name must be passed in properly formatted, or the
the name must be passed in properly formatted
==>
that name that is passed in must be properly formatted
> +kernel will refuse creation of the bus. Example: "1047-foobar" is an
> +OK name for a bus registered by a user with UID 1047. However,
s/OK/acceptable/
> +"1024-foobar" is not, and neither is "foobar".
> +The UID must be provided in the user-namespace of the parent domain.
> +
> +To create a new bus, you need to open the control file of a domain and run the
s/run/employ/ (One doesn't "run" a system call.)
> +KDBUS_CMD_BUS_MAKE ioctl. The control file descriptor that was used to issue
> +KDBUS_CMD_BUS_MAKE must not have been used for any other control-ioctl before
Maybe better:
have been used for any other control-ioctl before
==>
previously have been used for any other control-ioctl
> +and needs to be kept open for the entire life-time of the created bus. Closing
s/needs/must/ (just reads smoother)
> +it will immediately cleanup the entire bus and all its associated resources and
> +endpoints. Every control file descriptor can only be used to create a single
> +new bus; from that point on, it is not used for any further communication until
> +the final close().
> +
> +Each bus will generate a random, 128-bit UUID upon creation. It will be
/It/This UUID/
> +returned to creators of connections through kdbus_cmd_hello.id128 and can
> +be used by userspace to uniquely identify buses, even across different machines
> +or containers. The UUID will have its variant bits set to 'DCE', and denote
> +version 4 (random).
I find that last sentence rather difficult to grasp. I think more detail
needs to be added.
> +When creating buses, a variable list of items that must be passed in
> +the items array is expected otherwise bus creation will fail.
What does "a variable list of items that must be passed in
the items array" mean? Something needs fixing, I think.
> +
> +5.3 Endpoints
> +-------------
> +
> +Endpoints are entry points to a bus. By default, each bus has a default
> +endpoint called 'bus'. The bus owner has the ability to create custom
> +endpoints with specific names, permissions, and policy databases (see below).
> +An endpoint is presented as file underneath the directory of the parent bus.
> +
> +To create a custom endpoint, open the default endpoint ('bus') and use the
> +KDBUS_CMD_ENDPOINT_MAKE ioctl with "struct kdbus_cmd_make". Custom endpoints
> +always have a policy database that, by default, forbids any operation. You have
> +to explicitly install policy entries to allow any operation on this endpoint.
> +Once KDBUS_CMD_ENDPOINT_MAKE succeeded, this file descriptor will manage the
> +newly created endpoint resource. It cannot be used to manage further resources.
> +
> +Endpoint names may be chosen freely except for one restriction: the name
> +must be prefixed with the numeric UID of the creator and a dash. This
s/UID/effective UID/
> +is required to avoid namespace clashes between different users. When
> +creating an endpoint the name must be passed in properly formatted, or the
creating an endpoint the name must be passed in properly formatted
==>
creating an endpoint, the name that is passed in must be properly formatted
> +kernel will refuse creation of the endpoint. Example: "1047-foobar" is an
> +OK name for an endpoint registered by a user with UID 1047. However,
s/OK/acceptable/
> +"1024-foobar" is not, and neither is "foobar".
Because this text reads almost exactly as the bus text in 5.2, I did
a double take here. I suggest making the text more distinct in each case.
So, for example:
Example: "1047-my-endpoint" is an OK name for an endpoint registered
by a user with UID 1047. However, "1024-my-endpoint" is not, and
neither is "my-endpoint".
(And you could do similar in the bus text in section 5.2.)
> +The UID must be provided in the user-namespace of the parent domain.
> +
> +To create connections to a bus, you use KDBUS_CMD_HELLO. See section 6 for
> +details. Note that once KDBUS_CMD_HELLO succeeded, this file descriptor manages
Note that once KDBUS_CMD_HELLO succeeded,
==>
Note that after a successful KDBUS_CMD_HELLO,
> +the newly created connection resource. It cannot be used to manage further
> +resources.
> +
> +
> +5.4 Creating buses and endpoints
> +--------------------------------
> +
> +KDBUS_CMD_BUS_MAKE, and KDBUS_CMD_ENDPOINT_MAKE take a
s/,//
> +struct kdbus_cmd_make argument.
> +
> +struct kdbus_cmd_make {
> + __u64 size;
> + The overall size of the struct, including its items.
> +
> + __u64 flags;
> + The flags for creation.
> +
> + KDBUS_MAKE_ACCESS_GROUP
> + Make the bus or endpoint file group-accessible
> +
> + KDBUS_MAKE_ACCESS_WORLD
> + Make the bus or endpoint file world-accessible
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + __u64 return_flags;
> + Flags returned by the kernel. Currently unused.
And, so presumably always returned as 0? Best to note that.
> +
> + struct kdbus_item items[0];
> + A list of items that has specific meanings for KDBUS_CMD_BUS_MAKE
s/has/have/
> + and KDBUS_CMD_ENDPOINT_MAKE (see above).
> +
> + Following items are expected for KDBUS_CMD_BUS_MAKE:
> + KDBUS_ITEM_MAKE_NAME
> + Contains a string to identify the bus name.
So, up to here, I've seen no definition of 'kdbus_item', which leaves me
asking questions like: what subfield is KDBUS_ITEM_MAKE_NAME stored in?
which subfield holds the pointer to the string?
Somewhere earlier, kdbus_item needs to be exaplained in more detail,
I think.
> +
> + KDBUS_ITEM_BLOOM_PARAMETER
> + Bus-wide bloom parameters passed in a dbus_bloom_parameter struct
> +
> + KDBUS_ITEM_ATTACH_FLAGS_RECV
> + An optional item that contains a set of required attach flags
> + that connections must allow. This item is used as a negotiation
> + measure during connection creation. If connections do not satisfy
> + the bus requirements, they are not allowed on the bus.
> + If not set, the bus does not require any metadata to be attached,
s/,/;/
> + in this case connections are free to set their own attach flags.
> +
> + KDBUS_ITEM_ATTACH_FLAGS_SEND
> + An optional item that contains a set of attach flags that are
> + returned to connections when they query the bus creator metadata.
> + If not set, no metadata is returned.
> +
> + Unrecognized items are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +
> +6. Connections
> +===============================================================================
> +
> +
> +6.1 Connection IDs and well-known connection names
> +--------------------------------------------------
> +
> +Connections are identified by their connection id, internally implemented as a
> +uint64_t counter. The IDs of every newly created bus start at 1, and every new
> +connection will increment the counter by 1. The ids are not reused.
Again, please change "ids" to IDs" throughout.
> +
> +In higher level tools, the user visible representation of a connection is
> +defined by the D-Bus protocol specification as ":1.<id>".
> +
> +Messages with a specific uint64_t destination id are directly delivered to
> +the connection with the corresponding id. Messages with the special destination
> +id KDBUS_DST_ID_BROADCAST are broadcast messages and are potentially delivered
> +to all known connections on the bus; clients interested in broadcast messages
> +need to subscribe to the specific messages they are interested, though before
> +any broadcast message reaches them.
The piece following the semicolon would be better as this separate sentence,
I think:
However, in order to receive any broadcast messages, clients must
to subscribe to the specific messages in which they are interested.
> +
> +Messages synthesized and sent directly by the kernel will carry the special
> +source id KDBUS_SRC_ID_KERNEL (0).
> +
> +In addition to the unique uint64_t connection id, established connections can
> +request the ownership of well-known names, under which they can be found and
> +addressed by other bus clients. A well-known name is associated with one and
> +only one connection at a time. See section 8 on name acquisition and the
> +name registry, and the validity of names.
> +
> +Messages can specify the special destination id 0 and carry a well-known name
> +in the message data. Such a message is delivered to the destination connection
> +which owns that well-known name.
> +
> + +-------------------------------------------------------------------------+
> + | +---------------+ +---------------------------+ |
> + | | Connection | | Message | -----------------+ |
> + | | :1.22 | --> | src: 22 | | |
> + | | | | dst: 25 | | |
> + | | | | | | |
> + | | | | | | |
> + | | | +---------------------------+ | |
> + | | | | |
> + | | | <--------------------------------------+ | |
> + | +---------------+ | | |
> + | | | |
> + | +---------------+ +---------------------------+ | | |
> + | | Connection | | Message | -----+ | |
> + | | :1.25 | --> | src: 25 | | |
> + | | | | dst: 0xffffffffffffffff | -------------+ | |
> + | | | | (KDBUS_DST_ID_BROADCAST) | | | |
> + | | | | | ---------+ | | |
> + | | | +---------------------------+ | | | |
> + | | | | | | |
> + | | | <--------------------------------------------------+ |
> + | +---------------+ | | |
> + | | | |
> + | +---------------+ +---------------------------+ | | |
> + | | Connection | | Message | --+ | | |
> + | | :1.55 | --> | src: 55 | | | | |
> + | | | | dst: 0 / org.foo.bar | | | | |
> + | | | | | | | | |
> + | | | | | | | | |
> + | | | +---------------------------+ | | | |
> + | | | | | | |
> + | | | <------------------------------------------+ | |
> + | +---------------+ | | |
> + | | | |
> + | +---------------+ | | |
> + | | Connection | | | |
> + | | :1.81 | | | |
> + | | org.foo.bar | | | |
> + | | | | | |
> + | | | | | |
> + | | | <-----------------------------------+ | |
> + | | | | |
> + | | | <----------------------------------------------+ |
> + | +---------------+ |
> + +-------------------------------------------------------------------------+
> +
> +
> +6.2 Creating connections
> +------------------------
> +
> +A connection to a bus is created by opening an endpoint file of a bus and
> +becoming an active client with the KDBUS_CMD_HELLO ioctl. Every connected client
> +connection has a unique identifier on the bus and can address messages to every
> +other connection on the same bus by using the peer's connection id as the
> +destination.
> +
> +The KDBUS_CMD_HELLO ioctl takes the following struct as argument.
> +
> +struct kdbus_cmd_hello {
> + __u64 size;
> + The overall size of the struct, including all attached items.
> +
> + __u64 flags;
> + Flags to apply to this connection:
> +
> + KDBUS_HELLO_ACCEPT_FD
> + When this flag is set, the connection can be sent file descriptors
> + as message payload. If it's not set, any attempt of doing so will
s/any attempt of doing so/an attempt to send file descriptors/
> + result in -ECOMM on the sender's side.
> +
> + KDBUS_HELLO_ACTIVATOR
> + Make this connection an activator (see below). With this bit set,
> + an item of type KDBUS_ITEM_NAME has to be attached which describes
s/attached which describes/attached. This item describes/
> + the well-known name this connection should be an activator for.
> +
> + KDBUS_HELLO_POLICY_HOLDER
> + Make this connection a policy holder (see below). With this bit set,
> + an item of type KDBUS_ITEM_NAME has to be attached which describes
s/attached which describes/attached. This item describes/
> + the well-known name this connection should hold a policy for.
> +
> + KDBUS_HELLO_MONITOR
> + Make this connection an eaves-dropping connection. See section 6.8 for
s/eaves-dropping/eavesdropping/
> + more information.
> +
> +To also receive broadcast messages,
^
Indentation error.
> + the connection has to upload appropriate matches as well.
> + This flag is only valid for privileged bus connections.
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + __u64 return_flags;
> + Flags returned by the kernel. Currently unused.
And, so presumably always returned as 0? Best to note that.
> +
> + __u64 attach_flags_send;
> + Set the bits for metadata this connection permits to be sent to the
> + receiving peer. Only metadata items that are both allowed to be sent by
> + the sender and that are requested by the receiver will effectively be
> + attached to the message eventually. Note, however, that the bus may
What does "eventually" mean here?
> + optionally enforce some of those bits to be set. If the match fails,
s/enforce/require/ ?
> + -ECONNREFUSED will be returned. In either case, this field will be set
> + to the mask of metadata items that are enforced by the bus. The
> + KDBUS_FLAGS_KERNEL bit will as well be set.
> +
> + __u64 attach_flags_recv;
> + Request the attachment of metadata for each message received by this
> + connection. The metadata actually attached may actually augment the list
Seems like two "actually" in the previous line is one too many.
> + of requested items. See section 13 for more details.
> +
> + __u64 bus_flags;
> + Upon successful completion of the ioctl, this member will contain the
> + flags of the bus it connected to.
> +
> + __u64 id;
> + Upon successful completion of the ioctl, this member will contain the
> + id of the new connection.
> +
> + __u64 pool_size;
> + The size of the communication pool, in bytes. The pool can be accessed
> + by calling mmap() on the file descriptor that was used to issue the
> + KDBUS_CMD_HELLO ioctl.
> +
> + __u64 offset;
> + The kernel will return the offset in the pool where returned details
> + will be stored.
> +
> + __u8 id128[16];
> + Upon successful completion of the ioctl, this member will contain the
> + 128 bit wide UUID of the connected bus.
s/128 bit wide/128-bit/
> +
> + struct kdbus_item items[0];
> + Variable list of items to add optional additional information. The
s/to add optional additional/containing optional additional/
> + following items are currently expected/valid:
> +
> + KDBUS_ITEM_CONN_DESCRIPTION
> + Contains a string to describes this connection's name, so it can be
s/to/that/
> + identified later.
> +
> + KDBUS_ITEM_NAME
> + KDBUS_ITEM_POLICY_ACCESS
> + For activators and policy holders only, combinations of these two
> + items describe policy access entries (see section about policy).
> +
> + KDBUS_ITEM_CREDS
> + KDBUS_ITEM_PIDS
> + KDBUS_ITEM_SECLABEL
> + Privileged bus users may submit these types in order to create
> + connections with faked credentials. This information will be returned
> + when peer information is queried by KDBUS_CMD_CONN_INFO. See section
> + 13 for more information.
> +
> + Items of other types are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +At the offset returned in the 'offset' field of struct kdbus_cmd_hello, the
> +kernel will store items of the following types:
> +
> + KDBUS_ITEM_BLOOM_PARAMETER
> + Bloom filter parameter as defined by the bus creator (see below).
> +
> +The offset in the pool has to be freed with the KDBUS_CMD_FREE ioctl.
As far as I can tell, KDBUS_CMD_FREE detailed anywhere in this file. It
needs a detailed description soemwhere.
> +
> +6.3 Activator and policy holder connection
> +------------------------------------------
> +
> +An activator connection is a placeholder for a well-known name. Messages sent
> +to such a connection can be used by userspace to start an implementer
> +connection, which will then get all the messages from the activator copied
> +over. An activator connection cannot be used to send any message.
> +
> +A policy holder connection only installs a policy for one or more names.
> +These policy entries are kept active as long as the connection is alive, and
> +are removed once it terminates. Such a policy connection type can be used to
> +deploy restrictions for names that are not yet active on the bus. A policy
> +holder connection cannot be used to send any message.
> +
> +The creation of activator, policy holder or monitor connections is an operation
What is a "monitor connection"? That term springs up unannounced. Is it
an "eavesdropping connection" as described above? Either define the term
"monitor connection" or use consistent terminology. (Actually, further down
in the document, it is clarified that "monitor connection" == "eavesdropper".
But that is not clear at THIS point in the document. It needs to be clearer.)
> +restricted to privileged users on the bus (see section "Terminology").
> +
> +
> +6.4 Retrieving information on a connection
> +------------------------------------------
> +
> +The KDBUS_CMD_CONN_INFO ioctl can be used to retrieve credentials and
> +properties of the initial creator of a connection. This ioctl uses the
> +following struct:
> +
> +struct kdbus_cmd_info {
> + __u64 size;
> + The overall size of the struct, including the name with its 0-byte string
> + terminator.
"0-byte string terminator" reads strangely. I assume you mean "terminating
null byte" / "null-terminated string" Best to use those more standard terms.
So, maybe:
including the name with its terminating null byte
or:
including the null-terminated 'name' string
There are multiple other instances of 0-byte in the doc, and I think they
should also be fixed in a similar fashion.
> +
> + __u64 flags;
> + Specify which metadata items should be attached to the answer.
s/Specify/Specifies/
> + See section 13 for more details.
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + __u64 return_flags;
> + Flags returned by the kernel. Currently unused.
And, so presumably always returned as 0? Best to note that.
> +
> + __u64 id;
> + The connection's numerical ID to retrieve information for. If set to
Hard to parse sentence.
==> The numerical ID of the connection for which information is to
be retrieved.
> + non-zero value, the 'name' field is ignored.
s/non-zero value/a non-zero value/
> +
> + __u64 offset;
> + When the ioctl returns, this value will yield the offset of the connection
s/value will yield/field will contain/
> + information inside the caller's pool.
> +
> + __u64 info_size;
> + The kernel will return the size of the returned information, so applications
> + can optionally mmap specific parts of the pool.
> +
> + struct kdbus_item items[0];
> + The optional item list, containing the well-known name to look up as
> + a KDBUS_ITEM_OWNED_NAME. Only required if the 'id' field is set to 0.
> + Items of other types are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +After the ioctl returns, the following struct will be stored in the caller's
> +pool at 'offset'.
> +
> +struct kdbus_info {
> + __u64 size;
> + The overall size of the struct, including all its items.
> +
> + __u64 id;
> + The connection's unique ID.
> +
> + __u64 flags;
> + The connection's flags as specified when it was created.
> +
> + struct kdbus_item items[0];
> + Depending on the 'flags' field in struct kdbus_cmd_info, items of
> + types KDBUS_ITEM_OWNED_NAME and KDBUS_ITEM_CONN_DESCRIPTION are followed
s/are followed/follow/
> + here.
> +};
> +
> +Once the caller is finished with parsing the return buffer, it needs to call
> +KDBUS_CMD_FREE for the offset.
> +
> +
> +6.5 Getting information about a connection's bus creator
> +--------------------------------------------------------
> +
> +The KDBUS_CMD_BUS_CREATOR_INFO ioctl takes the same struct as
> +KDBUS_CMD_CONN_INFO but is used to retrieve information about the creator of
> +the bus the connection is attached to. The metadata returned by this call is
> +collected during the creation of the bus and is never altered afterwards, so
> +it provides pristine information on the task that created the bus, at the
> +moment when it did so.
> +
> +In response to this call, a slice in the connection's pool is allocated and
> +filled with an object of type struct kdbus_info, pointed to by the ioctl's
> +'offset' field.
> +
> +struct kdbus_info {
> + __u64 size;
> + The overall size of the struct, including all its items.
> +
> + __u64 id;
> + The bus ID
> +
> + __u64 flags;
> + The bus flags as specified when it was created.
s/it/the bus/
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + struct kdbus_item items[0];
> + Metadata information is stored in items here. The item list contains
> + a KDBUS_ITEM_MAKE_NAME item that indicates the bus name of the
> + calling connection.
> +};
> +
> +Once the caller is finished with parsing the return buffer, it needs to call
> +KDBUS_CMD_FREE for the offset.
> +
> +
> +6.6 Updating connection details
> +-------------------------------
> +
> +Some of a connection's details can be updated with the KDBUS_CMD_CONN_UPDATE
> +ioctl, using the file descriptor that was used to create the connection.
> +The update command uses the following struct.
> +
> +struct kdbus_cmd_update {
> + __u64 size;
> + The overall size of the struct, including all its items.
> +
> + __u64 flags;
> + Currently no flags are supported. Reserved for future use.
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + __u64 return_flags;
> + Flags returned by the kernel. Currently unused.
And, so presumably always returned as 0? Best to note that.
> +
> + struct kdbus_item items[0];
> + Items to describe the connection details to be updated. The following item
> + types are supported:
> +
> + KDBUS_ITEM_ATTACH_FLAGS_SEND
> + Supply a new set of items that this connection permits to be sent along
> + with messages.
> +
> + KDBUS_ITEM_ATTACH_FLAGS_RECV
> + Supply a new set of items to be attached to each message.
> +
> + KDBUS_ITEM_NAME
> + KDBUS_ITEM_POLICY_ACCESS
> + Policy holder connections may supply a new set of policy information
> + with these items. For other connection types, -EOPNOTSUPP is returned.
> +
> + Items of other types are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +
> +6.7 Termination
> +---------------
> +
> +A connection can be terminated by simply closing its file descriptor. All
> +pending incoming messages will be discarded, and the memory in the pool will
> +be freed.
> +
> +An alternative way of closing down a connection is calling the KDBUS_CMD_BYEBYE
> +ioctl on it, which will only succeed if the message queue of the connection is
> +empty at the time of closing, otherwise, -EBUSY is returned.
The preceding is a little hard to parse. I suggest:
An alternative way of closing down a connection is via the KDBUS_CMD_BYEBYE
ioctl. This ioctly will succeed only if the message queue of the connection is
empty at the time of closing; otherwise, -EBUSY is returned.
> +
> +When this ioctl returns successfully, the connection has been terminated and
> +won't accept any new messages from remote peers. This way, a connection can
> +be terminated race-free, without losing any messages.
> +
> +
> +6.8 Monitor connections ('eavesdropper')
> +----------------------------------------
> +
> +Eavesdropping connections are created by setting the KDBUS_HELLO_MONITOR flag
> +in struct kdbus_hello.flags. Such connections have all properties of any other,
> +regular connection, except for the following details:
> +
> + * They will get every message sent over the bus, both unicasts and broadcasts
> +
> + * Installing matches for broadcast messages is neither necessary nor allowed
> +
> + * They cannot send messages or be directly addressed as receiver
> +
> + * Their creation and destruction will not cause KDBUS_ITEM_ID_{ADD,REMOVE}
> + notifications to be generated, so other connections cannot detect the
> + presence of an eavesdropper.
> +
> +
> +7. Messages
> +===============================================================================
> +
> +Messages consist of a fixed-size header followed directly by a list of
> +variable-sized data 'items'. The overall message size is specified in the
> +header of the message. The chain of data items can contain well-defined
> +message metadata fields, raw data, references to data, or file descriptors.
> +
> +
> +7.1 Sending messages
> +--------------------
> +
> +Messages are passed to the kernel with the KDBUS_CMD_SEND ioctl. Depending
> +on the destination address of the message, the kernel delivers the message to
> +the specific destination connection or to all connections on the same bus.
> +Sending messages across buses is not possible. Messages are always queued in
> +the memory pool of the destination connection (see below).
> +
> +The KDBUS_CMD_SEND ioctl uses struct kdbus_cmd_send to describe the message
> +transfer.
> +
> +struct kdbus_cmd_send {
> + __u64 size;
> + The overall size of the struct, including the attached items.
> +
> + __u64 flags;
> + Flags for message delivery:
> +
> + KDBUS_SEND_SYNC_REPLY
> + By default, all calls to kdbus are considered asynchronous,
> + non-blocking. However, as there are many use cases that need to wait
> + for a remote peer to answer a method call, there's a way to send a
> + message and wait for a reply in a synchronous fashion. This is what
> + the KDBUS_MSG_SYNC_REPLY controls. The KDBUS_CMD_SEND ioctl will block
> + until the reply has arrived, the timeout limit is reached, in case the
> + remote connection was shut down, or if interrupted by a signal before
> + any reply; see signal(7).
> +
> + The offset of the reply message in the sender's pool is stored in in
> + 'offset_reply' when the ioctl has returned without error. Hence, there
> + is no need for another KDBUS_CMD_RECV ioctl or anything else to receive
> + the reply.
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call of
> + KDBUS_CMD_SEND.
> +
> + __u64 kernel_msg_flags;
> + Valid bits for message flags, returned by the kernel upon each call of
> + KDBUS_CMD_SEND.
> +
> + __u64 return_flags;
> + Kernel-provided flags, returning non-fatal errors that occurred during
> + send. Currently unused.
And, so presumably always returned as 0? Best to note that.
> +
> + __u64 msg_address;
> + Userspace has to provide a pointer to a message (struct kdbus_msg) to send.
> +
> + struct kdbus_msg_info reply;
> + Only used for synchronous replies. See description of struct kdbus_cmd_recv
> + for more details.
> +
> + struct kdbus_item items[0];
> + The following items are currently recognized:
> +
> + KDBUS_ITEM_CANCEL_FD
> + When this optional item is passed in, and the call is executed as SYNC
> + call, the passed in file descriptor can be used as alternative
> + cancellation point. The kernel will call poll() on this file descriptor,
> + and if it reports any incoming bytes, the blocking send operation will
> + be canceled, and the call will return -ECANCELED. Any type of file
> + descriptor that implements poll() can be used as payload to this item.
> + For asynchronous message sending, this item is accepted but ignored.
> +
> + All other items are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +The message referenced by 'msg_address' above has the following layout.
> +
> +struct kdbus_msg {
> + __u64 size;
> + The overall size of the struct, including the attached items.
> +
> + __u64 flags;
> + KDBUS_MSG_EXPECT_REPLY
> + Expect a reply from the remote peer to this message. With this bit set,
s/from the remote peer to this message/to this message from the remote peer/
> + the timeout_ns field must be set to a non-zero number of nanoseconds in
> + which the receiving peer is expected to reply. If such a reply is not
> + received in time, the sender will be notified with a timeout message
> + (see below). The value must be an absolute value, in nanoseconds and
> + based on CLOCK_MONOTONIC.
Why is the option of a relative timeout not available?
> + For a message to be accepted as reply, it must be a direct message to
> + the original sender (not a broadcast), and its kdbus_msg.reply_cookie
> + must match the previous message's kdbus_msg.cookie.
> +
> + Expected replies also temporarily open the policy of the sending
> + connection, so the other peer is allowed to respond within the given
> + time window.
> +
> + KDBUS_MSG_NO_AUTO_START
> + By default, when a message is sent to an activator connection, the
> + activator notified and will start an implementer. This flag inhibits
> + that behavior. With this bit set, and the remote being an activator,
> + -EADDRNOTAVAIL is returned from the ioctl.
> +
> + __s64 priority;
> + The priority of this message. Receiving messages (see below) may
> + optionally be constrained to messages of a minimal priority. This
> + allows for use cases where timing critical data is interleaved with
> + control data on the same connection. If unused, the priority should be
> + set to zero.
> +
> + __u64 dst_id;
> + The numeric ID of the destination connection, or KDBUS_DST_ID_BROADCAST
> + (~0ULL) to address every peer on the bus, or KDBUS_DST_ID_NAME (0) to look
> + it up dynamically from the bus' name registry. In the latter case, an item
> + of type KDBUS_ITEM_DST_NAME is mandatory.
> +
> + __u64 src_id;
> + Upon return of the ioctl, this member will contain the sending
> + connection's numerical ID. Should be 0 at send time.
> +
> + __u64 payload_type;
> + Type of the payload in the actual data records. Currently, only
> + KDBUS_PAYLOAD_DBUS is accepted as input value of this field. When
> + receiving messages that are generated by the kernel (notifications),
> + this field will yield KDBUS_PAYLOAD_KERNEL.
s/yield/cointain/ ?
> +
> + __u64 cookie;
> + Cookie of this message, for later recognition. Also, when replying
> + to a message (see above), the cookie_reply field must match this value.
> +
> + __u64 timeout_ns;
> + If the message sent requires a reply from the remote peer (see above),
> + this field contains the timeout in absolute nanoseconds based on
> + CLOCK_MONOTONIC.
> +
> + __u64 cookie_reply;
> + If the message sent is a reply to another message, this field must
> + match the cookie of the formerly received message.
> +
> + struct kdbus_item items[0];
> + A dynamically sized list of items to contain additional information.
> + The following items are expected/valid:
> +
> + KDBUS_ITEM_PAYLOAD_VEC
> + KDBUS_ITEM_PAYLOAD_MEMFD
> + KDBUS_ITEM_FDS
> + Actual data records containing the payload. See section "Passing of
> + Payload Data".
> +
> + KDBUS_ITEM_BLOOM_FILTER
> + Bloom filter for matches (see below).
> +
> + KDBUS_ITEM_DST_NAME
> + Well-known name to send this message to. Required if dst_id is set
> + to KDBUS_DST_ID_NAME. If a connection holding the given name can't
> + be found, -ESRCH is returned.
> + For messages to a unique name (ID), this item is optional. If present,
> + the kernel will make sure the name owner matches the given unique name.
> + This allows userspace tie the message sending to the condition that a
> + name is currently owned by a certain unique name.
> +};
> +
> +The message will be augmented by the requested metadata items when queued into
> +the receiver's pool. See also section 13.2 ("Metadata and namespaces").
> +
> +
> +7.2 Message layout
> +------------------
> +
> +The layout of a message is shown below.
> +
> + +-------------------------------------------------------------------------+
> + | Message |
> + | +---------------------------------------------------------------------+ |
> + | | Header | |
> + | | size: overall message size, including the data records | |
> + | | destination: connection id of the receiver | |
> + | | source: connection id of the sender (set by kernel) | |
> + | | payload_type: "DBusDBus" textual identifier stored as uint64_t | |
> + | +---------------------------------------------------------------------+ |
> + | +---------------------------------------------------------------------+ |
> + | | Data Record | |
> + | | size: overall record size (without padding) | |
> + | | type: type of data | |
> + | | data: reference to data (address or file descriptor) | |
> + | +---------------------------------------------------------------------+ |
> + | +---------------------------------------------------------------------+ |
> + | | padding bytes to the next 8 byte alignment | |
> + | +---------------------------------------------------------------------+ |
> + | +---------------------------------------------------------------------+ |
> + | | Data Record | |
> + | | size: overall record size (without padding) | |
> + | | ... | |
> + | +---------------------------------------------------------------------+ |
> + | +---------------------------------------------------------------------+ |
> + | | padding bytes to the next 8 byte alignment | |
> + | +---------------------------------------------------------------------+ |
> + | +---------------------------------------------------------------------+ |
> + | | Data Record | |
> + | | size: overall record size | |
> + | | ... | |
> + | +---------------------------------------------------------------------+ |
> + | ... further data records ... |
> + +-------------------------------------------------------------------------+
> +
> +
> +7.3 Passing of Payload Data
> +---------------------------
> +
> +When connecting to the bus, receivers request a memory pool of a given size,
> +large enough to carry all backlog of data enqueued for the connection. The
> +pool is internally backed by a shared memory file which can be mmap()ed by
> +the receiver.
> +
> +KDBUS_MSG_PAYLOAD_VEC:
> + Messages are directly copied by the sending process into the receiver's pool,
s/,/./ and then start a new sentence.
> + that way two peers can exchange data by effectively doing a single-copy from
> + one process to another, the kernel will not buffer the data anywhere else.
s/,/;/
> +
> +KDBUS_MSG_PAYLOAD_MEMFD:
> + Messages can reference memfd files which contain the data.
> + memfd files are tmpfs-backed files that allow sealing of the content of the
> + file, which prevents all writable access to the file content.
> + Only memfds that have (F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE|F_SEAL_SEAL) set
> + are accepted as payload data, which enforces reliable passing of data.
> + The receiver can assume that neither the sender nor anyone else can alter the
> + content after the message is sent.
> + Apart from the sender filling-in the content into memfd files, the data will
> + be passed as zero-copy from one process to another, read-only, shared between
> + the peers.
> +
> +The sender must not make any assumptions on the type how data is received by the
Wording error at "the type how". Some fix is needed.
> +remote peer. The kernel is free to re-pack multiple VEC and MEMFD payloads. For
> +instance, the kernel may decide to merge multiple VECs into a single VEC, inline
> +MEMFD payloads into memory or merge all passed VECs into a single MEMFD.
> +However, the kernel preserves the order of passed data. This means, the order of
s/,//
> +all VEC and MEMFD items is not changed in respect to each other.
s/in/with/
> +
> +In other words: All passed VEC and MEMFD data payloads are treated as a single
> +stream of data that may be received by the remote peer in a different set of
> +hunks than it was sent as.
> +
> +
> +7.4 Receiving messages
> +----------------------
> +
> +Messages are received by the client with the KDBUS_CMD_RECV ioctl. The endpoint
> +file of the bus supports poll() to wake up the receiving process when new
s%poll()%poll/select()/epoll% ?
> +messages are queued up to be received.
> +
> +With the KDBUS_CMD_RECV ioctl, a struct kdbus_cmd_recv is used.
> +
> +struct kdbus_cmd_recv {
> + __u64 size;
> + The overall size of the struct, including the attached items.
> +
> + __u64 flags;
> + Flags to control the receive command.
> +
> + KDBUS_RECV_PEEK
> + Just return the location of the next message. Do not install file
> + descriptors or anything else. This is usually used to determine the
> + sender of the next queued message.
> +
> + KDBUS_RECV_DROP
> + Drop the next message without doing anything else with it, and free the
> + pool slice. This a short-cut for KDBUS_RECV_PEEK and KDBUS_CMD_FREE.
> +
> + KDBUS_RECV_USE_PRIORITY
> + Use the priority field (see below).
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + __u64 return_flags;
> + Kernel-provided flags, returning non-fatal errors that occurred during
> + send. Currently unused.
And, so presumably always returned as 0? Best to note that.
> +
> + __s64 priority;
> + With KDBUS_RECV_USE_PRIORITY set in flags, receive the next message in
> + the queue with at least the given priority. If no such message is waiting
> + in the queue, -ENOMSG is returned.
###
How do I simply select the highest priority message, without knowing what
its priority is?
> +
> + __u64 dropped_msgs;
> + If the CMD_RECV ioctl fails with EOVERFLOW, this field is filled by
> + the kernel with the number of messages that couldn't be transmitted to
> + this connection. In that case, the @offset member must not be accessed.
> +
> + struct kdbus_msg_info msg;
> + Embedded struct to be filled when the command succeeded (see below).
> +
> + struct kdbus_item items[0];
> + Items to specify further details for the receive command. Currently unused.
> +};
> +
> +Both 'struct kdbus_cmd_recv' and 'struct kdbus_cmd_send' embed 'struct
> +kdbus_msg_info'. For the SEND ioctl, it is used to catch synchronous replies,
> +if one was requested, and is unused otherwise.
> +
> +struct kdbus_msg_info {
> + __u64 offset;
> + Upon return of the ioctl, this field contains the offset in the receiver's
> + memory pool. The memory must be freed with KDBUS_CMD_FREE.
> +
> + __u64 msg_size;
> + Upon successful return of the ioctl, this field contains the size of the
> + allocated slice at offset @offset. It is the combination of the size of
> + the stored kdbus_msg object plus all appended VECs. You can use it in
> + combination with @offset to map a single message, instead of mapping the
> + whole pool.
> +
> + __u64 return_flags;
> + Kernel-provided return flags. Currently, the following flags are defined.
> +
> + KDBUS_RECV_RETURN_INCOMPLETE_FDS
> + The message contained file descriptors which couldn't be installed
> + into the receiver's task. Most probably that happened because the
> + maximum number of file descriptors for that task were exceeded.
> + The message is still delivered, so this is not a fatal condition.
> + File descriptors inside the KDBUS_ITEM_FDS item that could not be
> + installed will be set to -1.
> +};
> +
> +Unless KDBUS_RECV_DROP was passed, and given that the ioctl succeeded, the
s/and given that the ioctl succeeded/after a successful KDBUS_CMD_RECV ioctl/
> +offset field contains the location of the new message inside the receiver's
> +pool. The message is stored as struct kdbus_msg at this offset, and can be
> +interpreted with the semantics described above.
> +
> +Also, if the connection allowed for file descriptor to be passed
> +(KDBUS_HELLO_ACCEPT_FD), and if the message contained any, they will be
> +installed into the receiving process after the KDBUS_CMD_RECV ioctl returns.
###
"after"??? When exactly?
> +The receiving task is obliged to close all of them appropriately. If
> +KDBUS_RECV_PEEK is set, no file descriptors are installed. This allows for
> +peeking at a message and dropping it via KDBUS_RECV_DROP, without installing
> +the passed file descriptors into the receiving process.
> +
> +The caller is obliged to call KDBUS_CMD_FREE with the returned offset when
> +the memory is no longer needed.
> +
> +
> +8. Name registry
> +===============================================================================
> +
> +Each bus instantiates a name registry to resolve well-known names into unique
> +connection IDs for message delivery. The registry will be queried when a
> +message is sent with kdbus_msg.dst_id set to KDBUS_DST_ID_NAME, or when a
> +registry dump is requested.
> +
> +All of the below is subject to policy rules for SEE and OWN permissions.
> +
> +
> +8.1 Name validity
> +-----------------
> +
> +A name has to comply to the following rules to be considered valid:
s/comply to/comply with/
> +
> + - The name has two or more elements separated by a period ('.') character
> + - All elements must contain at least one character
> + - Each element must only contain the ASCII characters "[A-Z][a-z][0-9]_"
> + and must not begin with a digit
> + - The name must contain at least one '.' (period) character
> + (and thus at least two elements)
> + - The name must not begin with a '.' (period) character
> + - The name must not exceed KDBUS_NAME_MAX_LEN (255)
> +
> +
> +8.2 Acquiring a name
> +--------------------
> +
> +To acquire a name, a client uses the KDBUS_CMD_NAME_ACQUIRE ioctl with the
> +following data structure.
> +
> +struct kdbus_cmd_name {
> + __u64 size;
> + The overall size of this struct, including the name with its 0-byte string
> + terminator.
> +
> + __u64 flags;
> + Flags to control details in the name acquisition.
> +
> + KDBUS_NAME_REPLACE_EXISTING
> + Acquiring a name that is already present usually fails, unless this flag
> + is set in the call, and KDBUS_NAME_ALLOW_REPLACEMENT or (see below) was
> + set when the current owner of the name acquired it, or if the current
> + owner is an activator connection (see below).
> +
> + KDBUS_NAME_ALLOW_REPLACEMENT
> + Allow other connections to take over this name. When this happens, the
> + former owner of the connection will be notified of the name loss.
> +
> + KDBUS_NAME_QUEUE (acquire)
> + A name that is already acquired by a connection, and which wasn't
> + requested with the KDBUS_NAME_ALLOW_REPLACEMENT flag set can not be
> + acquired again. However, a connection can put itself in a queue of
> + connections waiting for the name to be released. Once that happens, the
> + first connection in that queue becomes the new owner and is notified
> + accordingly.
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + __u64 return_flags;
> + Flags returned by the kernel. Currently unused.
And, so presumably always returned as 0? Best to note that.
> +
> + struct kdbus_item items[0];
> + Items to submit the name. Currently, one item of type KDBUS_ITEM_NAME is
> + expected and allowed, and the contained string must be a valid bus name.
> + Items of other types are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +
> +8.3 Releasing a name
> +--------------------
> +
> +A connection may release a name explicitly with the KDBUS_CMD_NAME_RELEASE
> +ioctl. If the connection was an implementer of an activatable name, its
> +pending messages are moved back to the activator. If there are any connections
> +queued up as waiters for the name, the oldest one of them will become the new
> +owner. The same happens implicitly for all names once a connection terminates.
> +
> +The KDBUS_CMD_NAME_RELEASE ioctl uses the same data structure as the
> +acquisition call, but with slightly different field usage.
> +
> +struct kdbus_cmd_name {
> + __u64 size;
> + The overall size of this struct, including the name with its 0-byte string
> + terminator.
> +
> + __u64 flags;
> + Flags to the command. Currently unused.
And, so presumably must be 0? Best to note that.
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + __u64 return_flags;
> + Flags returned by the kernel. Currently unused.
And, so presumably always returned as 0? Best to note that.
> + struct kdbus_item items[0];
> + Items to submit the name. Currently, one item of type KDBUS_ITEM_NAME is
> + expected and allowed, and the contained string must be a valid bus name.
> +};
> +
> +
> +8.4 Dumping the name registry
> +-----------------------------
> +
> +A connection may request a complete or filtered dump of currently active bus
> +names with the KDBUS_CMD_NAME_LIST ioctl, which takes a struct
> +kdbus_cmd_name_list as argument.
> +
> +struct kdbus_cmd_name_list {
> + __u64 flags;
> + Any combination of flags to specify which names should be dumped.
> +
> + KDBUS_NAME_LIST_UNIQUE
> + List the unique (numeric) IDs of the connection, whether it owns a name
> + or not.
> +
> + KDBUS_NAME_LIST_NAMES
> + List well-known names stored in the database which are actively owned by
> + a real connection (not an activator).
> +
> + KDBUS_NAME_LIST_ACTIVATORS
> + List names that are owned by an activator.
> +
> + KDBUS_NAME_LIST_QUEUED
> + List connections that are not yet owning a name but are waiting for it
> + to become available.
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + __u64 return_flags;
> + Flags returned by the kernel. Currently unused.
And, so presumably always returned as 0? Best to note that.
> +
> + __u64 offset;
> + When the ioctl returns successfully, the offset to the name registry dump
> + inside the connection's pool will be stored in this field.
> +};
> +
> +The returned list of names is stored in a struct kdbus_name_list that in turn
> +contains a dynamic number of struct kdbus_cmd_name that carry the actual
> +information. The fields inside that struct kdbus_cmd_name is described next.
> +
> +struct kdbus_name_info {
> + __u64 size;
> + The overall size of this struct, including the name with its 0-byte string
> + terminator.
> +
> + __u64 owner_id;
> + The owning connection's unique ID.
> +
> + __u64 conn_flags;
> + The flags of the owning connection.
> +
> + struct kdbus_item items[0];
> + Items containing the actual name. Currently, one item of type
> + KDBUS_ITEM_OWNED_NAME will be attached, including the name's flags. In that
> + item, the flags field of the name may carry the following bits:
> +
> + KDBUS_NAME_ALLOW_REPLACEMENT
> + Other connections are allowed to take over this name from the
> + connection that owns it.
> +
> + KDBUS_NAME_IN_QUEUE (list)
> + When retrieving a list of currently acquired name in the registry, this
> + flag indicates whether the connection actually owns the name or is
> + currently waiting for it to become available.
> +
> + KDBUS_NAME_ACTIVATOR (list)
> + An activator connection owns a name as a placeholder for an implementer,
> + which is started on demand as soon as the first message arrives. There's
> + some more information on this topic below. In contrast to
> + KDBUS_NAME_REPLACE_EXISTING, when a name is taken over from an activator
> + connection, all the messages that have been queued in the activator
> + connection will be moved over to the new owner. The activator connection
> + will still be tracked for the name and will take control again if the
> + implementer connection terminates.
> + This flag can not be used when acquiring a name, but is implicitly set
> + through KDBUS_CMD_HELLO with KDBUS_HELLO_ACTIVATOR set in
> + kdbus_cmd_hello.conn_flags.
> +};
> +
> +The returned buffer must be freed with the KDBUS_CMD_FREE ioctl when the user
> +is finished with it.
> +
> +
> +9. Notifications
> +===============================================================================
> +
> +The kernel will notify its users of the following events.
> +
> + * When connection A is terminated while connection B is waiting for a reply
> + from it, connection B is notified with a message with an item of type
> + KDBUS_ITEM_REPLY_DEAD.
> +
> + * When connection A does not receive a reply from connection B within the
> + specified timeout window, connection A will receive a message with an item
> + of type KDBUS_ITEM_REPLY_TIMEOUT.
> +
> + * When an ordinary connection (not a monitor) is created on or removed from
> + a bus, messages with an item of type KDBUS_ITEM_ID_ADD or
> + KDBUS_ITEM_ID_REMOVE, respectively, are sent to all bus members that match
> + these messages through their match database. Eavesdroppers (monitor
> + connections) do not cause such notifications to be sent. They are invisible
> + on the bus.
> +
> + * When a connection gains or loses ownership of a name, messages with an item
> + of type KDBUS_ITEM_NAME_ADD, KDBUS_ITEM_NAME_REMOVE or
> + KDBUS_ITEM_NAME_CHANGE are sent to all bus members that match these
> + messages through their match database.
> +
> +A kernel notification is a regular kdbus message with the following details.
> +
> + * kdbus_msg.src_id == KDBUS_SRC_ID_KERNEL
> + * kdbus_msg.dst_id == KDBUS_DST_ID_BROADCAST
> + * kdbus_msg.payload_type == KDBUS_PAYLOAD_KERNEL
> + * Has exactly one of the aforementioned items attached
> +
> +Kernel notifications have an item of type KDBUS_ITEM_TIMESTAMP attached.
> +
> +
> +10. Message Matching, Bloom filters
> +===============================================================================
> +
> +10.1 Matches for broadcast messages from other connections
> +----------------------------------------------------------
> +
> +A message addressed at the connection ID KDBUS_DST_ID_BROADCAST (~0ULL) is a
> +broadcast message, delivered to all connected peers which installed a rule to
> +match certain properties of the message. Without any rules installed in the
> +connection, no broadcast message or kernel-side notifications will be delivered
> +to the connection. Broadcast messages are subject to policy rules and TALK
> +access checks.
> +
> +See section 11 for details on policies, and section 11.5 for more
> +details on implicit policies.
> +
> +Matches for messages from other connections (not kernel notifications) are
> +implemented as bloom filters. The sender adds certain properties of the message
> +as elements to a bloom filter bit field, and sends that along with the
> +broadcast message.
> +
> +The connection adds the message properties it is interested as elements to a
> +bloom mask bit field, and uploads the mask to the match rules of the
> +connection.
> +
> +The kernel will match the broadcast message's bloom filter against the
> +connections bloom mask (simply by &-ing it), and decide whether the message
> +should be delivered to the connection.
> +
> +The kernel has no notion of any specific properties of the message, all it
> +sees are the bit fields of the bloom filter and mask to match against. The
> +use of bloom filters allows simple and efficient matching, without exposing
> +any message properties or internals to the kernel side. Clients need to deal
> +with the fact that they might receive broadcasts which they did not subscribe
> +to, as the bloom filter might allow false-positives to pass the filter.
> +
> +To allow the future extension of the set of elements in the bloom filter, the
> +filter specifies a "generation" number. A later generation must always contain
> +all elements of the set of the previous generation, but can add new elements
> +to the set. The match rules mask can carry an array with all previous
> +generations of masks individually stored. When the filter and mask are matched
> +by the kernel, the mask with the closest matching "generation" is selected
> +as the index into the mask array.
> +
> +
> +10.2 Matches for kernel notifications
> +------------------------------------
> +
> +To receive kernel generated notifications (see section 9), a connection must
> +install special match rules that are different from the bloom filter matches
> +described in the section above. They can be filtered by a sender connection's
> +ID, by one of the name the sender connection owns at the time of sending the
> +message, or by type of the notification (id/name add/remove/change).
s/type/the type/
> +
> +10.3 Adding a match
> +-------------------
> +
> +To add a match, the KDBUS_CMD_MATCH_ADD ioctl is used, which takes a struct
> +of the struct described below.
> +
> +Note that each of the items attached to this command will internally create
> +one match 'rule', and the collection of them, which is submitted as one block
> +via the ioctl is called a 'match'. To allow a message to pass, all rules of a
s/ioctl/ioctl,/
> +match have to be satisfied. Hence, adding more items to the command will only
> +narrow the possibility of a match to effectively let the message pass, and will
> +cause the connection's user space process to wake up less likely.
Make that last line
decrease the chance that the connection's userspace process wil be woken up
> +
> +Multiple matches can be installed per connection. As long as one of it has a
s/it/them/ ?
(If that change is not correct, then the sentence is quite confused.)
> +set of rules which allows the message to pass, this one will be decisive.
> +
> +struct kdbus_cmd_match {
> + __u64 size;
> + The overall size of the struct, including its items.
> +
> + __u64 cookie;
> + A cookie which identifies the match, so it can be referred to at removal
> + time.
> +
> + __u64 flags;
> + Flags to control the behavior of the ioctl.
> +
> + KDBUS_MATCH_REPLACE:
> + Remove all entries with the given cookie before installing the new one.
> + This allows for race-free replacement of matches.
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + __u64 return_flags;
> + Flags returned by the kernel. Currently unused.
And, so presumably always returned as 0? Best to note that.
> +
> + struct kdbus_item items[0];
> + Items to define the actual rules of the matches. The following item types
> + are expected. Each item will cause one new match rule to be created.
> +
> + KDBUS_ITEM_BLOOM_MASK
> + An item that carries the bloom filter mask to match against in its
> + data field. The payload size must match the bloom filter size that
> + was specified when the bus was created.
> + See section 10.4 for more information.
> +
> + KDBUS_ITEM_NAME
> + Specify a name that a sending connection must own at a time of sending
s/a time/the time/
> + a broadcast message in order to match this rule.
> +
> + KDBUS_ITEM_ID
> + Specify a sender connection's ID that will match this rule.
> +
> + KDBUS_ITEM_NAME_ADD
> + KDBUS_ITEM_NAME_REMOVE
> + KDBUS_ITEM_NAME_CHANGE
> + These items request delivery of broadcast messages that describe a name
> + acquisition, loss, or change. The details are stored in the item's
> + kdbus_notify_name_change member. All information specified must be
> + matched in order to make the message pass. Use KDBUS_MATCH_ID_ANY to
> + match against any unique connection ID.
> +
> + KDBUS_ITEM_ID_ADD
> + KDBUS_ITEM_ID_REMOVE
> + These items request delivery of broadcast messages that are generated
> + when a connection is created or terminated. struct kdbus_notify_id_change
> + is used to store the actual match information. This item can be used to
> + monitor one particular connection ID, or, when the id field is set to
> + KDBUS_MATCH_ID_ANY, all of them.
> +
> + Items of other types are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +
> +10.4 Bloom filters
> +------------------
> +
> +Bloom filters allow checking whether a given word is present in a dictionary.
> +This allows connections to set up a mask for information it is interested in,
> +and will be delivered signal messages that have a matching filter.
> +
> +For general information on bloom filters, see
> +
> + https://en.wikipedia.org/wiki/Bloom_filter
> +
> +The size of the bloom filter is defined per bus when it is created, in
> +kdbus_bloom_parameter.size. All bloom filters attached to signals on the bus
> +must match this size, and all bloom filter matches uploaded by connections must
> +also match the size, or a multiple thereof (see below).
> +
> +The calculation of the mask has to be done on the userspace side. The kernel
> +just checks the bitmasks to decide whether or not to let the message pass. All
> +bits in the mask must match the filter in and bit-wise AND logic, but the
Parse error at "in and bit-wise AND logic". I am not sure what you are meaning
there, buth something needs fixing.
> +mask may have more bits set than the filter. Consequently, false positive
> +matches are expected to happen, and userspace must deal with that fact.
> +
> +Masks are entities that are always passed to the kernel as part of a match
> +(with an item of type KDBUS_ITEM_BLOOM_MASK), and filters can be attached to
> +signals, with an item of type KDBUS_ITEM_BLOOM_FILTER.
> +
> +For a filter to match, all its bits have to be set in the match mask as well.
> +For example, consider a bus has a bloom size of 8 bytes, and the following
s/has/that has/
> +mask/filter combinations:
> +
> + filter 0x0101010101010101
> + mask 0x0101010101010101
> + -> matches
> +
> + filter 0x0303030303030303
> + mask 0x0101010101010101
> + -> doesn't match
> +
> + filter 0x0101010101010101
> + mask 0x0303030303030303
> + -> matches
> +
> +Hence, in order to catch all messages, a mask filled with 0xff bytes can be
> +installed as a wildcard match rule.
> +
> +Uploaded matches may contain multiple masks, each of which in the size of the
Parse error at "each of which in the size of"
s/in/is/?
> +bloom size defined by the bus. Each block of a mask is called a 'generation',
> +starting at index 0.
> +
> +At match time, when a signal is about to be delivered, a bloom mask generation
> +is passed, which denotes which of the bloom masks the filter should be matched
> +against. This allows userspace to provide backward compatible masks at upload
> +time, while older clients can still match against older versions of filters.
> +
> +
> +10.5 Removing a match
> +--------------------
> +
> +Matches can be removed through the KDBUS_CMD_MATCH_REMOVE ioctl, which again
> +takes struct kdbus_cmd_match as argument, but its fields are used slightly
> +differently.
> +
> +struct kdbus_cmd_match {
> + __u64 size;
> + The overall size of the struct. As it has no items in this use case, the
> + value should yield 16.
s/yield/contain/ ?
> +
> + __u64 cookie;
> + The cookie of the match, as it was passed when the match was added.
> + All matches that have this cookie will be removed.
> +
> + __u64 flags;
> + Unused for this use case,
> +
> + __u64 kernel_flags;
> + Valid flags for this command, returned by the kernel upon each call.
> +
> + __u64 return_flags;
> + Flags returned by the kernel. Currently unused.
And, so presumably always returned as 0? Best to note that.
> +
> + struct kdbus_item items[0];
> + Unused und not allowed for this use case.
> +};
> +
> +
> +11. Policy
> +===============================================================================
> +
> +A policy databases restrict the possibilities of connections to own, see and
> +talk to well-known names. It can be associated with a bus (through a policy
s/It/A policy/
> +holder connection) or a custom endpoint.
> +
> +See section 8.1 for more details on the validity of well-known names.
> +
> +Default endpoints of buses always have a policy database. The default
> +policy is to deny all operations except for operations that are covered by
> +implicit policies. Custom endpoints always have a policy, and by default,
> +a policy database is empty. Therefore, unless policy rules are added, all
s/a policy database/the policy database/
> +operations will also be denied by default.
> +
> +See section 11.5 for more details on implicit policies.
> +
> +A set of policy rules is described by a name and multiple access rules, defined
> +by the following struct.
> +
> +struct kdbus_policy_access {
> + __u64 type; /* USER, GROUP, WORLD */
> + One of the following.
> +
> + KDBUS_POLICY_ACCESS_USER
> + Grant access to a user with the uid stored in the 'id' field.
> +
> + KDBUS_POLICY_ACCESS_GROUP
> + Grant access to a user with the gid stored in the 'id' field.
> +
> + KDBUS_POLICY_ACCESS_WORLD
> + Grant access to everyone. The 'id' field is ignored.
> +
> + __u64 access; /* OWN, TALK, SEE */
> + The access to grant.
> +
> + KDBUS_POLICY_SEE
> + Allow the name to be seen.
> +
> + KDBUS_POLICY_TALK
> + Allow the name to be talked to.
> +
> + KDBUS_POLICY_OWN
> + Allow the name to be owned.
> +
> + __u64 id;
> + For KDBUS_POLICY_ACCESS_USER, stores the uid.
> + For KDBUS_POLICY_ACCESS_GROUP, stores the gid.
> +};
> +
> +Policies are set through KDBUS_CMD_HELLO (when creating a policy holder
> +connection), KDBUS_CMD_CONN_UPDATE (when updating a policy holder connection),
> +KDBUS_CMD_ENDPOINT_MAKE (creating a custom endpoint) or
> +KDBUS_CMD_ENDPOINT_UPDATE (updating a custom endpoint). In all cases, the name
> +and policy access information is stored in items of type KDBUS_ITEM_NAME and
> +KDBUS_ITEM_POLICY_ACCESS. For this transport, the following rules apply.
> +
> + * An item of type KDBUS_ITEM_NAME must be followed by at least one
> + KDBUS_ITEM_POLICY_ACCESS item
> + * An item of type KDBUS_ITEM_NAME can be followed by an arbitrary number of
> + KDBUS_ITEM_POLICY_ACCESS items
> + * An arbitrary number of groups of names and access levels can be passed
> +
> +uids and gids are internally always stored in the kernel's view of global ids,
> +and are translated back and forth on the ioctl level accordingly.
> +
> +
> +11.2 Wildcard names
> +-------------------
> +
> +Policy holder connections may upload names that contain the wild card suffix
s/wild card/wildcard/
> +(".*"). That way, a policy can be uploaded that is effective for every
> +well-known name that extends the provided name by exactly one more level.
> +
> +For example, if an item of a set up uploaded policy rules contains the name
I find that last line very difficult to parse. Someehting is broken.
What are you trying to say?
> +"foo.bar.*", both "foo.bar.baz" and "foo.bar.bazbaz" are valid, but
s/both/then both/ to help parsing.
> +"foo.bar.baz.baz" is not.
> +
> +This allows connections to take control over multiple names that the policy
> +holder doesn't need to know about when uploading the policy.
> +
> +Such wildcard entries are not allowed for custom endpoints.
> +
> +
> +11.3 Policy example
> +-------------------
> +
> +For example, a set of policy rules may look like this:
> +
> + KDBUS_ITEM_NAME: str='org.foo.bar'
> + KDBUS_ITEM_POLICY_ACCESS: type=USER, access=OWN, id=1000
> + KDBUS_ITEM_POLICY_ACCESS: type=USER, access=TALK, id=1001
> + KDBUS_ITEM_POLICY_ACCESS: type=WORLD, access=SEE
> + KDBUS_ITEM_NAME: str='org.blah.baz'
> + KDBUS_ITEM_POLICY_ACCESS: type=USER, access=OWN, id=0
> + KDBUS_ITEM_POLICY_ACCESS: type=WORLD, access=TALK
> +
> +That means that 'org.foo.bar' may only be owned by uid 1000, but every user on
> +the bus is allowed to see the name. However, only uid 1001 may actually send
> +a message to the connection and receive a reply from it.
> +
> +The second rule allows 'org.blah.baz' to be owned by uid 0 only, but every user
> +may talk to it.
> +
> +
> +11.4 TALK access and multiple well-known names per connection
> +-------------------------------------------------------------
> +
> +Note that TALK access is checked against all names of a connection.
> +For example, if a connection owns both 'org.foo.bar' and 'org.blah.baz', and
> +the policy database allows 'org.blah.baz' to be talked to by WORLD, then this
> +permission is also granted to 'org.foo.bar'. That might sound illogical, but
> +after all, we allow messages to be directed to either the ID or a well-known
> +name, and policy is applied to the connection, not the name. In other words,
> +the effective TALK policy for a connection is the most permissive of all names
> +the connection owns.
> +
> +For broadcast messages, the receiver needs TALK permissions to the sender to
> +receive the broadcast.
> +
> +If a policy database exists for a bus (because a policy holder created one on
> +demand) or for a custom endpoint (which always has one), each one is consulted
By "created one on demand" do you mean "explicitly created one"? If so
please use the latter wording, since its clearer.
> +during name registry listing, name owning or message delivery. If either one
What is "name owning". I think this could be worded better.
> +fails, the operation is failed with -EPERM.
> +
> +For best practices, connections that own names with a restricted TALK
> +access should not install matches. This avoids cases where the sent
> +message may pass the bloom filter due to false-positives and may also
> +satisfy the policy rules.
> +
> +
> +11.5 Implicit policies
> +----------------------
> +
> +Depending on the type of the endpoint, a set of implicit rules that
> +override installed policies might be enforced.
> +
> +On default endpoints, the following set is enforced and checked before
> +any user-supplied policy is checked.
> +
> + * Privileged connections always override any installed policy. Those
> + connections could easily install their own policies, so there is no
> + reason to enforce installed policies.
> + * Connections can always talk to connections of the same user. This
> + includes broadcast messages.
> +
> +Custom endpoints have stricter policies. The following rules apply:
> +
> + * Policy rules are always enforced, even if the connection is a privileged
> + connection.
> + * Policy rules are always enforced for TALK access, even if both ends are
> + running under the same user. This includes broadcast messages.
> + * To restrict the set of names that can be seen, endpoint policies can
> + install "SEE" policies.
> +
> +
> +12. Pool
> +===============================================================================
> +
> +A pool for data received from the kernel is installed for every connection of
> +the bus, and is sized according to the information stored in the
> +KDBUS_ITEM_BLOOM_PARAMETER item that is returned by KDBUS_CMD_HELLO.
> +
> +The pool is written to by the kernel when one of the following ioctls is issued:
> +
> + * KDBUS_CMD_HELLO, to receive details about the bus the connection was made to
> + * KDBUS_CMD_RECV, to receive a message
> + * KDBUS_CMD_NAME_LIST, to dump the name registry
> + * KDBUS_CMD_CONN_INFO, to retrieve information on a connection
> +
> +The offsets returned by either one of the aforementioned ioctls describe offsets
> +inside the pool. In order to make the slice available for subsequent calls,
> +KDBUS_CMD_FREE has to be called on the offset.
> +
> +To access the memory, the caller is expected to mmap() it to its task, like
s/to its task//
> +this:
> +
> + /*
> + * POOL_SIZE has to be a multiple of PAGE_SIZE, and it must match the
> + * value that was previously returned through the KDBUS_ITEM_BLOOM_PARAMETER
> + * item field when the KDBUS_CMD_HELLO ioctl returned.
> + */
> +
> + buf = mmap(NULL, POOL_SIZE, PROT_READ, MAP_SHARED, conn_fd, 0);
> +
> +Alternatively, instead of mapping the entire pool buffer, only parts of it can
> +be mapped. The length of the response is returned by the kernel along with the
> +offset for each of the ioctls listed above.
> +
> +
> +13. Metadata
> +===============================================================================
> +
> +kdbus records data about the system in certain situations. Such metadata can
> +refer to the currently active process (creds, PIDs, current user groups, process
> +names and its executable path, cgroup membership, capabilities, security label
> +and audit information), connection information (description string, currently
> +owned names) and the timestamp.
> +
> +Metadata is collected in the following circumstances:
> +
> + * When a bus is created (KDBUS_CMD_MAKE), information about the calling task
> + is collected. This data is returned by the kernel via the
> + KDBUS_CMD_BUS_CREATOR_INFO call-
s/-/./
> +
> + * When a connection is created (KDBUS_CMD_HELLO), information about the
> + calling task is collected. Alternatively, a privileged connection may
> + provide 'faked' information about credentials, PIDs and a security labels
s/a security/security/
> + which will be taken instead. This data is returned by the kernel as
> + information on a connection (KDBUS_CMD_CONN_INFO).
> +
> + * When a message is sent (KDBUS_CMD_SEND), information about the sending task
> + and the sending connection are collected. This metadata will be attached
> + to the message when it arrives in the receiver's pool. If the connection
> + sending the message installed faked credentials (see above), the message
> + will not be augmented by any information about the currently sending task.
> +
> +Which metadata items are actually delivered depends on the following sets and
> +masks:
> +
> + (a) the system-wide kmod creds mask (module parameter 'attach_flags_mask')
> + (b) the per-connection send creds mask, set by the connecting client
> + (c) the per-connection receive creds mask, set by the connecting client
> + (d) the per-bus minimal creds mask, set by the bus creator
> + (e) the per-bus owner creds mask, set by the bus creator
> + (f) the mask specified when querying creds of a bus peer
> + (g) the mask specified when querying creds of a bus owner
> +
> +With the following rules:
> +
> + [1] The creds attached to messages are determined as (a & b & c).
> + [2] When connecting to a bus (KDBUS_CMD_HELLO), and (~b & d) != 0, the call
> + will fail, the connection is refused.
> + [3] When querying creds of a bus peer, the creds returned are (a & b & f)
> + [4] When querying creds of a bus owner, the creds returned are (a & e & g)
> + [5] When creating a new bus, and (d & ~a) != 0, then bus creation will fail
> +
> +Hence, userspace might not always get all requested metadata items that it
> +requested. Code must be written so that it can cope with this fact.
> +
> +
> +13.1 Known item types
> +---------------------
> +
> +The following attach flags are currently supported.
> +
> + KDBUS_ATTACH_TIMESTAMP
> + Attaches an item of type KDBUS_ITEM_TIMESTAMP which contains both the
> + monotonic and the realtime timestamp, taken when the message was
> + processed on the kernel side.
> +
> + KDBUS_ATTACH_CREDS
> + Attaches an item of type KDBUS_ITEM_CREDS, containing credentials as
> + described in struct kdbus_creds: the user and group IDs in the usual four
> + flavors: real, effective, saved and file-system related.
> +
> + KDBUS_ATTACH_PIDS
> + Attaches an item of type KDBUS_ITEM_PIDS, containing information on the
> + process. In particular, the PID (process ID), TID (thread ID), and PPID
> + (PID of the parent process).
> +
> + KDBUS_ATTACH_AUXGROUPS
> + Attaches an item of type KDBUS_ITEM_AUXGROUPS, containing a dynamic
> + number of auxiliary groups the sending task was a member of.
> +
> + KDBUS_ATTACH_NAMES
> + Attaches items of type KDBUS_ITEM_OWNED_NAME, one for each name the sending
> + connection currently owns. The name and flags are stored in kdbus_item.name
> + for each of them.
> +
> + KDBUS_ATTACH_TID_COMM [*]
> + Attaches an items of type KDBUS_ITEM_TID_COMM, transporting the sending
> + task's 'comm', for the tid. The string is stored in kdbus_item.str.
> +
> + KDBUS_ATTACH_PID_COMM [*]
> + Attaches an items of type KDBUS_ITEM_PID_COMM, transporting the sending
> + task's 'comm', for the pid. The string is stored in kdbus_item.str.
> +
> + KDBUS_ATTACH_EXE [*]
> + Attaches an item of type KDBUS_ITEM_EXE, containing the path to the
> + executable of the sending task, stored in kdbus_item.str.
> +
> + KDBUS_ATTACH_CMDLINE [*]
> + Attaches an item of type KDBUS_ITEM_CMDLINE, containing the command line
> + arguments of the sending task, as an array of strings, stored in
> + kdbus_item.str.
> +
> + KDBUS_ATTACH_CGROUP
> + Attaches an item of type KDBUS_ITEM_CGROUP with the task's cgroup path.
> +
> + KDBUS_ATTACH_CAPS
> + Attaches an item of type KDBUS_ITEM_CAPS, carrying sets of capabilities
> + that should be accessed via kdbus_item.caps.caps. Also, userspace should
> + be written in a way that it takes kdbus_item.caps.last_cap into account,
> + and derive the number of sets and rows from the item size and the reported
> + number of valid capability bits.
> +
> + KDBUS_ATTACH_SECLABEL
> + Attaches an item of type KDBUS_ITEM_SECLABEL, which contains the SELinux
> + security label of the sending task. SELinux and other MACs might want to
> + do additional per-service security checks. For example, a service manager
> + might want to check the security label of a service file against the
> + security label of the client process checking the SELinux database before
> + allowing access. The label can be accessed via kdbus_item->str.
> +
> + KDBUS_ATTACH_AUDIT
> + Attaches an item of type KDBUS_ITEM_AUDIT, which contains the audit
> + sessionid and loginuid of the sending task. Access via kdbus_item->audit.
> +
> + KDBUS_ATTACH_CONN_DESCRIPTION
> + Attaches an item of type KDBUS_ITEM_CONN_DESCRIPTION that contains a
> + descriptive string of the sending peer. That string can be supplied
> + during HELLO by attaching an item of type KDBUS_ITEM_CONN_DESCRIPTION.
> +
> +
> +[*] Note that the content stored in these items can easily be tampered by
> + the sending tasks. Therefore, they should NOT be used for any sort of
> + security relevant assumptions. The only reason why they are transmitted is
> + to let receivers know about details that were set when metadata was
> + collected, even though the task they were collected from is not active any
> + longer when the items are received.
> +
> +
> +13.2 Metadata and namespaces
> +----------------------------
> +
> +Metadata such as PIDs, UIDs or GIDs are automatically translated to the
> +namespaces of the task that receives them.
> +
> +
> +14. Error codes
> +===============================================================================
> +
> +Below is a list of error codes that might be returned by the individual
> +ioctl commands. The list focuses on the return values from kdbus code itself,
> +and might not cover those of all kernel internal functions.
> +
> +For all ioctls:
> +
> + -ENOMEM The kernel memory is exhausted
> + -ENOTTY Illegal ioctl command issued for the file descriptor
Why ENOTTY here, rather than EINVAL? The latter is, I beleive, the usual
ioctl() error for invalid commands, I believe (If you keep ENOTTY, add an
explanation in this document.)
> + -ENOSYS The requested functionality is not available
Maybe add here an explanation or examples of why the functionality is
not available?
> + -EINVAL Unsupported item attached to command
> +
> +For all ioctls that carry a struct as payload:
> +
> + -EFAULT The supplied data pointer was not 64-bit aligned, or was
> + inaccessible from the kernel side.
> + -EINVAL The size inside the supplied struct was smaller than expected
> + -EMSGSIZE The size inside the supplied struct was bigger than expected
Why two different errors for smaller and larger than expected? (If you keep things this
way, pelase explain the reason in this document.)
> + -ENAMETOOLONG A supplied name is larger than the allowed maximum size
> +
> +For KDBUS_CMD_BUS_MAKE:
> +
> + -EINVAL The flags supplied in the kdbus_cmd_make struct are invalid or
> + the supplied name does not start with the current uid and a '-'
> + -EEXIST A bus of that name already exists
> + -ESHUTDOWN The domain for the bus is already shut down
> + -EMFILE The maximum number of buses for the current user is exhausted
> +
> +For KDBUS_CMD_ENDPOINT_MAKE:
> +
> + -EPERM The calling user is not privileged (see Terminology)
> + -EINVAL The flags supplied in the kdbus_cmd_make struct are invalid
> + -EEXIST An endpoint of that name already exists
> +
> +For KDBUS_CMD_HELLO:
> +
> + -EFAULT The supplied pool size was 0 or not a multiple of the page size
> + -EINVAL The flags supplied in the kdbus_cmd_make struct are invalid, or
> + an illegal combination of KDBUS_HELLO_MONITOR,
> + KDBUS_HELLO_ACTIVATOR and KDBUS_HELLO_POLICY_HOLDER was passed
> + in the flags, or an invalid set of items was supplied
> + -ECONNREFUSED The attach_flags_send field did not satisfy the requirements of
> + the bus
> + -EPERM An KDBUS_ITEM_CREDS items was supplied, but the current user is
> + not privileged
> + -ESHUTDOWN The bus has already been shut down
> + -EMFILE The maximum number of connection on the bus has been reached
s/connection/connections/
> + -EOPNOTSUPP The endpoint does not support the connection flags
> + supplied in the kdbus_cmd_hello struct
> +
> +For KDBUS_CMD_BYEBYE:
> +
> + -EALREADY The connection has already been shut down
> + -EBUSY There are still messages queued up in the connection's pool
> +
> +For KDBUS_CMD_SEND:
> +
> + -EOPNOTSUPP The connection is not an ordinary connection, or the passed
> + file descriptors are either kdbus handles or unix domain
> + sockets. Both are currently unsupported
> + -EINVAL The submitted payload type is KDBUS_PAYLOAD_KERNEL,
> + KDBUS_MSG_EXPECT_REPLY was set without timeout or cookie
> + values, KDBUS_MSG_SYNC_REPLY was set without
> + KDBUS_MSG_EXPECT_REPLY, an invalid item was supplied,
> + src_id was != 0 and different from the current connection's ID,
s/src_id was != 0 and different/src_id was nonzero and was different/
> + a supplied memfd had a size of 0, a string was not properly
> + null-terminated
> + -ENOTUNIQ The supplied destination is KDBUS_DST_ID_BROADCAST, a file
> + descriptor was passed, KDBUS_MSG_EXPECT_REPLY was set,
> + or a timeout was given for a broadcast message
> + -E2BIG Too many items
> + -EMSGSIZE The size of the message header and items or the payload vector
> + is too big.
> + -EEXIST Multiple KDBUS_ITEM_FDS, KDBUS_ITEM_BLOOM_FILTER or
> + KDBUS_ITEM_DST_NAME items were supplied
> + -EBADF The supplied KDBUS_ITEM_FDS or KDBUS_MSG_PAYLOAD_MEMFD items
> + contained an illegal file descriptor
> + -EMEDIUMTYPE The supplied memfd is not a sealed kdbus memfd
> + -EMFILE Too many file descriptors inside a KDBUS_ITEM_FDS
> + -EBADMSG An item had illegal size, both a dst_id and a
> + KDBUS_ITEM_DST_NAME was given, or both a name and a bloom
> + filter was given
> + -ETXTBSY The supplied kdbus memfd file cannot be sealed or the seal
> + was removed, because it is shared with other processes or
> + still mmap()ed
> + -ECOMM A peer does not accept the file descriptors addressed to it
> + -EFAULT The supplied bloom filter size was not 64-bit aligned
> + -EDOM The supplied bloom filter size did not match the bloom filter
> + size of the bus
> + -EDESTADDRREQ dst_id was set to KDBUS_DST_ID_NAME, but no KDBUS_ITEM_DST_NAME
> + was attached
> + -ESRCH The name to look up was not found in the name registry
> + -EADDRNOTAVAIL KDBUS_MSG_NO_AUTO_START was given but the destination
> + connection is an activator.
> + -ENXIO The passed numeric destination connection ID couldn't be found,
> + or is not connected
> + -ECONNRESET The destination connection is no longer active
> + -ETIMEDOUT Timeout while synchronously waiting for a reply
> + -EINTR System call interrupted while synchronously waiting for a reply
> + -EPIPE When sending a message, a synchronous reply from the receiving
> + connection was expected but the connection died before
> + answering
> + -ENOBUFS Too many pending messages on the receiver side
> + -EREMCHG Both a well-known name and a unique name (ID) was given, but
> + the name is not currently owned by that connection.
> + -EXFULL The memory pool of the receiver is full
> + -EREMOTEIO While synchronously waiting for a reply, the remote peer
> + failed with an I/O error.
> +
> +For KDBUS_CMD_RECV:
> +
> + -EINVAL Invalid flags or offset
> + -EAGAIN No message found in the queue
> + -ENOMSG No message of the requested priority found
> + -EOVERFLOW Broadcast messages have been lost
> +
> +For KDBUS_CMD_FREE:
> +
> + -ENXIO No pool slice found at given offset
> + -EINVAL Invalid flags provided, the offset is valid, but the user is
> + not allowed to free the slice. This happens, for example, if
> + the offset was retrieved with KDBUS_RECV_PEEK.
It would be easier to read if this was written to clarify that there are
two distinct error cases:
-EINVAL Invalid flags provided.
-EINVAL The offset is valid, but the user is
not allowed to free the slice. This happens, for example, if
the offset was retrieved with KDBUS_RECV_PEEK.
> +For KDBUS_CMD_NAME_ACQUIRE:
> +
> + -EINVAL Illegal command flags, illegal name provided, or an activator
> + tried to acquire a second name
> + -EPERM Policy prohibited name ownership
> + -EALREADY Connection already owns that name
> + -EEXIST The name already exists and can not be taken over
> + -E2BIG The maximum number of well-known names per connection
> + is exhausted
> + -ECONNRESET The connection was reset during the call
> +
> +For KDBUS_CMD_NAME_RELEASE:
> +
> + -EINVAL Invalid command flags, or invalid name provided
> + -ESRCH Name is not found found in the registry
> + -EADDRINUSE Name is owned by a different connection and can't be released
> +
> +For KDBUS_CMD_NAME_LIST:
> +
> + -EINVAL Invalid flags
> + -ENOBUFS No available memory in the connection's pool.
> +
> +For KDBUS_CMD_CONN_INFO:
> +
> + -EINVAL Invalid flags, or neither an ID nor a name was provided,
> + or the name is invalid.
> + -ESRCH Connection lookup by name failed
> + -ENXIO No connection with the provided connection ID found
> +
> +For KDBUS_CMD_CONN_UPDATE:
> +
> + -EINVAL Illegal flags or items
> + -EOPNOTSUPP Operation not supported by connection.
> + -E2BIG Too many policy items attached
> + -EINVAL Wildcards submitted in policy entries, or illegal sequence
> + of policy items
> +
> +For KDBUS_CMD_ENDPOINT_UPDATE:
> +
> + -E2BIG Too many policy items attached
> + -EINVAL Invalid flags, or wildcards submitted in policy entries,
> + or illegal sequence of policy items
> +
> +For KDBUS_CMD_MATCH_ADD:
> +
> + -EINVAL Illegal flags or items
> + -EDOM Illegal bloom filter size
> + -EMFILE Too many matches for this connection
> +
> +For KDBUS_CMD_MATCH_REMOVE:
> +
> + -EINVAL Illegal flags
> + -ENOENT A match entry with the given cookie could not be found.
> +
> +
> +15. Internal object relations
> +===============================================================================
> +
> +This is a simplified outline of the internal kdbus object relations, for
> +those interested in the inner life of the driver implementation.
> +
> +From the a mount point's (domain's) perspective:
> +
> +struct kdbus_domain
> + |» struct kdbus_domain_user *user (many, owned)
> + '» struct kdbus_node node (embedded)
> + |» struct kdbus_node children (many, referenced)
> + |» struct kdbus_node *parent (pinned)
> + '» struct kdbus_bus (many, pinned)
> + |» struct kdbus_node node (embedded)
> + '» struct kdbus_ep (many, pinned)
> + |» struct kdbus_node node (embedded)
> + |» struct kdbus_bus *bus (pinned)
> + |» struct kdbus_conn conn_list (many, pinned)
> + | |» struct kdbus_ep *ep (pinned)
> + | |» struct kdbus_name_entry *activator_of (owned)
> + | |» struct kdbus_match_db *match_db (owned)
> + | |» struct kdbus_meta *meta (owned)
> + | |» struct kdbus_match_db *match_db (owned)
> + | | '» struct kdbus_match_entry (many, owned)
> + | |
> + | |» struct kdbus_pool *pool (owned)
> + | | '» struct kdbus_pool_slice *slices (many, owned)
> + | | '» struct kdbus_pool *pool (pinned)
> + | |
> + | |» struct kdbus_domain_user *user (pinned)
> + | `» struct kdbus_queue_entry entries (many, embedded)
> + | |» struct kdbus_pool_slice *slice (pinned)
> + | |» struct kdbus_conn_reply *reply (owned)
> + | '» struct kdbus_domain_user *user (pinned)
> + |
> + '» struct kdbus_domain_user *user (pinned)
> + '» struct kdbus_policy_db policy_db (embedded)
> + |» struct kdbus_policy_db_entry (many, owned)
> + | |» struct kdbus_conn (pinned)
> + | '» struct kdbus_ep (pinned)
> + |
> + '» struct kdbus_policy_db_cache_entry (many, owned)
> + '» struct kdbus_conn (pinned)
> +
> +
> +For the life-time of a file descriptor derived from calling open() on a file
> +inside the mount point:
> +
> +struct kdbus_handle
> + |» struct kdbus_meta *meta (owned)
> + |» struct kdbus_ep *ep (pinned)
> + |» struct kdbus_conn *conn (owned)
> + '» struct kdbus_ep *ep (owned)
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: [PATCH v3 00/13] Add kdbus implementation
From: Michael Kerrisk (man-pages) @ 2015-01-20 14:05 UTC (permalink / raw)
To: Greg Kroah-Hartman, arnd-r2nGTMty4D4,
ebiederm-aS9lmoZGLiVWk0Htik3J/w,
gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
daniel-cYrQPVfZooxQFI55V6+gNQ, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <1421435777-25306-1-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>
On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> kdbus is a kernel-level IPC implementation that aims for resemblance to
> the the protocol layer with the existing userspace D-Bus daemon while
> enabling some features that couldn't be implemented before in userspace.
>
> The documentation in the first patch in this series explains the
> protocol and the API details.
>
> Full details on what has changed from the v2 submission are at the
> bottom of this email.
>
> Reasons why this should be done in the kernel, instead of userspace as
> it is currently done today include the following:
>
> - performance: fewer process context switches, fewer copies, fewer
> syscalls, larger memory chunks via memfd. This is really important
> for a whole class of userspace programs that are ported from other
> operating systems that are run on tiny ARM systems that rely on
> hundreds of thousands of messages passed at boot time, and at
> "critical" times in their user interaction loops.
> - security: the peers which communicate do not have to trust each other,
> as the only trustworthy compoenent in the game is the kernel which
> adds metadata and ensures that all data passed as payload is either
> copied or sealed, so that the receiver can parse the data without
> having to protect against changing memory while parsing buffers. Also,
> all the data transfer is controlled by the kernel, so that LSMs can
> track and control what is going on, without involving userspace.
> Because of the LSM issue, security people are much happier with this
> model than the current scheme of having to hook into dbus to mediate
> things.
> - more metadata can be attached to messages than in userspace
> - semantics for apps with heavy data payloads (media apps, for instance)
> with optinal priority message dequeuing, and global message ordering.
> Some "crazy" people are playing with using kdbus for audio data in the
> system. I'm not saying that this is the best model for this, but
> until now, there wasn't any other way to do this without having to
> create custom "busses", one for each application library.
> - being in the kernle closes a lot of races which can't be fixed with
> the current userspace solutions. For example, with kdbus, there is a
> way a client can disconnect from a bus, but do so only if no further
> messages present in its queue, which is crucial for implementing
> race-free "exit-on-idle" services
> - eavesdropping on the kernel level, so privileged users can hook into
> the message stream without hacking support for that into their
> userspace processes
> - a number of smaller benefits: for example kdbus learned a way to peek
> full messages without dequeing them, which is really useful for
> logging metadata when handling bus-activation requests.
>
> Of course, some of the bits above could be implemented in userspace
> alone, for example with more sophisticated memory management APIs, but
> this is usually done by losing out on the other details. For example,
> for many of the memory management APIs, it's hard to not require the
> communicating peers to fully trust each other. And we _really_ don't
> want peers to have to trust each other.
>
> Another benefit of having this in the kernel, rather than as a userspace
> daemon, is that you can now easily use the bus from the initrd, or up to
> the very end when the system shuts down. On current userspace D-Bus,
> this is not really possible, as this requires passing the bus instance
> around between initrd and the "real" system. Such a transition of all
> fds also requires keeping full state of what has already been read from
> the connection fds. kdbus makes this much simpler, as we can change the
> ownership of the bus, just by passing one fd over from one part to the
> other.
I tend to think that much of the above should also be part of the
documentation file (patch 01/13).
Cheers,
Michael
> Regarding binder: binder and kdbus follow very different design
> concepts. Binder implies the use of thread-pools to dispatch incoming
> method calls. This is a very efficient scheme, and completely natural
> in programming languages like Java. On most Linux programs, however,
> there's a much stronger focus on central poll() loops that dispatch all
> sources a program cares about. kdbus is much more usable in such
> environments, as it doesn't enforce a threading model, and it is happy
> with serialized dispatching. In fact, this major difference had an
> effect on much of the design decisions: binder does not guarantee global
> message ordering due to the parallel dispatching in the thread-pools,
> but kdbus does. Moreover, there's also a difference in the way message
> handling. In kdbus, every message is basically taken and dispatched as
> one blob, while in binder, continious connections to other peers are
> created, which are then used to send messages on. Hence, the models are
> quite different, and they serve different needs. I believe that the
> D-Bus/kdbus model is more compatible and friendly with how Linux
> programs are usually implemented.
>
> This can also be found in a git tree, the kdbus branch of char-misc.git at:
> https://git.kernel.org/cgit/linux/kernel/git/gregkh/char-misc.git/
>
> Changes since v2:
>
> * Add FS_USERNS_MOUNT to the file system flags, so users can mount
> their own kdbusfs instances without being root in the parent
> user-ns. Spotted by Andy Lutomirski.
>
> * Rewrite major parts of the metadata implementation to allow for
> per-recipient namespace translations. For this, namespaces are
> now not pinned by domains anymore. Instead, metadata is recorded
> in kernel scope, and exported into the currently active namespaces
> at the time of message installing.
>
> * Split PID and TID from KDBUS_ITEM_CREDS into KDBUS_ITEM_PIDS.
> The starttime is there to detect re-used PIDs, so move it to that
> new item type as well. Consequently, introduce struct kdbus_pids
> to accommodate the information. Requested by Andy Lutomirski.
>
> * Add {e,s,fs}{u,g}id to KDBUS_ITEM_CREDS, so users have a way to
> get more fine-grained credential information.
>
> * Removed KDBUS_CMD_CANCEL. The interface was not usable from
> threaded userspace implementation due to inherent races. Instead,
> add an item type CANCEL_FD which can be used to pass a file
> descriptor to the CMD_SEND ioctl. When the SEND is done
> synchronously, it will get cancelled as soon as the passed
> FD signals POLLIN.
>
> * Dropped startttime from KDBUS_ITEM_PIDS
>
> * Restrict names of custom endpoints to names with a "<uid>-" prefix,
> just like we do for buses.
>
> * Provide module-parameter "kdbus.attach_flags_mask" to specify the
> a mask of metadata items that is applied on all exported items.
>
> * Monitors are now entirely invisible (IOW, there won't be any
> notification when they are created) and they don't need to install
> filters for broadcast messages anymore.
>
> * All information exposed via a connection's pool now also reports
> the length in addition to the offset. That way, userspace
> applications can mmap() only parts of the pool on demand.
>
> * Due to the metadata rework, KDBUS_ITEM_PAYLOAD_OFF items now
> describe the offset relative to the pool, where they used to be
> relative to the message header.
>
> * Added return_flags bitmask to all kdbus_cmd_* structs, so the
> kernel can report details of the command processing. This is
> mostly reserved for future extensions.
>
> * Some fixes in kdbus.txt and tests, spotted by Harald Hoyer, Andy
> Lutomirski, Michele Curti, Sergei Zviagintsev, Sheng Yong, Torstein
> Husebø and Hristo Venev.
>
> * Fixed compiler warnings in test-message by Michele Curti
>
> * Unexpected items are now rejected with -EINVAL
>
> * Split signal and broadcast handling. Unicast signals are now
> supported, and messages have a new KDBUS_MSG_SIGNAL flag.
>
> * KDBUS_CMD_MSG_SEND was renamed to KDBUS_CMD_SEND, and now takes
> a struct kdbus_cmd_send instead of a kdbus_msg.
>
> * KDBUS_CMD_MSG_RECV was renamed to KDBUS_CMD_RECV.
>
> * Test case memory leak plugged, and various other cleanups and
> fixes, by Rui Miguel Silva.
>
> * Build fix for s390
>
> * Test case fix for 32bit archs
>
> * The test framework now supports mount, pid and user namespaces.
>
> * The test framework learned a --tap command line parameter to
> format its output in the "Test Anything Protocol". This format
> is chosen by default when "make kselftest" is invoked.
>
> * Fixed buses and custom endpoints name validation, reported by
> Andy Lutomirski.
>
> * copy_from_user() return code issue fixed, reported by
> Dan Carpenter.
>
> * Avoid signed int overflow on archs without atomic_sub
>
> * Avoid variable size stack items. Fixes a sparse warning in queue.c.
>
> * New test case for kernel notification quota
>
> * Switched back to enums for the list of ioctls. This has advantages
> for userspace code as gdb, for instance, is able to resolve the
> numbers into names. Added features can easily be detected with
> autotools, and new iotcls can get #defines as well. Having #defines
> for the initial set of ioctls is uncecessary.
>
> Daniel Mack (13):
> kdbus: add documentation
> kdbus: add header file
> kdbus: add driver skeleton, ioctl entry points and utility functions
> kdbus: add connection pool implementation
> kdbus: add connection, queue handling and message validation code
> kdbus: add node and filesystem implementation
> kdbus: add code to gather metadata
> kdbus: add code for notifications and matches
> kdbus: add code for buses, domains and endpoints
> kdbus: add name registry implementation
> kdbus: add policy database implementation
> kdbus: add Makefile, Kconfig and MAINTAINERS entry
> kdbus: add selftests
>
> Documentation/ioctl/ioctl-number.txt | 1 +
> Documentation/kdbus.txt | 2107 +++++++++++++++++++++
> MAINTAINERS | 12 +
> include/uapi/linux/Kbuild | 1 +
> include/uapi/linux/kdbus.h | 1049 ++++++++++
> include/uapi/linux/magic.h | 2 +
> init/Kconfig | 12 +
> ipc/Makefile | 2 +-
> ipc/kdbus/Makefile | 22 +
> ipc/kdbus/bus.c | 553 ++++++
> ipc/kdbus/bus.h | 103 +
> ipc/kdbus/connection.c | 2004 ++++++++++++++++++++
> ipc/kdbus/connection.h | 262 +++
> ipc/kdbus/domain.c | 350 ++++
> ipc/kdbus/domain.h | 84 +
> ipc/kdbus/endpoint.c | 232 +++
> ipc/kdbus/endpoint.h | 68 +
> ipc/kdbus/fs.c | 519 +++++
> ipc/kdbus/fs.h | 25 +
> ipc/kdbus/handle.c | 1134 +++++++++++
> ipc/kdbus/handle.h | 20 +
> ipc/kdbus/item.c | 309 +++
> ipc/kdbus/item.h | 57 +
> ipc/kdbus/limits.h | 95 +
> ipc/kdbus/main.c | 72 +
> ipc/kdbus/match.c | 535 ++++++
> ipc/kdbus/match.h | 32 +
> ipc/kdbus/message.c | 598 ++++++
> ipc/kdbus/message.h | 133 ++
> ipc/kdbus/metadata.c | 1066 +++++++++++
> ipc/kdbus/metadata.h | 52 +
> ipc/kdbus/names.c | 891 +++++++++
> ipc/kdbus/names.h | 82 +
> ipc/kdbus/node.c | 910 +++++++++
> ipc/kdbus/node.h | 87 +
> ipc/kdbus/notify.c | 244 +++
> ipc/kdbus/notify.h | 30 +
> ipc/kdbus/policy.c | 481 +++++
> ipc/kdbus/policy.h | 51 +
> ipc/kdbus/pool.c | 784 ++++++++
> ipc/kdbus/pool.h | 47 +
> ipc/kdbus/queue.c | 505 +++++
> ipc/kdbus/queue.h | 108 ++
> ipc/kdbus/reply.c | 262 +++
> ipc/kdbus/reply.h | 68 +
> ipc/kdbus/util.c | 317 ++++
> ipc/kdbus/util.h | 133 ++
> tools/testing/selftests/Makefile | 1 +
> tools/testing/selftests/kdbus/.gitignore | 11 +
> tools/testing/selftests/kdbus/Makefile | 46 +
> tools/testing/selftests/kdbus/kdbus-enum.c | 95 +
> tools/testing/selftests/kdbus/kdbus-enum.h | 14 +
> tools/testing/selftests/kdbus/kdbus-test.c | 920 +++++++++
> tools/testing/selftests/kdbus/kdbus-test.h | 85 +
> tools/testing/selftests/kdbus/kdbus-util.c | 1646 ++++++++++++++++
> tools/testing/selftests/kdbus/kdbus-util.h | 216 +++
> tools/testing/selftests/kdbus/test-activator.c | 319 ++++
> tools/testing/selftests/kdbus/test-attach-flags.c | 751 ++++++++
> tools/testing/selftests/kdbus/test-benchmark.c | 427 +++++
> tools/testing/selftests/kdbus/test-bus.c | 174 ++
> tools/testing/selftests/kdbus/test-chat.c | 123 ++
> tools/testing/selftests/kdbus/test-connection.c | 611 ++++++
> tools/testing/selftests/kdbus/test-daemon.c | 66 +
> tools/testing/selftests/kdbus/test-endpoint.c | 344 ++++
> tools/testing/selftests/kdbus/test-fd.c | 710 +++++++
> tools/testing/selftests/kdbus/test-free.c | 36 +
> tools/testing/selftests/kdbus/test-match.c | 442 +++++
> tools/testing/selftests/kdbus/test-message.c | 658 +++++++
> tools/testing/selftests/kdbus/test-metadata-ns.c | 507 +++++
> tools/testing/selftests/kdbus/test-monitor.c | 158 ++
> tools/testing/selftests/kdbus/test-names.c | 184 ++
> tools/testing/selftests/kdbus/test-policy-ns.c | 633 +++++++
> tools/testing/selftests/kdbus/test-policy-priv.c | 1270 +++++++++++++
> tools/testing/selftests/kdbus/test-policy.c | 81 +
> tools/testing/selftests/kdbus/test-race.c | 313 +++
> tools/testing/selftests/kdbus/test-sync.c | 368 ++++
> tools/testing/selftests/kdbus/test-timeout.c | 99 +
> 77 files changed, 27818 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/kdbus.txt
> create mode 100644 include/uapi/linux/kdbus.h
> create mode 100644 ipc/kdbus/Makefile
> create mode 100644 ipc/kdbus/bus.c
> create mode 100644 ipc/kdbus/bus.h
> create mode 100644 ipc/kdbus/connection.c
> create mode 100644 ipc/kdbus/connection.h
> create mode 100644 ipc/kdbus/domain.c
> create mode 100644 ipc/kdbus/domain.h
> create mode 100644 ipc/kdbus/endpoint.c
> create mode 100644 ipc/kdbus/endpoint.h
> create mode 100644 ipc/kdbus/fs.c
> create mode 100644 ipc/kdbus/fs.h
> create mode 100644 ipc/kdbus/handle.c
> create mode 100644 ipc/kdbus/handle.h
> create mode 100644 ipc/kdbus/item.c
> create mode 100644 ipc/kdbus/item.h
> create mode 100644 ipc/kdbus/limits.h
> create mode 100644 ipc/kdbus/main.c
> create mode 100644 ipc/kdbus/match.c
> create mode 100644 ipc/kdbus/match.h
> create mode 100644 ipc/kdbus/message.c
> create mode 100644 ipc/kdbus/message.h
> create mode 100644 ipc/kdbus/metadata.c
> create mode 100644 ipc/kdbus/metadata.h
> create mode 100644 ipc/kdbus/names.c
> create mode 100644 ipc/kdbus/names.h
> create mode 100644 ipc/kdbus/node.c
> create mode 100644 ipc/kdbus/node.h
> create mode 100644 ipc/kdbus/notify.c
> create mode 100644 ipc/kdbus/notify.h
> create mode 100644 ipc/kdbus/policy.c
> create mode 100644 ipc/kdbus/policy.h
> create mode 100644 ipc/kdbus/pool.c
> create mode 100644 ipc/kdbus/pool.h
> create mode 100644 ipc/kdbus/queue.c
> create mode 100644 ipc/kdbus/queue.h
> create mode 100644 ipc/kdbus/reply.c
> create mode 100644 ipc/kdbus/reply.h
> create mode 100644 ipc/kdbus/util.c
> create mode 100644 ipc/kdbus/util.h
> create mode 100644 tools/testing/selftests/kdbus/.gitignore
> create mode 100644 tools/testing/selftests/kdbus/Makefile
> create mode 100644 tools/testing/selftests/kdbus/kdbus-enum.c
> create mode 100644 tools/testing/selftests/kdbus/kdbus-enum.h
> create mode 100644 tools/testing/selftests/kdbus/kdbus-test.c
> create mode 100644 tools/testing/selftests/kdbus/kdbus-test.h
> create mode 100644 tools/testing/selftests/kdbus/kdbus-util.c
> create mode 100644 tools/testing/selftests/kdbus/kdbus-util.h
> create mode 100644 tools/testing/selftests/kdbus/test-activator.c
> create mode 100644 tools/testing/selftests/kdbus/test-attach-flags.c
> create mode 100644 tools/testing/selftests/kdbus/test-benchmark.c
> create mode 100644 tools/testing/selftests/kdbus/test-bus.c
> create mode 100644 tools/testing/selftests/kdbus/test-chat.c
> create mode 100644 tools/testing/selftests/kdbus/test-connection.c
> create mode 100644 tools/testing/selftests/kdbus/test-daemon.c
> create mode 100644 tools/testing/selftests/kdbus/test-endpoint.c
> create mode 100644 tools/testing/selftests/kdbus/test-fd.c
> create mode 100644 tools/testing/selftests/kdbus/test-free.c
> create mode 100644 tools/testing/selftests/kdbus/test-match.c
> create mode 100644 tools/testing/selftests/kdbus/test-message.c
> create mode 100644 tools/testing/selftests/kdbus/test-metadata-ns.c
> create mode 100644 tools/testing/selftests/kdbus/test-monitor.c
> create mode 100644 tools/testing/selftests/kdbus/test-names.c
> create mode 100644 tools/testing/selftests/kdbus/test-policy-ns.c
> create mode 100644 tools/testing/selftests/kdbus/test-policy-priv.c
> create mode 100644 tools/testing/selftests/kdbus/test-policy.c
> create mode 100644 tools/testing/selftests/kdbus/test-race.c
> create mode 100644 tools/testing/selftests/kdbus/test-sync.c
> create mode 100644 tools/testing/selftests/kdbus/test-timeout.c
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: [PATCH v3 00/13] Add kdbus implementation
From: Michael Kerrisk (man-pages) @ 2015-01-20 14:12 UTC (permalink / raw)
To: Johannes Stezenbach, Greg Kroah-Hartman
Cc: mtk.manpages, arnd, ebiederm, gnomes, teg, jkosina, luto,
linux-api, linux-kernel, daniel, dh.herrmann, tixxdz
In-Reply-To: <20150120132437.GB7545@sig21.net>
On 01/20/2015 02:24 PM, Johannes Stezenbach wrote:
> On Tue, Jan 20, 2015 at 07:26:09PM +0800, Greg Kroah-Hartman wrote:
>> On Tue, Jan 20, 2015 at 11:57:12AM +0100, Johannes Stezenbach wrote:
>>> My guess is that the people porting from QNX were just confused
>>> and their use of D-Bus was in error. Maybe they should've used
>>> plain sockets, capnproto, ZeroMQ or whatever.
>>
>> I tend to trust that they knew what they were doing, they wouldn't have
>> picked D-Bus for no good reason.
>
> The automotive developers I had the pleasure to work with would
> use anything which is available via a mouse click in the
> commercial Embedded Linux SDK IDE of their choice :)
> Let's face it: QNX has a single IPC solution while Linux has
> a confusing multitude of possibilities.
Greg, from my spell in IVI, I too have to say your faith in the
wisdom of IVI developers' choices is touching. I think D-Bus was
in the main picked because it had some nice features, but then
people realized it had no bandwidth, and the solution has been
"make D-Bus faster", rather than "maybe we should explore
other (mixed model) solutions". This isn't to say that I'm
against adding kdbus, but I don't think there's much strength to
the argument you make above.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: [PATCH v3 00/13] Add kdbus implementation
From: Michael Kerrisk (man-pages) @ 2015-01-20 14:15 UTC (permalink / raw)
To: Greg Kroah-Hartman, arnd, ebiederm, gnomes, teg, jkosina, luto,
linux-api, linux-kernel
Cc: mtk.manpages, Daniel Mack, dh.herrmann, tixxdz
In-Reply-To: <1421435777-25306-1-git-send-email-gregkh@linuxfoundation.org>
[Bother. Futzed Daniel Mack's email address. Resending]
On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> kdbus is a kernel-level IPC implementation that aims for resemblance to
> the the protocol layer with the existing userspace D-Bus daemon while
> enabling some features that couldn't be implemented before in userspace.
>
> The documentation in the first patch in this series explains the
> protocol and the API details.
>
> Full details on what has changed from the v2 submission are at the
> bottom of this email.
>
> Reasons why this should be done in the kernel, instead of userspace as
> it is currently done today include the following:
>
> - performance: fewer process context switches, fewer copies, fewer
> syscalls, larger memory chunks via memfd. This is really important
> for a whole class of userspace programs that are ported from other
> operating systems that are run on tiny ARM systems that rely on
> hundreds of thousands of messages passed at boot time, and at
> "critical" times in their user interaction loops.
> - security: the peers which communicate do not have to trust each other,
> as the only trustworthy compoenent in the game is the kernel which
> adds metadata and ensures that all data passed as payload is either
> copied or sealed, so that the receiver can parse the data without
> having to protect against changing memory while parsing buffers. Also,
> all the data transfer is controlled by the kernel, so that LSMs can
> track and control what is going on, without involving userspace.
> Because of the LSM issue, security people are much happier with this
> model than the current scheme of having to hook into dbus to mediate
> things.
> - more metadata can be attached to messages than in userspace
> - semantics for apps with heavy data payloads (media apps, for instance)
> with optinal priority message dequeuing, and global message ordering.
> Some "crazy" people are playing with using kdbus for audio data in the
> system. I'm not saying that this is the best model for this, but
> until now, there wasn't any other way to do this without having to
> create custom "busses", one for each application library.
> - being in the kernle closes a lot of races which can't be fixed with
> the current userspace solutions. For example, with kdbus, there is a
> way a client can disconnect from a bus, but do so only if no further
> messages present in its queue, which is crucial for implementing
> race-free "exit-on-idle" services
> - eavesdropping on the kernel level, so privileged users can hook into
> the message stream without hacking support for that into their
> userspace processes
> - a number of smaller benefits: for example kdbus learned a way to peek
> full messages without dequeing them, which is really useful for
> logging metadata when handling bus-activation requests.
>
> Of course, some of the bits above could be implemented in userspace
> alone, for example with more sophisticated memory management APIs, but
> this is usually done by losing out on the other details. For example,
> for many of the memory management APIs, it's hard to not require the
> communicating peers to fully trust each other. And we _really_ don't
> want peers to have to trust each other.
>
> Another benefit of having this in the kernel, rather than as a userspace
> daemon, is that you can now easily use the bus from the initrd, or up to
> the very end when the system shuts down. On current userspace D-Bus,
> this is not really possible, as this requires passing the bus instance
> around between initrd and the "real" system. Such a transition of all
> fds also requires keeping full state of what has already been read from
> the connection fds. kdbus makes this much simpler, as we can change the
> ownership of the bus, just by passing one fd over from one part to the
> other.
I tend to think that much of the above should also be part of the
documentation file (patch 01/13).
Cheers,
Michael
> Regarding binder: binder and kdbus follow very different design
> concepts. Binder implies the use of thread-pools to dispatch incoming
> method calls. This is a very efficient scheme, and completely natural
> in programming languages like Java. On most Linux programs, however,
> there's a much stronger focus on central poll() loops that dispatch all
> sources a program cares about. kdbus is much more usable in such
> environments, as it doesn't enforce a threading model, and it is happy
> with serialized dispatching. In fact, this major difference had an
> effect on much of the design decisions: binder does not guarantee global
> message ordering due to the parallel dispatching in the thread-pools,
> but kdbus does. Moreover, there's also a difference in the way message
> handling. In kdbus, every message is basically taken and dispatched as
> one blob, while in binder, continious connections to other peers are
> created, which are then used to send messages on. Hence, the models are
> quite different, and they serve different needs. I believe that the
> D-Bus/kdbus model is more compatible and friendly with how Linux
> programs are usually implemented.
>
> This can also be found in a git tree, the kdbus branch of char-misc.git at:
> https://git.kernel.org/cgit/linux/kernel/git/gregkh/char-misc.git/
>
> Changes since v2:
>
> * Add FS_USERNS_MOUNT to the file system flags, so users can mount
> their own kdbusfs instances without being root in the parent
> user-ns. Spotted by Andy Lutomirski.
>
> * Rewrite major parts of the metadata implementation to allow for
> per-recipient namespace translations. For this, namespaces are
> now not pinned by domains anymore. Instead, metadata is recorded
> in kernel scope, and exported into the currently active namespaces
> at the time of message installing.
>
> * Split PID and TID from KDBUS_ITEM_CREDS into KDBUS_ITEM_PIDS.
> The starttime is there to detect re-used PIDs, so move it to that
> new item type as well. Consequently, introduce struct kdbus_pids
> to accommodate the information. Requested by Andy Lutomirski.
>
> * Add {e,s,fs}{u,g}id to KDBUS_ITEM_CREDS, so users have a way to
> get more fine-grained credential information.
>
> * Removed KDBUS_CMD_CANCEL. The interface was not usable from
> threaded userspace implementation due to inherent races. Instead,
> add an item type CANCEL_FD which can be used to pass a file
> descriptor to the CMD_SEND ioctl. When the SEND is done
> synchronously, it will get cancelled as soon as the passed
> FD signals POLLIN.
>
> * Dropped startttime from KDBUS_ITEM_PIDS
>
> * Restrict names of custom endpoints to names with a "<uid>-" prefix,
> just like we do for buses.
>
> * Provide module-parameter "kdbus.attach_flags_mask" to specify the
> a mask of metadata items that is applied on all exported items.
>
> * Monitors are now entirely invisible (IOW, there won't be any
> notification when they are created) and they don't need to install
> filters for broadcast messages anymore.
>
> * All information exposed via a connection's pool now also reports
> the length in addition to the offset. That way, userspace
> applications can mmap() only parts of the pool on demand.
>
> * Due to the metadata rework, KDBUS_ITEM_PAYLOAD_OFF items now
> describe the offset relative to the pool, where they used to be
> relative to the message header.
>
> * Added return_flags bitmask to all kdbus_cmd_* structs, so the
> kernel can report details of the command processing. This is
> mostly reserved for future extensions.
>
> * Some fixes in kdbus.txt and tests, spotted by Harald Hoyer, Andy
> Lutomirski, Michele Curti, Sergei Zviagintsev, Sheng Yong, Torstein
> Husebø and Hristo Venev.
>
> * Fixed compiler warnings in test-message by Michele Curti
>
> * Unexpected items are now rejected with -EINVAL
>
> * Split signal and broadcast handling. Unicast signals are now
> supported, and messages have a new KDBUS_MSG_SIGNAL flag.
>
> * KDBUS_CMD_MSG_SEND was renamed to KDBUS_CMD_SEND, and now takes
> a struct kdbus_cmd_send instead of a kdbus_msg.
>
> * KDBUS_CMD_MSG_RECV was renamed to KDBUS_CMD_RECV.
>
> * Test case memory leak plugged, and various other cleanups and
> fixes, by Rui Miguel Silva.
>
> * Build fix for s390
>
> * Test case fix for 32bit archs
>
> * The test framework now supports mount, pid and user namespaces.
>
> * The test framework learned a --tap command line parameter to
> format its output in the "Test Anything Protocol". This format
> is chosen by default when "make kselftest" is invoked.
>
> * Fixed buses and custom endpoints name validation, reported by
> Andy Lutomirski.
>
> * copy_from_user() return code issue fixed, reported by
> Dan Carpenter.
>
> * Avoid signed int overflow on archs without atomic_sub
>
> * Avoid variable size stack items. Fixes a sparse warning in queue.c.
>
> * New test case for kernel notification quota
>
> * Switched back to enums for the list of ioctls. This has advantages
> for userspace code as gdb, for instance, is able to resolve the
> numbers into names. Added features can easily be detected with
> autotools, and new iotcls can get #defines as well. Having #defines
> for the initial set of ioctls is uncecessary.
>
> Daniel Mack (13):
> kdbus: add documentation
> kdbus: add header file
> kdbus: add driver skeleton, ioctl entry points and utility functions
> kdbus: add connection pool implementation
> kdbus: add connection, queue handling and message validation code
> kdbus: add node and filesystem implementation
> kdbus: add code to gather metadata
> kdbus: add code for notifications and matches
> kdbus: add code for buses, domains and endpoints
> kdbus: add name registry implementation
> kdbus: add policy database implementation
> kdbus: add Makefile, Kconfig and MAINTAINERS entry
> kdbus: add selftests
>
> Documentation/ioctl/ioctl-number.txt | 1 +
> Documentation/kdbus.txt | 2107 +++++++++++++++++++++
> MAINTAINERS | 12 +
> include/uapi/linux/Kbuild | 1 +
> include/uapi/linux/kdbus.h | 1049 ++++++++++
> include/uapi/linux/magic.h | 2 +
> init/Kconfig | 12 +
> ipc/Makefile | 2 +-
> ipc/kdbus/Makefile | 22 +
> ipc/kdbus/bus.c | 553 ++++++
> ipc/kdbus/bus.h | 103 +
> ipc/kdbus/connection.c | 2004 ++++++++++++++++++++
> ipc/kdbus/connection.h | 262 +++
> ipc/kdbus/domain.c | 350 ++++
> ipc/kdbus/domain.h | 84 +
> ipc/kdbus/endpoint.c | 232 +++
> ipc/kdbus/endpoint.h | 68 +
> ipc/kdbus/fs.c | 519 +++++
> ipc/kdbus/fs.h | 25 +
> ipc/kdbus/handle.c | 1134 +++++++++++
> ipc/kdbus/handle.h | 20 +
> ipc/kdbus/item.c | 309 +++
> ipc/kdbus/item.h | 57 +
> ipc/kdbus/limits.h | 95 +
> ipc/kdbus/main.c | 72 +
> ipc/kdbus/match.c | 535 ++++++
> ipc/kdbus/match.h | 32 +
> ipc/kdbus/message.c | 598 ++++++
> ipc/kdbus/message.h | 133 ++
> ipc/kdbus/metadata.c | 1066 +++++++++++
> ipc/kdbus/metadata.h | 52 +
> ipc/kdbus/names.c | 891 +++++++++
> ipc/kdbus/names.h | 82 +
> ipc/kdbus/node.c | 910 +++++++++
> ipc/kdbus/node.h | 87 +
> ipc/kdbus/notify.c | 244 +++
> ipc/kdbus/notify.h | 30 +
> ipc/kdbus/policy.c | 481 +++++
> ipc/kdbus/policy.h | 51 +
> ipc/kdbus/pool.c | 784 ++++++++
> ipc/kdbus/pool.h | 47 +
> ipc/kdbus/queue.c | 505 +++++
> ipc/kdbus/queue.h | 108 ++
> ipc/kdbus/reply.c | 262 +++
> ipc/kdbus/reply.h | 68 +
> ipc/kdbus/util.c | 317 ++++
> ipc/kdbus/util.h | 133 ++
> tools/testing/selftests/Makefile | 1 +
> tools/testing/selftests/kdbus/.gitignore | 11 +
> tools/testing/selftests/kdbus/Makefile | 46 +
> tools/testing/selftests/kdbus/kdbus-enum.c | 95 +
> tools/testing/selftests/kdbus/kdbus-enum.h | 14 +
> tools/testing/selftests/kdbus/kdbus-test.c | 920 +++++++++
> tools/testing/selftests/kdbus/kdbus-test.h | 85 +
> tools/testing/selftests/kdbus/kdbus-util.c | 1646 ++++++++++++++++
> tools/testing/selftests/kdbus/kdbus-util.h | 216 +++
> tools/testing/selftests/kdbus/test-activator.c | 319 ++++
> tools/testing/selftests/kdbus/test-attach-flags.c | 751 ++++++++
> tools/testing/selftests/kdbus/test-benchmark.c | 427 +++++
> tools/testing/selftests/kdbus/test-bus.c | 174 ++
> tools/testing/selftests/kdbus/test-chat.c | 123 ++
> tools/testing/selftests/kdbus/test-connection.c | 611 ++++++
> tools/testing/selftests/kdbus/test-daemon.c | 66 +
> tools/testing/selftests/kdbus/test-endpoint.c | 344 ++++
> tools/testing/selftests/kdbus/test-fd.c | 710 +++++++
> tools/testing/selftests/kdbus/test-free.c | 36 +
> tools/testing/selftests/kdbus/test-match.c | 442 +++++
> tools/testing/selftests/kdbus/test-message.c | 658 +++++++
> tools/testing/selftests/kdbus/test-metadata-ns.c | 507 +++++
> tools/testing/selftests/kdbus/test-monitor.c | 158 ++
> tools/testing/selftests/kdbus/test-names.c | 184 ++
> tools/testing/selftests/kdbus/test-policy-ns.c | 633 +++++++
> tools/testing/selftests/kdbus/test-policy-priv.c | 1270 +++++++++++++
> tools/testing/selftests/kdbus/test-policy.c | 81 +
> tools/testing/selftests/kdbus/test-race.c | 313 +++
> tools/testing/selftests/kdbus/test-sync.c | 368 ++++
> tools/testing/selftests/kdbus/test-timeout.c | 99 +
> 77 files changed, 27818 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/kdbus.txt
> create mode 100644 include/uapi/linux/kdbus.h
> create mode 100644 ipc/kdbus/Makefile
> create mode 100644 ipc/kdbus/bus.c
> create mode 100644 ipc/kdbus/bus.h
> create mode 100644 ipc/kdbus/connection.c
> create mode 100644 ipc/kdbus/connection.h
> create mode 100644 ipc/kdbus/domain.c
> create mode 100644 ipc/kdbus/domain.h
> create mode 100644 ipc/kdbus/endpoint.c
> create mode 100644 ipc/kdbus/endpoint.h
> create mode 100644 ipc/kdbus/fs.c
> create mode 100644 ipc/kdbus/fs.h
> create mode 100644 ipc/kdbus/handle.c
> create mode 100644 ipc/kdbus/handle.h
> create mode 100644 ipc/kdbus/item.c
> create mode 100644 ipc/kdbus/item.h
> create mode 100644 ipc/kdbus/limits.h
> create mode 100644 ipc/kdbus/main.c
> create mode 100644 ipc/kdbus/match.c
> create mode 100644 ipc/kdbus/match.h
> create mode 100644 ipc/kdbus/message.c
> create mode 100644 ipc/kdbus/message.h
> create mode 100644 ipc/kdbus/metadata.c
> create mode 100644 ipc/kdbus/metadata.h
> create mode 100644 ipc/kdbus/names.c
> create mode 100644 ipc/kdbus/names.h
> create mode 100644 ipc/kdbus/node.c
> create mode 100644 ipc/kdbus/node.h
> create mode 100644 ipc/kdbus/notify.c
> create mode 100644 ipc/kdbus/notify.h
> create mode 100644 ipc/kdbus/policy.c
> create mode 100644 ipc/kdbus/policy.h
> create mode 100644 ipc/kdbus/pool.c
> create mode 100644 ipc/kdbus/pool.h
> create mode 100644 ipc/kdbus/queue.c
> create mode 100644 ipc/kdbus/queue.h
> create mode 100644 ipc/kdbus/reply.c
> create mode 100644 ipc/kdbus/reply.h
> create mode 100644 ipc/kdbus/util.c
> create mode 100644 ipc/kdbus/util.h
> create mode 100644 tools/testing/selftests/kdbus/.gitignore
> create mode 100644 tools/testing/selftests/kdbus/Makefile
> create mode 100644 tools/testing/selftests/kdbus/kdbus-enum.c
> create mode 100644 tools/testing/selftests/kdbus/kdbus-enum.h
> create mode 100644 tools/testing/selftests/kdbus/kdbus-test.c
> create mode 100644 tools/testing/selftests/kdbus/kdbus-test.h
> create mode 100644 tools/testing/selftests/kdbus/kdbus-util.c
> create mode 100644 tools/testing/selftests/kdbus/kdbus-util.h
> create mode 100644 tools/testing/selftests/kdbus/test-activator.c
> create mode 100644 tools/testing/selftests/kdbus/test-attach-flags.c
> create mode 100644 tools/testing/selftests/kdbus/test-benchmark.c
> create mode 100644 tools/testing/selftests/kdbus/test-bus.c
> create mode 100644 tools/testing/selftests/kdbus/test-chat.c
> create mode 100644 tools/testing/selftests/kdbus/test-connection.c
> create mode 100644 tools/testing/selftests/kdbus/test-daemon.c
> create mode 100644 tools/testing/selftests/kdbus/test-endpoint.c
> create mode 100644 tools/testing/selftests/kdbus/test-fd.c
> create mode 100644 tools/testing/selftests/kdbus/test-free.c
> create mode 100644 tools/testing/selftests/kdbus/test-match.c
> create mode 100644 tools/testing/selftests/kdbus/test-message.c
> create mode 100644 tools/testing/selftests/kdbus/test-metadata-ns.c
> create mode 100644 tools/testing/selftests/kdbus/test-monitor.c
> create mode 100644 tools/testing/selftests/kdbus/test-names.c
> create mode 100644 tools/testing/selftests/kdbus/test-policy-ns.c
> create mode 100644 tools/testing/selftests/kdbus/test-policy-priv.c
> create mode 100644 tools/testing/selftests/kdbus/test-policy.c
> create mode 100644 tools/testing/selftests/kdbus/test-race.c
> create mode 100644 tools/testing/selftests/kdbus/test-sync.c
> create mode 100644 tools/testing/selftests/kdbus/test-timeout.c
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: [PATCH 01/13] kdbus: add documentation
From: David Herrmann @ 2015-01-20 14:31 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
Linux API, linux-kernel, Daniel Mack, Djalal Harouni, Daniel Mack,
Johannes Stezenbach
In-Reply-To: <54BE5DC8.70706-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Hi Michael
On Tue, Jan 20, 2015 at 2:53 PM, Michael Kerrisk (man-pages)
<mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
>> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>>
>> kdbus is a system for low-latency, low-overhead, easy to use
>> interprocess communication (IPC).
>>
>> The interface to all functions in this driver is implemented via ioctls
>> on files exposed through a filesystem called 'kdbusfs'. The default
>> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
>> documentation about the kernel level API design.
>
> I have some details feedback on the contents of this file, and some
> bigger questions. I'll split them out into separate mails.
>
> So here, the bigger, general questions to start with. I've arrived late
> to this, so sorry if they've already been discussed, but the answers to
> some of the questions should actually be in this file, I would have
> expected.
>
> This is an enormous and complex API. Why is the API ioctl() based,
> rather than system-call-based? Have we learned nothing from the hydra
> that the futex() multiplexing syscall became? (And kdbus is an order
> of magnitude more complex, by the look of things.) At the very least,
> a *good* justification of why the API is ioctl()-based should be part
> of this documentation file.
>
> An observation: The documentation below is substantial, but this API is
> enormous, so the documentation still feels rather thin. What would
> really help would be some example code in the doc.
>
> And on the subject of code examples... Are there any (prototype)
> working user-space applications that exercise the current kdbus
> implementation? That is, can I install these kdbus patches, and
> then find a simple example application somewhere that does
> something to exercise kdbus?
If you run a 3.18 kernel, you can install kdbus.ko from our repository
and boot a full Fedora system running Gnome3 with kdbus, given that
you compiled systemd with --enable-kdbus (which is still
experimental). No legacy dbus1 daemon is running. Instead, we have a
bus-proxy that converts classic dbus1 to kdbus, so all
bus-communication runs on kdbus.
> And then: is there any substantial real-world application (e.g., a
> full D-Bus port) that is being developed in tandem with this kernel
> side patch? (I don't mean a user-space library; I mean a seriously
> large application.) This is an incredibly complex API whose
> failings are only going to become evident through real-world use.
> Solidifying an API in the kernel and then discovering the API
> problems later when writing real-world applications would make for
> a sad story. A story something like that of inotify, an API which
> is an order of magnitude less complex than kdbus. (I can't help but
> feel that many of inotify problems that I discuss at
> https://lwn.net/Articles/605128/ might have been fixed or mitigated
> if a few real-world applications had been implemented before the
> API was set in stone.)
I think running a whole Gnome3 stack counts as "substantial real-world
application", right? Sure, it's a dbus1-to-kdbus layer, but all the
systemd tools use kdbus natively and it works just fine. In fact, we
all run kdbus on our main-systems every day.
We've spent over a year fixing races and API misdesigns, we've talked
to other toolkit developers (glib, qt, ..) and made sure we're
backwards compatible to dbus1. I don't think the API is perfect,
everyone makes mistakes. But with bus-proxy and systemd we have two
huge users of kdbus that put a lot of pressure on API design.
>> +For a kdbus specific userspace library implementation please refer to:
>> + http://cgit.freedesktop.org/systemd/systemd/tree/src/systemd/sd-bus.h
>
> Is this library intended just for systemd? More generally, is there an
> intention to provide a general purpose library API for kdbus? Or is the
> intention that each application will roll a library suitable to its
> needs? I think an answer to that question would be useful in this
> Documentation file.
kdbus is in no way bound to systemd. There are ongoing efforts to port
glib and qt to kdbus natively. The API is pretty simple and I don't
see how a libkdbus would simplify things. In fact, even our tests only
have slim wrappers around the ioctls to simplify error-handling in
test-scenarios.
Note that most of the toolkit work is on the marshaling level, which
is independent of kdbus. kdbus just provides the transport level. DBus
is just one, yet significant, application-layer on top of kdbus. Our
test-cases use kdbus exclusively to transport raw byte streams.
Thanks
David
^ permalink raw reply
* Re: [PATCH 01/13] kdbus: add documentation
From: Josh Boyer @ 2015-01-20 14:42 UTC (permalink / raw)
To: David Herrmann
Cc: Michael Kerrisk (man-pages), Greg Kroah-Hartman, Arnd Bergmann,
Eric W. Biederman, One Thousand Gnomes, Tom Gundersen,
Jiri Kosina, Andy Lutomirski, Linux API, linux-kernel,
Djalal Harouni, Daniel Mack, Johannes Stezenbach
In-Reply-To: <CANq1E4SjfZOKqTsdkt519vKc1Poeah5McVJBb_spdHjbKv4=7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, Jan 20, 2015 at 9:31 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi Michael
>
> On Tue, Jan 20, 2015 at 2:53 PM, Michael Kerrisk (man-pages)
> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
>>> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>>>
>>> kdbus is a system for low-latency, low-overhead, easy to use
>>> interprocess communication (IPC).
>>>
>>> The interface to all functions in this driver is implemented via ioctls
>>> on files exposed through a filesystem called 'kdbusfs'. The default
>>> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
>>> documentation about the kernel level API design.
>>
>> I have some details feedback on the contents of this file, and some
>> bigger questions. I'll split them out into separate mails.
>>
>> So here, the bigger, general questions to start with. I've arrived late
>> to this, so sorry if they've already been discussed, but the answers to
>> some of the questions should actually be in this file, I would have
>> expected.
>>
>> This is an enormous and complex API. Why is the API ioctl() based,
>> rather than system-call-based? Have we learned nothing from the hydra
>> that the futex() multiplexing syscall became? (And kdbus is an order
>> of magnitude more complex, by the look of things.) At the very least,
>> a *good* justification of why the API is ioctl()-based should be part
>> of this documentation file.
>>
>> An observation: The documentation below is substantial, but this API is
>> enormous, so the documentation still feels rather thin. What would
>> really help would be some example code in the doc.
>>
>> And on the subject of code examples... Are there any (prototype)
>> working user-space applications that exercise the current kdbus
>> implementation? That is, can I install these kdbus patches, and
>> then find a simple example application somewhere that does
>> something to exercise kdbus?
>
> If you run a 3.18 kernel, you can install kdbus.ko from our repository
> and boot a full Fedora system running Gnome3 with kdbus, given that
> you compiled systemd with --enable-kdbus (which is still
> experimental). No legacy dbus1 daemon is running. Instead, we have a
> bus-proxy that converts classic dbus1 to kdbus, so all
> bus-communication runs on kdbus.
FWIW, we've been building a "playground" repository for the kernel
that contains this already for Fedora. If you have a stock Fedora 21
or rawhide install, you can use:
https://copr.fedoraproject.org/coprs/jwboyer/kernel-playground/
which has the kernel+kdbus and systemd built with --enable-kdbus
already. Easy enough to throw in a VM for testing.
josh
^ permalink raw reply
* Re: [PATCH 01/13] kdbus: add documentation
From: Djalal Harouni @ 2015-01-20 14:53 UTC (permalink / raw)
To: Josh Boyer
Cc: David Herrmann, Michael Kerrisk (man-pages), Greg Kroah-Hartman,
Arnd Bergmann, Eric W. Biederman, One Thousand Gnomes,
Tom Gundersen, Jiri Kosina, Andy Lutomirski, Linux API,
linux-kernel, Daniel Mack, Johannes Stezenbach
In-Reply-To: <CA+5PVA46wVa-L9K1zgHzt=Fwn6YFBYhxJUjiozSH4XBY03WMVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Hi,
On Tue, Jan 20, 2015 at 09:42:59AM -0500, Josh Boyer wrote:
> On Tue, Jan 20, 2015 at 9:31 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > Hi Michael
> >
> > On Tue, Jan 20, 2015 at 2:53 PM, Michael Kerrisk (man-pages)
> > <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> >>> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> >>>
> >>> kdbus is a system for low-latency, low-overhead, easy to use
> >>> interprocess communication (IPC).
> >>>
> >>> The interface to all functions in this driver is implemented via ioctls
> >>> on files exposed through a filesystem called 'kdbusfs'. The default
> >>> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
> >>> documentation about the kernel level API design.
> >>
> >> I have some details feedback on the contents of this file, and some
> >> bigger questions. I'll split them out into separate mails.
> >>
> >> So here, the bigger, general questions to start with. I've arrived late
> >> to this, so sorry if they've already been discussed, but the answers to
> >> some of the questions should actually be in this file, I would have
> >> expected.
> >>
> >> This is an enormous and complex API. Why is the API ioctl() based,
> >> rather than system-call-based? Have we learned nothing from the hydra
> >> that the futex() multiplexing syscall became? (And kdbus is an order
> >> of magnitude more complex, by the look of things.) At the very least,
> >> a *good* justification of why the API is ioctl()-based should be part
> >> of this documentation file.
> >>
> >> An observation: The documentation below is substantial, but this API is
> >> enormous, so the documentation still feels rather thin. What would
> >> really help would be some example code in the doc.
> >>
> >> And on the subject of code examples... Are there any (prototype)
> >> working user-space applications that exercise the current kdbus
> >> implementation? That is, can I install these kdbus patches, and
> >> then find a simple example application somewhere that does
> >> something to exercise kdbus?
> >
> > If you run a 3.18 kernel, you can install kdbus.ko from our repository
> > and boot a full Fedora system running Gnome3 with kdbus, given that
> > you compiled systemd with --enable-kdbus (which is still
> > experimental). No legacy dbus1 daemon is running. Instead, we have a
> > bus-proxy that converts classic dbus1 to kdbus, so all
> > bus-communication runs on kdbus.
>
> FWIW, we've been building a "playground" repository for the kernel
> that contains this already for Fedora. If you have a stock Fedora 21
> or rawhide install, you can use:
>
> https://copr.fedoraproject.org/coprs/jwboyer/kernel-playground/
>
> which has the kernel+kdbus and systemd built with --enable-kdbus
> already. Easy enough to throw in a VM for testing.
>
> josh
Yes thanks josh!
Another addition, if kdbus is installed and loaded, you could also use
systemd-nspawn to boot a full system (systemd compiled with
--enable-kdbus) in a container [1], kdbusfs will be mounted in the
container.
There is also the busctl tool to query kdbus...
http://www.freedesktop.org/wiki/Software/systemd/VirtualizedTesting/
--
Djalal Harouni
http://opendz.org
^ permalink raw reply
* Re: [PATCH 01/13] kdbus: add documentation
From: Johannes Stezenbach @ 2015-01-20 16:08 UTC (permalink / raw)
To: Djalal Harouni
Cc: Josh Boyer, David Herrmann, Michael Kerrisk (man-pages),
Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
Linux API, linux-kernel, Daniel Mack
In-Reply-To: <20150120145353.GA8124@dztty>
Hi all,
On Tue, Jan 20, 2015 at 03:53:53PM +0100, Djalal Harouni wrote:
> On Tue, Jan 20, 2015 at 09:42:59AM -0500, Josh Boyer wrote:
> > On Tue, Jan 20, 2015 at 9:31 AM, David Herrmann <dh.herrmann@gmail.com> wrote:
> > >
> > > If you run a 3.18 kernel, you can install kdbus.ko from our repository
> > > and boot a full Fedora system running Gnome3 with kdbus, given that
> > > you compiled systemd with --enable-kdbus (which is still
> > > experimental). No legacy dbus1 daemon is running. Instead, we have a
> > > bus-proxy that converts classic dbus1 to kdbus, so all
> > > bus-communication runs on kdbus.
> >
> > FWIW, we've been building a "playground" repository for the kernel
> > that contains this already for Fedora. If you have a stock Fedora 21
> > or rawhide install, you can use:
> >
> > https://copr.fedoraproject.org/coprs/jwboyer/kernel-playground/
> >
> > which has the kernel+kdbus and systemd built with --enable-kdbus
> > already. Easy enough to throw in a VM for testing.
>
> Another addition, if kdbus is installed and loaded, you could also use
> systemd-nspawn to boot a full system (systemd compiled with
> --enable-kdbus) in a container [1], kdbusfs will be mounted in the
> container.
>
> There is also the busctl tool to query kdbus...
>
> http://www.freedesktop.org/wiki/Software/systemd/VirtualizedTesting/
It is reassuring that kdbus actually works :)
However, let me repeat and rephrase my previous questions:
Is there a noticable or measurable improvement from using kdbus?
IOW, is the added complexity of kdbus worth the result?
I have stated my believe that current usage of D-Bus is not
performance sensitive and the number of messages exchanged
is low. I would love it if you would prove me wrong.
Or if you could show that any D-Bus related bug in Gnome3
is fixed by kdbus.
I would sooo love it if someone would finally post some
data that proves kdbus is useful beyond systemd.
Thanks,
Johannes
^ permalink raw reply
* [PATCH 01/05] fixup! virtio_pci: modern driver
From: Michael S. Tsirkin @ 2015-01-20 16:25 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-api, virtualization
In-Reply-To: <1421771093-1589-1-git-send-email-mst@redhat.com>
virtio_pci_modern: fix up vendor capability
Gerd Hoffmann noticed that we implemented
capability layout from an old draft.
Unfortunately the code was copied to host as well,
so we didn't notice.
Luckily we caught this in time.
This fixes commit "virtio_pci: modern driver"
and should be smashed with it.
Reported-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/virtio_pci.h | 7 ++++---
drivers/virtio/virtio_pci_modern.c | 23 +++++++++--------------
2 files changed, 13 insertions(+), 17 deletions(-)
diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 4e05423..a2b2e13 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -117,10 +117,11 @@ struct virtio_pci_cap {
__u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */
__u8 cap_next; /* Generic PCI field: next ptr. */
__u8 cap_len; /* Generic PCI field: capability length */
- __u8 type_and_bar; /* Upper 3 bits: bar.
- * Lower 3 is VIRTIO_PCI_CAP_*_CFG. */
+ __u8 cfg_type; /* Identifies the structure. */
+ __u8 bar; /* Where to find it. */
+ __u8 padding[3]; /* Pad to full dword. */
__le32 offset; /* Offset within bar. */
- __le32 length; /* Length. */
+ __le32 length; /* Length of the structure, in bytes. */
};
#define VIRTIO_PCI_CAP_BAR_SHIFT 5
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index e2f41c9..a3d8101 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -26,13 +26,13 @@ static void __iomem *map_capability(struct pci_dev *dev, int off,
u32 start, u32 size,
size_t *len)
{
- u8 type_and_bar, bar;
+ u8 bar;
u32 offset, length;
void __iomem *p;
pci_read_config_byte(dev, off + offsetof(struct virtio_pci_cap,
- type_and_bar),
- &type_and_bar);
+ bar),
+ &bar);
pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap, offset),
&offset);
pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap, length),
@@ -76,9 +76,6 @@ static void __iomem *map_capability(struct pci_dev *dev, int off,
if (len)
*len = length;
- bar = (type_and_bar >> VIRTIO_PCI_CAP_BAR_SHIFT) &
- VIRTIO_PCI_CAP_BAR_MASK;
-
if (minlen + offset < minlen ||
minlen + offset > pci_resource_len(dev, bar)) {
dev_err(&dev->dev,
@@ -438,15 +435,13 @@ static inline int virtio_pci_find_capability(struct pci_dev *dev, u8 cfg_type,
for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
pos > 0;
pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
- u8 type_and_bar, type, bar;
+ u8 type, bar;
pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
- type_and_bar),
- &type_and_bar);
-
- type = (type_and_bar >> VIRTIO_PCI_CAP_TYPE_SHIFT) &
- VIRTIO_PCI_CAP_TYPE_MASK;
- bar = (type_and_bar >> VIRTIO_PCI_CAP_BAR_SHIFT) &
- VIRTIO_PCI_CAP_BAR_MASK;
+ cfg_type),
+ &type);
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ bar),
+ &bar);
/* Ignore structures with reserved BAR values */
if (bar > 0x5)
--
MST
^ permalink raw reply related
* [PATCH 03/05] fixup! virtio_pci: macros for PCI layout offsets
From: Michael S. Tsirkin @ 2015-01-20 16:25 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-api, virtualization
In-Reply-To: <1421771093-1589-1-git-send-email-mst@redhat.com>
virtio_pci_modern: fix up vendor capability macros
Gerd Hoffmann noticed that we implemented
capability layout from an old draft.
Unfortunately the code was copied to host as well,
so we didn't notice.
Luckily we caught this in time.
This fixes commit "virtio_pci: macros for PCI layout offsets"
and should be smashed with it.
Reported-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/virtio_pci.h | 14 +++++---------
drivers/virtio/virtio_pci_modern.c | 6 ++++--
2 files changed, 9 insertions(+), 11 deletions(-)
diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 0911c62..3b7e4d2 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -124,11 +124,6 @@ struct virtio_pci_cap {
__le32 length; /* Length of the structure, in bytes. */
};
-#define VIRTIO_PCI_CAP_BAR_SHIFT 5
-#define VIRTIO_PCI_CAP_BAR_MASK 0x7
-#define VIRTIO_PCI_CAP_TYPE_SHIFT 0
-#define VIRTIO_PCI_CAP_TYPE_MASK 0x7
-
struct virtio_pci_notify_cap {
struct virtio_pci_cap cap;
__le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */
@@ -164,11 +159,12 @@ struct virtio_pci_common_cfg {
#define VIRTIO_PCI_CAP_VNDR 0
#define VIRTIO_PCI_CAP_NEXT 1
#define VIRTIO_PCI_CAP_LEN 2
-#define VIRTIO_PCI_CAP_TYPE_AND_BAR 3
-#define VIRTIO_PCI_CAP_OFFSET 4
-#define VIRTIO_PCI_CAP_LENGTH 8
+#define VIRTIO_PCI_CAP_CFG_TYPE 3
+#define VIRTIO_PCI_CAP_BAR 4
+#define VIRTIO_PCI_CAP_OFFSET 8
+#define VIRTIO_PCI_CAP_LENGTH 12
-#define VIRTIO_PCI_NOTIFY_CAP_MULT 12
+#define VIRTIO_PCI_NOTIFY_CAP_MULT 16
#define VIRTIO_PCI_COMMON_DFSELECT 0
#define VIRTIO_PCI_COMMON_DF 4
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index c86594e..b2e707ad 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -474,8 +474,10 @@ static inline void check_offsets(void)
offsetof(struct virtio_pci_cap, cap_next));
BUILD_BUG_ON(VIRTIO_PCI_CAP_LEN !=
offsetof(struct virtio_pci_cap, cap_len));
- BUILD_BUG_ON(VIRTIO_PCI_CAP_TYPE_AND_BAR !=
- offsetof(struct virtio_pci_cap, type_and_bar));
+ BUILD_BUG_ON(VIRTIO_PCI_CAP_CFG_TYPE !=
+ offsetof(struct virtio_pci_cap, cfg_type));
+ BUILD_BUG_ON(VIRTIO_PCI_CAP_BAR !=
+ offsetof(struct virtio_pci_cap, bar));
BUILD_BUG_ON(VIRTIO_PCI_CAP_OFFSET !=
offsetof(struct virtio_pci_cap, offset));
BUILD_BUG_ON(VIRTIO_PCI_CAP_LENGTH !=
--
MST
^ permalink raw reply related
* Re: [PATCH v5 0/5] power: Add max77693 charger driver
From: Sebastian Reichel @ 2015-01-20 16:33 UTC (permalink / raw)
To: Krzysztof Kozlowski
Cc: Dmitry Eremin-Solenikov, David Woodhouse, Samuel Ortiz, Lee Jones,
linux-kernel, linux-pm, devicetree, linux-api, Kyungmin Park,
Bartlomiej Zolnierkiewicz
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski@samsung.com>
[-- Attachment #1: Type: text/plain, Size: 406 bytes --]
Hi,
On Tue, Jan 20, 2015 at 11:00:51AM +0100, Krzysztof Kozlowski wrote:
> You already acked some earlier version of this patch. In that time
> there were more dependencies on MFD tree. Now the MFD patch (2/5)
> is acked by Lee Jones so could you pick up everything?
Thanks for the patchset, queued:
http://git.infradead.org/battery-2.6.git/commit/f8f847b51a0e1cb0c96e706d5da0ac507d96c605
-- Sebastian
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
^ permalink raw reply
* Re: [PATCH 01/13] kdbus: add documentation
From: David Herrmann @ 2015-01-20 17:00 UTC (permalink / raw)
To: Johannes Stezenbach
Cc: Djalal Harouni, Josh Boyer, Michael Kerrisk (man-pages),
Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
Linux API, linux-kernel, Daniel Mack
In-Reply-To: <20150120160842.GA9474-FF7aIK3TAVNeoWH0uzbU5w@public.gmane.org>
Hi
On Tue, Jan 20, 2015 at 5:08 PM, Johannes Stezenbach <js-FF7aIK3TAVNeoWH0uzbU5w@public.gmane.org> wrote:
> However, let me repeat and rephrase my previous questions:
> Is there a noticable or measurable improvement from using kdbus?
> IOW, is the added complexity of kdbus worth the result?
>
> I have stated my believe that current usage of D-Bus is not
> performance sensitive and the number of messages exchanged
> is low. I would love it if you would prove me wrong.
> Or if you could show that any D-Bus related bug in Gnome3
> is fixed by kdbus.
DBus is not used for performance sensitive applications because DBus
is slow. We want to make it fast so we can finally use it for
low-latency, high-throughput applications. A simple DBus
method-call+reply takes 200us here, with kdbus it takes 8us (with UDS
about 2us). If I increase the packet size from 8k to 128k, kdbus even
tops UDS thanks to single-copy transfers.
The fact that there is no performance-critical application using DBus
is, imho, an argument *pro* kdbus. People haven't been capable of
making classic dbus1 fast enough for low-latency audio, thus, we
present kdbus.
Starting up 'gdm' sends ~5k dbus messages on my machine. It takes >1s
to transmit the messages alone. Each dbus1 message has to be copied 4
times for each direction. With kdbus, each message is copied only once
for each transmission (or not at all, if you use memfds, though that
doesn't mean it's necessarily faster). No intermediate context-switch
is needed. This makes kdbus capable to transmit low-latency audio data
*inline*.
DBus marshaling is the de-facto standard in all major(!) linux desktop
systems. It is well established and accepted by many DEs. It also
solves many other problems, including: policy,
authentication/authorization, well-known name registry, efficient
broadcasts/multicasts, peer discovery, bus discovery, metadata
transmission, and more.
It is a shame that we cannot use this well-established protocol for
low-latency applications. We, effectively, have to duplicate all this
code on custom UDS and other transports just because DBus is too slow.
kdbus tries to unify those efforts, so that we don't need multiple
policy implementations, name registries and peer discovery mechanisms.
Furthermore, kdbus implements comprehensive, yet optional, metadata
transmission that allows to identify and authenticate peers in a
race-free manner (which is *not* possible with UDS).
Also, kdbus provides a single transport bus with sequential message
numbering. If you use multiple channels, you cannot give any ordering
guarantees across peers (for instance, regarding parallel
name-registry changes).
Given these theoretical advantages, here're some examples:
*) The Tizen developers have been complaining about the high latency
of DBus for polkit'ish policy queries. That's why their authentication
framework uses custom UDS sockets (called 'Cynara'). If a
UI-interaction needs multiple authentication-queries, you don't want
it to take multiple milliseconds, given that you usually want to
render the result in the same frame.
*) PulseAudio doesn't use DBus for data transmission. They had to
implement their own marshaling code, transport layer and so on, just
because DBus1-latency is horrible. With kdbus, we can basically drop
this code-duplication and unify the IPC layer. Same is true for
Wayland, btw.
*) By moving broadcast-transmission into the kernel, we can use the
time-slices of the sender to perform heavy operations. This is also
true for policy decisions, etc. With a userspace daemon, we cannot
perform operations in a time-slice of the caller. This makes DoS
attacks much harder.
*) With priority-inheritance, we can do synchronous calls into trusted
peers and let them optionally use our time-slice to perform the
action. This allows syscall-like/binder-like method-calls into other
processes. Without priority-inheritance, this is not possible in a
secure manner (see 'priority-inheritance').
*) Logging-daemons often want to attach metadata to log-messages so
debugging/filtering gets easier. If short-lived programs send
log-messages, the destination peer might not be able to read such
metadata from /proc, as the process might no longer be available at
that time. Same is true for policy-decisions like polkit does. You
cannot send off method-calls and exit. You have to wait for a reply,
even though you might not even care for it. If you don't wait, the
other side might not be able to verify your identity and as such
reject the request.
*) Even though the dbus traffic on idle-systems might be low, this
doesn't mean it's not significant at boot-times or under high-load. If
you run a dbus-monitor of your choice, you will see there is an
significant number of messages exchanged during VT-switches, startup,
shutdown, suspend, wakeup, hotplugging and similar situations where
lots of control-messages are exchanged. We don't want to spend
hundreds of ms just to transmit those messages.
*) dbus-daemon is not available during early-boot or shutdown.
These are just examples off the top of my head, but I think they're
already pretty convincing.
David
^ permalink raw reply
* Re: [PATCH 01/13] kdbus: add documentation
From: Daniel Mack @ 2015-01-20 17:50 UTC (permalink / raw)
To: Michael Kerrisk (man-pages), Greg Kroah-Hartman, arnd, ebiederm,
gnomes, teg, jkosina, luto, linux-api, linux-kernel
Cc: daniel, dh.herrmann, tixxdz
In-Reply-To: <54BE5F1F.8070703@gmail.com>
Hi Michael,
Thanks a lot for for intense review of the documentation. Much appreciated.
I've addressed all but the below issues, following your suggestions.
On 01/20/2015 02:58 PM, Michael Kerrisk (man-pages) wrote:
>> + and KDBUS_CMD_ENDPOINT_MAKE (see above).
>> +
>> + Following items are expected for KDBUS_CMD_BUS_MAKE:
>> + KDBUS_ITEM_MAKE_NAME
>> + Contains a string to identify the bus name.
>
> So, up to here, I've seen no definition of 'kdbus_item', which leaves me
> asking questions like: what subfield is KDBUS_ITEM_MAKE_NAME stored in?
> which subfield holds the pointer to the string?
>
> Somewhere earlier, kdbus_item needs to be exaplained in more detail,
> I think.
Hmm, you're quoting text from section 5, and section 4 actually
describes the concept of items quite well I believe?
>> + __s64 priority;
>> + With KDBUS_RECV_USE_PRIORITY set in flags, receive the next message in
>> + the queue with at least the given priority. If no such message is waiting
>> + in the queue, -ENOMSG is returned.
>
> ###
> How do I simply select the highest priority message, without knowing what
> its priority is?
The wording is indeed unclear here. KDBUS_RECV_USE_PRIORITY causes the
messages to be dequeued by their priority. The 'priority' field is
simply a filter that request a minimum priority. By setting this field
to the highest possible value, you effectively bypass the filter. I've
added a better description now.
>> + -ENOMEM The kernel memory is exhausted
>> + -ENOTTY Illegal ioctl command issued for the file descriptor
>
> Why ENOTTY here, rather than EINVAL? The latter is, I beleive, the usual
> ioctl() error for invalid commands, I believe (If you keep ENOTTY, add an
> explanation in this document.)
Hmm, no. -ENOTTY is commonly used as return code when calling ioctls
that can't be handled by the FDs they're called on. 'man errno(3)' even
states: "ENOTTY Inappropriate I/O control operation (POSIX.1)".
>> + -EINVAL Unsupported item attached to command
>> +
>> +For all ioctls that carry a struct as payload:
>> +
>> + -EFAULT The supplied data pointer was not 64-bit aligned, or was
>> + inaccessible from the kernel side.
>> + -EINVAL The size inside the supplied struct was smaller than expected
>> + -EMSGSIZE The size inside the supplied struct was bigger than expected
>
> Why two different errors for smaller and larger than expected? (If you keep things this
> way, pelase explain the reason in this document.)
Providing a struct that is smaller than the minimum doesn't give the
ioctl a valid set of information to process the request. Hence, I think
-EINVAL is appropriate. However, -EMSGSIZE is something that users might
hit when they make message payloads too big, and I think it's good to
have a change to distinguish the two cases in error handling. I added
something in the document now.
Again, thanks a lot for reading the documentation so accurately!
Daniel
^ permalink raw reply
* Re: [PATCH 01/13] kdbus: add documentation
From: Daniel Mack @ 2015-01-20 18:23 UTC (permalink / raw)
To: Michael Kerrisk (man-pages), Greg Kroah-Hartman, arnd-r2nGTMty4D4,
ebiederm-aS9lmoZGLiVWk0Htik3J/w,
gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: daniel-cYrQPVfZooxQFI55V6+gNQ, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
tixxdz-Umm1ozX2/EEdnm+yROfE0A, Johannes Stezenbach
In-Reply-To: <54BE5DC8.70706-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
On 01/20/2015 02:53 PM, Michael Kerrisk (man-pages) wrote:
> This is an enormous and complex API. Why is the API ioctl() based,
> rather than system-call-based? Have we learned nothing from the hydra
> that the futex() multiplexing syscall became? (And kdbus is an order
> of magnitude more complex, by the look of things.) At the very least,
> a *good* justification of why the API is ioctl()-based should be part
> of this documentation file.
I think the simplest reason is because we want to be able to build kdbus
as a module. It's rather an optional driver than a core kernel feature.
IMO, kernel primitives should be syscalls, but kdbus is not a primitive
but an elaborate subsystem.
Also, the context the kdbus commands operate on originate from a
mountable special-purpose file system. Hence, we decided not to use a
global kernel interface but specific ioctls on the nodes exposed by kdbusfs.
Thanks,
Daniel
^ permalink raw reply
* Re: [PATCH v5 5/5] tty/serial: Add Spreadtrum sc9836-uart driver support
From: Rob Herring @ 2015-01-20 20:17 UTC (permalink / raw)
To: Lyra Zhang
Cc: Chunyan Zhang, Greg Kroah-Hartman, Mark Rutland, Arnd Bergmann,
One Thousand Gnomes, Mark Brown, Rob Herring, Pawel Moll,
Ian Campbell, Kumar Gala, Will Deacon, Catalin Marinas,
Jiri Slaby, Jason Cooper, Heiko Stübner, Florian Vaussard,
Andrew Lunn, Robert Richter, Hayato Suzuki, Grant Likely,
Antony Pavlov, Joel Schopp
In-Reply-To: <CAAfSe-tAwURc_P+-0+m22ao9r+Fud6Ae89JF8FGsWgg_49Mdhg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, Jan 20, 2015 at 1:37 AM, Lyra Zhang <zhang.lyra-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi, Rob
>
> I still have a question to be conform, specific describes below:
>
> On Mon, Jan 19, 2015 at 10:11 PM, Rob Herring <robherring2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Mon, Jan 19, 2015 at 3:55 AM, Lyra Zhang <zhang.lyra-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> On Sat, Jan 17, 2015 at 12:41 AM, Rob Herring <robherring2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> On Fri, Jan 16, 2015 at 4:00 AM, Chunyan Zhang
>>>> <chunyan.zhang-lxIno14LUO0EEoCn2XhGlw@public.gmane.org> wrote:
>>>>> Add a full sc9836-uart driver for SC9836 SoC which is based on the
>>>>> spreadtrum sharkl64 platform.
>>>>> This driver also support earlycon.
>>>>> This patch also replaced the spaces between the macros and their
>>>>> values with the tabs in serial_core.h
>>
>> [...]
>>
>>>>> +static int __init sprd_serial_init(void)
>>>>> +{
>>>>> + int ret = 0;
>>>>> +
>>>>> + ret = uart_register_driver(&sprd_uart_driver);
>>>>
>>>> This can be done in probe now. Then you can use module_platform_driver().
>>>>
>>>
>>> Question:
>>> 1. there are 4 uart ports configured in dt for sprd_serial, so probe
>>> will be called 4 times, but uart_register_driver only needs to be
>>> called one time, so can we use uart_driver.state to check if
>>> uart_register_driver has already been called ?
>>
>> Yes.
>>
>>> 2. if I use module_platform_driver() instead of
>>> module_init(sprd_serial_init) and module_exit(sprd_serial_exit) , I
>>> must move uart_unregister_driver() which is now processed in
>>> sprd_serial_exit() to sprd_remove(), there is a similar problem with
>>> probe(), sprd_remove() will also be called 4 times, and actually it
>>> should be called only one time. How can we deal with this case?
>>
>> Look at pl01x or Samsung UART drivers which have done this conversion.
>
> Samsung UART does use module_platform_driver, but pl010/pl011 doesn't.
That is because pl011 is an amba_device rather than a platform_device
and there is not an equivalent macro for this boilerplate (although I
think one has been posted recently).
> In the Samsung UART driver, uart_unregister_driver is processed in
> remove(), like below:
>
> static int s3c24xx_serial_remove(struct platform_device *dev)
> {
> struct uart_port *port = s3c24xx_dev_to_port(&dev->dev);
>
> if (port) {
> s3c24xx_serial_cpufreq_deregister(to_ourport(port));
> uart_remove_one_port(&s3c24xx_uart_drv, port);
> }
>
> uart_unregister_driver(&s3c24xx_uart_drv);
> }
>
> if this serial has more than one ports, uart_unregister_driver() must
> be called multiple times when the device need to be removed.
> I think there may be a problem because that uart_unregister_driver()
> will do kfree(drv->state) every time when it's called.
Yes. Looks like a bug still in the Samsung driver. So follow what the
pl01x driver does. It had the same problem, but there is a follow-up
commit to fix it.
Rob
^ permalink raw reply
* Re: Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
From: Alexei Starovoitov @ 2015-01-20 20:33 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, Linux API,
Network Development, LKML, yrl.pp-manager.tt@hitachi.com,
Jovi Zhangwei
On Tue, Jan 20, 2015 at 3:57 AM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
>
> Ok, BTW, would you think is it possible to use a reusable small scratchpad
> memory for passing arguments? (just a thought)
sure. doable, but what's the use case?
>> It's not usable for high frequency events which
>> need this in-kernel aggregation.
>> If events are rare, then just dumping everything
>> into trace buffer is just fine. No in-kernel program is needed.
>
> Hmm, let me ensure your point, the performance number is the reason why
> we need to do it in the kernel, right? Not mainly for the flexibility but speed.
if user space can do X at the same speed as kernel,
then user space is a better choice and more flexible.
In case of bpf programs two things user space cannot do:
- fast aggregation without adding penalty to things being traced
- access to in-kernel data structures
And often both used together.
Say, we want to monitor amount of network traffic per user.
So we'd use trace_net_dev_xmit() tracepoint and do
map[current_uid()] += skb_len
as part of the program.
Overhead will be tiny and users won't notice any slowdown.
Trying to do the same in user space by enabling
this tracepoint has two problems:
high overhead and events are hard to aggregate
per user, since trace has 'pid', but short lived
processes will have dead pids in trace output.
> - perf probe and kprobe-event gives us a complete understandable
> interface for what will be recorded at where.
> (we can see the event definitions via kprobe_events interface,
> without any tools)
> - kprobe-event gives a completely same interface as other tracepoint
> events.
> - it also doesn't require any build-binary parts :) nor special tools.
> We can play with ftrace on just a small busybox.
yeah, when debugging in busybox is the goal
and 'cat' and 'echo' are your only tools, then
debugfs interface is the only choice :)
> However, this does NOT interfere your patch upstreaming. I just said current
> ftrace method is also meaningful for some reasons :)
of course :)
To emphasize the point I was trying to make with tracex1:
The program is a filter/aggregator. The bpf maps
are not suitable for streaming the events. That's the job
of ring buffer/trace_pipe. The program may choose
to aggregate some events and discard them (by
returning 0 from the program), and the rest of
the events will be streamed to user space via
ring buffer in the format statically defined by tracepoint
or by kprobe arguments.
The tracex1 example loads the program and then
reads /sys/kernel/debugfs/tracing/trace_pipe...
That part I was trying to improve with bpf_trace_printk:
to give ability to programs to stream data in a format
different from the one statically defined by tracepoints.
But trace_printk has its disadvantages, so probably
something cleaner is needed.
Like in my earlier example of trace_net_dev_xmit,
if the program could add printing of uid to arguments
already printed, it would have helped user space.
> By the way, I concern about that bpf compiler can become another systemtap,
> especially if you build it on llvm.
> Would you plan to develop it on kernel
> tree? or apart from the kernel-side development?
I'm not sure I completely understand the concern.
perf is using a bunch of out-of-tree libraries.
mcjit of llvm or libgccjit are another libraries.
Or may be eventually eBPF can be generated
by something like libpcap.
Ideally I would like to see 'perf run script.txt'
where script.txt is a program in a language suited
for tracing. The tracing language not necessary
will fit networking use cases. Currently I'm
using C for both and it's the most convenient,
but some folks complained that 'restricted'
nature of this C is hard to grasp, so I can only
encourage Jovi to do ktap language to bpf
translator. If it generates bpf directly that's great,
if it uses gcc or llvm backend that's fine too.
> I think it is hard to sync the development if you do it out-of-tree.
I think some pieces would have to be out of tree.
I've kept standalone llvm backend across 3.2, 3.3 and 3.4
but it gets polluted with ifdefs and not really a long term
solution, so now I'm working on upstreaming it
and feedback/codereviews I got, definitely improved
the quality of the bpf backend.
In case of backends the only bit to sync is instruction
set itself, which is stable. New instructions may be
added, but that's not a concern.
llvm backend doesn't care what language is
used in front-end or how programs are attached
to tracepoints or what set of bpf helper
functions is available.
All such bits and the main interface for
dynamic tracer, imo, should be in perf binary.
What it does underneath and how
many times it calls into llvm/gcc lib, won't be visible.
In case of systemtap compile time, for whatever
reason, is slow to the point of being annoying,
but here it should be instant.
^ permalink raw reply
* Re: [PATCH 01/13] kdbus: add documentation
From: Johannes Stezenbach @ 2015-01-20 22:00 UTC (permalink / raw)
To: David Herrmann
Cc: Djalal Harouni, Josh Boyer, Michael Kerrisk (man-pages),
Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
Linux API, linux-kernel, Daniel Mack
In-Reply-To: <CANq1E4Q9m7JRXNYFY4okh3brM-kHVoVGOdF_SWjNoNURX930HQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Hi David,
On Tue, Jan 20, 2015 at 06:00:28PM +0100, David Herrmann wrote:
[big snip]
> These are just examples off the top of my head, but I think they're
> already pretty convincing.
Thank you for writing this up. This is the information I was
looking for which puts kdbus into context and explains
the motivation for its development. Naturally I don't agree
with all of it, but I'm content with what I learned so far.
Daniel informed me off-list that he (and probably others) does
not understand what my questions were aiming at. I'm sorry
about that, I thought it was clear I was just lacking
the background information to understand what kdbus is and
what it is not, and why it exists -- information I couldn't find
in some hours of googling.
Thanks,
Johannes
^ permalink raw reply
* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Andy Lutomirski @ 2015-01-20 22:40 UTC (permalink / raw)
To: Fam Zheng
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Peter Zijlstra <pe>
In-Reply-To: <1421747878-30744-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> This adds a new system call, epoll_mod_wait. It's described as below:
>
> NAME
> epoll_mod_wait - modify and wait for I/O events on an epoll file
> descriptor
>
> SYNOPSIS
>
> int epoll_mod_wait(int epfd, int flags,
> int ncmds, struct epoll_mod_cmd *cmds,
> struct epoll_wait_spec *spec);
>
> DESCRIPTION
>
> The epoll_mod_wait() system call can be seen as an enhanced combination
> of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> call. It is superior in two cases:
>
> 1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> will save context switches between user mode and kernel mode;
>
> 2) When you need higher precision than microsecond for wait timeout.
>
> The epoll_ctl(2) operations are embedded into this call by with ncmds
> and cmds. The latter is an array of command structs:
>
> struct epoll_mod_cmd {
>
> /* Reserved flags for future extension, must be 0 for now. */
> int flags;
>
> /* The same as epoll_ctl() op parameter. */
> int op;
>
> /* The same as epoll_ctl() fd parameter. */
> int fd;
>
> /* The same as the "events" field in struct epoll_event. */
> uint32_t events;
>
> /* The same as the "data" field in struct epoll_event. */
> uint64_t data;
>
> /* Output field, will be set to the return code once this
> * command is executed by kernel */
> int error;
> };
I would add an extra u32 at the end so that the structure size will be
a multiple of 8 bytes on all platforms.
>
> There is no guartantee that all the commands are executed in order. Only
> if all the commands are successfully executed (all the error fields are
> set to 0), events are polled.
If this doesn't happen, what error is returned?
> struct epoll_wait_spec {
>
> /* The same as "maxevents" in epoll_pwait() */
> int maxevents;
>
> /* The same as "events" in epoll_pwait() */
> struct epoll_event *events;
>
> /* Which clock to use for timeout */
> int clockid;
>
> /* Maximum time to wait if there is no event */
> struct timespec timeout;
>
> /* The same as "sigmask" in epoll_pwait() */
> sigset_t *sigmask;
>
> /* The same as "sigsetsize" in epoll_pwait() */
> size_t sigsetsize;
> } EPOLL_PACKED;
I think the convention is to align the structure's fields manually
rather than declaring it to be packed.
>
> RETURN VALUE
>
> When any error occurs, epoll_mod_wait() returns -1 and errno is set
> appropriately. All the "error" fields in cmds are unchanged before they
> are executed, and if any cmds are executed, the "error" fields are set
> to a return code accordingly. See also epoll_ctl for more details of the
> return code.
Does this mean that callers should initialize the error fields to an
impossible value first so they can tell which commands were executed?
>
> When successful, epoll_mod_wait() returns the number of file
> descriptors ready for the requested I/O, or zero if no file descriptor
> became ready during the requested timeout milliseconds.
>
> If spec is NULL, it returns 0 if all the commands are successful, and -1
> if an error occured.
>
> ERRORS
>
> These errors apply on either the return value of epoll_mod_wait or error
> status for each command, respectively.
Please clarify which errors are returned overall and which are per-command.
Thanks,
Andy
^ permalink raw reply
* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: josh-iaAMLnmF4UmaiuxdJuQwMA @ 2015-01-20 23:03 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Fam Zheng, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers
In-Reply-To: <CALCETrU4TeG1ShVLkQgqQ6usFm8pg_t0D8K=Mi_UJGSfxUwXtA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, Jan 20, 2015 at 02:40:32PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > This adds a new system call, epoll_mod_wait. It's described as below:
> >
> > NAME
> > epoll_mod_wait - modify and wait for I/O events on an epoll file
> > descriptor
> >
> > SYNOPSIS
> >
> > int epoll_mod_wait(int epfd, int flags,
> > int ncmds, struct epoll_mod_cmd *cmds,
> > struct epoll_wait_spec *spec);
> >
> > DESCRIPTION
> >
> > The epoll_mod_wait() system call can be seen as an enhanced combination
> > of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> > call. It is superior in two cases:
> >
> > 1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> > will save context switches between user mode and kernel mode;
> >
> > 2) When you need higher precision than microsecond for wait timeout.
> >
> > The epoll_ctl(2) operations are embedded into this call by with ncmds
> > and cmds. The latter is an array of command structs:
> >
> > struct epoll_mod_cmd {
> >
> > /* Reserved flags for future extension, must be 0 for now. */
> > int flags;
> >
> > /* The same as epoll_ctl() op parameter. */
> > int op;
> >
> > /* The same as epoll_ctl() fd parameter. */
> > int fd;
> >
> > /* The same as the "events" field in struct epoll_event. */
> > uint32_t events;
> >
> > /* The same as the "data" field in struct epoll_event. */
> > uint64_t data;
> >
> > /* Output field, will be set to the return code once this
> > * command is executed by kernel */
> > int error;
> > };
>
> I would add an extra u32 at the end so that the structure size will be
> a multiple of 8 bytes on all platforms.
*shrug*, but if you do so, enforce that it has a value of 0 or return
-EINVAL, just like a flags field. Alternatively, move the last field
earlier and make flags a uint64_t.
- Josh Triplett
^ permalink raw reply
* [PATCH v11 0/2] crypto: AF_ALG: add AEAD and RNG support
From: Stephan Mueller @ 2015-01-21 1:18 UTC (permalink / raw)
To: 'Herbert Xu'
Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
linux-crypto, linux-api, Neil Horman
Hi,
This patch set adds AEAD and RNG support to the AF_ALG interface
exported by the kernel crypto API. By extending AF_ALG with AEAD and RNG
support, all cipher types the kernel crypto API allows access to are
now accessible from userspace.
Both, AEAD and RNG implementations are stand-alone and do not depend
other AF_ALG interfaces (like hash or skcipher).
The AEAD implementation uses the same approach as provided with
skcipher by offering the following interfaces:
* sendmsg and recvmsg interfaces allowing multiple
invocations supporting a threaded user space. To support
multi-threaded user space, kernel-side buffering
is implemented similarly to skcipher.
* splice / vmsplice interfaces allowing a zero-copy
invocation
The RNG interface only implements the recvmsg interface as
zero-copy is not applicable.
The new AEAD and RNG interfaces are fully tested with the test application
provided at [1]. That test application exercises all newly added user space
interfaces. The testing covers:
* use of the sendmsg/recvmsg interface
* use of the splice / vmsplice interface
* invocation of all AF_ALG types (aead, rng, skcipher, hash)
* using all types of operation (encryption, decryption, keyed MD,
MD, random numbers, AEAD decryption with positive and negative
authentication verification)
* stress testing by running all tests for 30 minutes in an
endless loop
* test execution on 64 bit and 32 bit
[1] http://www.chronox.de/libkcapi.html
Changes v2:
* rebase to current cryptodev-2.6 tree
* use memzero_explicit to zeroize AEAD associated data
* use sizeof for determining length of AEAD associated data
* update algif_rng.c covering all suggestions from Daniel Borkmann
<dborkman@redhat.com>
* addition of patch 9: add digestsize interface for hashes
* addition of patch to update documentation covering the userspace interface
* change numbers of getsockopt options: separate them from sendmsg interface
definitions
Changes v3:
* remove getsockopt interface
* AEAD: associated data is set prepended to the plain/ciphertext
* AEAD: allowing arbitrary associated data lengths
* remove setkey patch as protection was already in the existing code
Changes v4:
* stand-alone implementation of AEAD
* testing of all interfaces offered by AEAD
* stress testing of AEAD and RNG
Changes v5:
* AEAD: add outer while(size) loop in aead_sendmsg to ensure all data is
copied into the kernel (reporter Herbert Xu)
* AEAD: aead_sendmsg bug fix: change size -= len; to size -= plen;
* AF_ALG / AEAD: add aead_setauthsize and associated extension to
struct af_alg_type as well as alg_setsockopt (reporter Herbert Xu)
* RNG: rng_recvmsg: use 128 byte stack variable for output of RNG instead
of ctx->result (reporter Herbert Xu)
* RNG / AF_ALG: allow user space to seed RNG via setsockopt
* RNG: rng_recvmsg bug fix: use genlen as result variable for
crypto_rng_get_bytes as previously no negative errors were obtained
* AF_ALG: alg_setop: zeroize buffer before free
Changes v6:
* AEAD/RNG: port to 3.19-rc1 with the iov_iter handling
* RNG: use the setkey interface to obtain the seed and drop the patch adding
a separate reseeding interface
* extract the zeroization patch for alg_setkey into a stand-alone patch
submission
* fix bug in aead_sufficient_data (reporter Herbert Xu)
* testing of all interfaces with test application provided with libkcapi version
0.6.2
Changes v7:
* AEAD: aead_recvmsg: change error code from ENOMEM to EINVAL
* AEAD: drop aead_readable/aead_sufficient_data and only use ctx->more to decide
whether the read side shall become active. This change requires that the
patch for crypto_aead_decrypt ensuring that the ciphertext contains the
authentication tag was added -- see https://lkml.org/lkml/2014/12/30/200.
Otherwise, user space can trigger a kernel crash.
* RNG: patch dropped as it was applied
* AEAD: port Kconfig/Makefile patch forward to current code base
Changes v8:
* Removed check for aead_assoclen in aead_sendmsg
* Fix endless loop bug in aead_sendmsg (check for sgl->cur > ALG_MAX_PAGES in
while condition removed -- this condition is checked within the loop already)
* Resurrect aead_sufficient_data and call it in aead_sendmsg, aead_sendpage to
notify caller about wrong invocation
* Re-add aead_sufficient_data to aead_recvmsg to verify user input data before
using them to ensure the kernel protects against malicious parameters
* Allow arbitrary size of AD (i.e. up to the maximum buffer size of
ALG_MAX_PAGES)
* When aead_recvmsg receives an error from decryption, release all pages if the
error is EBADMSG -- this error implies that a proper decryption was performed
but the integrity of the message is lost. This error is considered to be a
valid decryption result.
* Add test cases for sendmsg and splice interface to test large AD sizes (in
case of sendmsg, use 65504 bytes AD and 32 bytes plaintext; in case of splice
use 15 pages AD and 32 bytes in the 16th page for plaintext). See [1] for
updated test case.
Changes v9:
* if socket is not writable during sendmsg/sendpage due to insufficient memory
and a recvmsg operation is forced, inform userspace about truncated operation
via MSG_TRUNC
* use -EMSGSIZE in case insufficient data was provided in sendmsg/sendpage
* release all buffers in case insufficient data was provided in sendmsg/sendpage
* bug fix in sendmsg: when a new page is allocated, reset sg->offset to 0 --
the error is visible with the new tests in [1] when using the -d flag
with the test application
Changes v10:
* initialize ctx->trunc in aead_accept_parent to zero
* fix one line with code formatting problems
Changes v11:
* return an error if user space sends too much data instead of waiting until
reader side catches up with operation (suggested by Herbert Xu)
* remove now unneeded aead_wait_for_wmem service function
* remove now unneeded ctx->trunc and MSG_TRUNC error return
Stephan Mueller (2):
crypto: AF_ALG: add AEAD support
crypto: AF_ALG: enable AEAD interface compilation
crypto/Kconfig | 9 +
crypto/Makefile | 1 +
crypto/algif_aead.c | 638 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 648 insertions(+)
create mode 100644 crypto/algif_aead.c
--
2.1.0
^ permalink raw reply
* [PATCH v11 1/2] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2015-01-21 1:19 UTC (permalink / raw)
To: 'Herbert Xu'
Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
linux-crypto-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, Neil Horman
In-Reply-To: <1923793.K38mGRD6eo-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>
This patch adds the AEAD support for AF_ALG.
The implementation is based on algif_skcipher, but contains heavy
modifications to streamline the interface for AEAD uses.
To use AEAD, the user space consumer has to use the salg_type named
"aead".
The AEAD implementation includes some overhead to calculate the size of
the ciphertext, because the AEAD implementation of the kernel crypto API
makes implied assumption on the location of the authentication tag. When
performing an encryption, the tag will be added to the created
ciphertext (note, the tag is placed adjacent to the ciphertext). For
decryption, the caller must hand in the ciphertext with the tag appended
to the ciphertext. Therefore, the selection of the used memory
needs to add/subtract the tag size from the source/destination buffers
depending on the encryption type. The code is provided with comments
explaining when and how that operation is performed.
A fully working example using all aspects of AEAD is provided at
http://www.chronox.de/libkcapi.html
Signed-off-by: Stephan Mueller <smueller-T9tCv8IpfcWELgA04lAiVw@public.gmane.org>
---
crypto/algif_aead.c | 638 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 638 insertions(+)
create mode 100644 crypto/algif_aead.c
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
new file mode 100644
index 0000000..5a667fb
--- /dev/null
+++ b/crypto/algif_aead.c
@@ -0,0 +1,638 @@
+/*
+ * algif_aead: User-space interface for AEAD algorithms
+ *
+ * Copyright (C) 2014, Stephan Mueller <smueller-T9tCv8IpfcWELgA04lAiVw@public.gmane.org>
+ *
+ * This file provides the user-space API for AEAD ciphers.
+ *
+ * This file is derived from algif_skcipher.c.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <crypto/scatterwalk.h>
+#include <crypto/if_alg.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <net/sock.h>
+
+struct aead_sg_list {
+ unsigned int cur;
+ struct scatterlist sg[ALG_MAX_PAGES];
+};
+
+struct aead_ctx {
+ struct aead_sg_list tsgl;
+ struct af_alg_sgl rsgl;
+
+ void *iv;
+
+ struct af_alg_completion completion;
+
+ unsigned long used;
+
+ unsigned int len;
+ bool more;
+ bool merge;
+ bool enc;
+
+ size_t aead_assoclen;
+ struct aead_request aead_req;
+};
+
+static inline int aead_sndbuf(struct sock *sk)
+{
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+
+ return max_t(int, max_t(int, sk->sk_sndbuf & PAGE_MASK, PAGE_SIZE) -
+ ctx->used, 0);
+}
+
+static inline bool aead_writable(struct sock *sk)
+{
+ return PAGE_SIZE <= aead_sndbuf(sk);
+}
+
+static inline bool aead_sufficient_data(struct aead_ctx *ctx)
+{
+ unsigned as = crypto_aead_authsize(crypto_aead_reqtfm(&ctx->aead_req));
+
+ return (ctx->used >= (ctx->aead_assoclen + (ctx->enc ? 0 : as)));
+}
+
+static void aead_put_sgl(struct sock *sk)
+{
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ struct aead_sg_list *sgl = &ctx->tsgl;
+ struct scatterlist *sg = sgl->sg;
+ unsigned int i;
+
+ for (i = 0; i < sgl->cur; i++) {
+ if (!sg_page(sg + i))
+ continue;
+
+ put_page(sg_page(sg + i));
+ sg_assign_page(sg + i, NULL);
+ }
+ sgl->cur = 0;
+ ctx->used = 0;
+ ctx->more = 0;
+ ctx->merge = 0;
+}
+
+static void aead_wmem_wakeup(struct sock *sk)
+{
+ struct socket_wq *wq;
+
+ if (!aead_writable(sk))
+ return;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
+ POLLRDNORM |
+ POLLRDBAND);
+ sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+ rcu_read_unlock();
+}
+
+static int aead_wait_for_data(struct sock *sk, unsigned flags)
+{
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ long timeout;
+ DEFINE_WAIT(wait);
+ int err = -ERESTARTSYS;
+
+ if (flags & MSG_DONTWAIT) {
+ return -EAGAIN;
+ }
+
+ set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+ for (;;) {
+ if (signal_pending(current))
+ break;
+ prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+ timeout = MAX_SCHEDULE_TIMEOUT;
+ if (sk_wait_event(sk, &timeout, !ctx->more)) {
+ err = 0;
+ break;
+ }
+ }
+ finish_wait(sk_sleep(sk), &wait);
+
+ clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+ return err;
+}
+
+static void aead_data_wakeup(struct sock *sk)
+{
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ struct socket_wq *wq;
+
+ if (ctx->more)
+ return;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
+ POLLRDNORM |
+ POLLRDBAND);
+ sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+ rcu_read_unlock();
+}
+
+static int aead_sendmsg(struct kiocb *unused, struct socket *sock,
+ struct msghdr *msg, size_t size)
+{
+ struct sock *sk = sock->sk;
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ unsigned ivsize =
+ crypto_aead_ivsize(crypto_aead_reqtfm(&ctx->aead_req));
+ struct aead_sg_list *sgl = &ctx->tsgl;
+ struct af_alg_control con = {};
+ long copied = 0;
+ bool enc = 0;
+ bool init = 0;
+ int err = -EINVAL;
+
+ if (msg->msg_controllen) {
+ err = af_alg_cmsg_send(msg, &con);
+ if (err)
+ return err;
+
+ init = 1;
+ switch (con.op) {
+ case ALG_OP_ENCRYPT:
+ enc = 1;
+ break;
+ case ALG_OP_DECRYPT:
+ enc = 0;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (con.iv && con.iv->ivlen != ivsize)
+ return -EINVAL;
+ }
+
+ lock_sock(sk);
+ if (!ctx->more && ctx->used)
+ goto unlock;
+
+ if (init) {
+ ctx->enc = enc;
+ if (con.iv)
+ memcpy(ctx->iv, con.iv->iv, ivsize);
+
+ ctx->aead_assoclen = con.aead_assoclen;
+ }
+
+ while (size) {
+ unsigned long len = size;
+ struct scatterlist *sg = NULL;
+
+ /* use the existing memory in an allocated page */
+ if (ctx->merge) {
+ sg = sgl->sg + sgl->cur - 1;
+ len = min_t(unsigned long, len,
+ PAGE_SIZE - sg->offset - sg->length);
+ err = memcpy_from_msg(page_address(sg_page(sg)) +
+ sg->offset + sg->length,
+ msg, len);
+ if (err)
+ goto unlock;
+
+ sg->length += len;
+ ctx->merge = (sg->offset + sg->length) &
+ (PAGE_SIZE - 1);
+
+ ctx->used += len;
+ copied += len;
+ size -= len;
+ }
+
+ if (!aead_writable(sk)) {
+ /* user space sent too much data */
+ aead_put_sgl(sk);
+ err = -EMSGSIZE;
+ goto unlock;
+ }
+
+ /* allocate a new page */
+ len = min_t(unsigned long, size, aead_sndbuf(sk));
+ while (len) {
+ int plen = 0;
+
+ if (sgl->cur >= ALG_MAX_PAGES) {
+ err = -E2BIG;
+ goto unlock;
+ }
+
+ sg = sgl->sg + sgl->cur;
+ plen = min_t(int, len, PAGE_SIZE);
+
+ sg_assign_page(sg, alloc_page(GFP_KERNEL));
+ err = -ENOMEM;
+ if (!sg_page(sg))
+ goto unlock;
+
+ err = memcpy_from_msg(page_address(sg_page(sg)),
+ msg, plen);
+ if (err) {
+ __free_page(sg_page(sg));
+ sg_assign_page(sg, NULL);
+ goto unlock;
+ }
+
+ sg->offset = 0;
+ sg->length = plen;
+ len -= plen;
+ ctx->used += plen;
+ copied += plen;
+ sgl->cur++;
+ size -= plen;
+ ctx->merge = plen & (PAGE_SIZE - 1);
+ }
+ }
+
+ err = 0;
+
+ ctx->more = msg->msg_flags & MSG_MORE;
+ if (!ctx->more && !aead_sufficient_data(ctx)) {
+ aead_put_sgl(sk);
+ err = -EMSGSIZE;
+ }
+
+unlock:
+ aead_data_wakeup(sk);
+ release_sock(sk);
+
+ return err ?: copied;
+}
+
+static ssize_t aead_sendpage(struct socket *sock, struct page *page,
+ int offset, size_t size, int flags)
+{
+ struct sock *sk = sock->sk;
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ struct aead_sg_list *sgl = &ctx->tsgl;
+ int err = -EINVAL;
+
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ flags |= MSG_MORE;
+
+ if (sgl->cur >= ALG_MAX_PAGES)
+ return -E2BIG;
+
+ lock_sock(sk);
+ if (!ctx->more && ctx->used)
+ goto unlock;
+
+ if (!size)
+ goto done;
+
+ if (!aead_writable(sk)) {
+ /* user space sent too much data */
+ aead_put_sgl(sk);
+ err = -EMSGSIZE;
+ goto unlock;
+ }
+
+ ctx->merge = 0;
+
+ get_page(page);
+ sg_set_page(sgl->sg + sgl->cur, page, size, offset);
+ sgl->cur++;
+ ctx->used += size;
+
+ err = 0;
+
+done:
+ ctx->more = flags & MSG_MORE;
+ if (!ctx->more && !aead_sufficient_data(ctx)) {
+ aead_put_sgl(sk);
+ err = -EMSGSIZE;
+ }
+
+unlock:
+ aead_data_wakeup(sk);
+ release_sock(sk);
+
+ return err ?: size;
+}
+
+static int aead_recvmsg(struct kiocb *unused, struct socket *sock,
+ struct msghdr *msg, size_t ignored, int flags)
+{
+ struct sock *sk = sock->sk;
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ unsigned bs = crypto_aead_blocksize(crypto_aead_reqtfm(&ctx->aead_req));
+ unsigned as = crypto_aead_authsize(crypto_aead_reqtfm(&ctx->aead_req));
+ struct aead_sg_list *sgl = &ctx->tsgl;
+ struct scatterlist *sg = sgl->sg;
+ struct scatterlist assoc[ALG_MAX_PAGES];
+ size_t assoclen = 0;
+ unsigned int i = 0;
+ int err = -EINVAL;
+ unsigned long used = 0;
+ unsigned long outlen = 0;
+
+ /*
+ * Require exactly one IOV block as the AEAD operation is a one shot
+ * due to the authentication tag.
+ */
+ if (msg->msg_iter.nr_segs != 1)
+ return -ENOMSG;
+
+ lock_sock(sk);
+ /*
+ * AEAD memory structure: For encryption, the tag is appended to the
+ * ciphertext which implies that the memory allocated for the ciphertext
+ * must be increased by the tag length. For decryption, the tag
+ * is expected to be concatenated to the ciphertext. The plaintext
+ * therefore has a memory size of the ciphertext minus the tag length.
+ *
+ * The memory structure for cipher operation has the following
+ * structure:
+ * AEAD encryption input: assoc data || plaintext
+ * AEAD encryption output: cipherntext || auth tag
+ * AEAD decryption input: assoc data || ciphertext || auth tag
+ * AEAD decryption output: plaintext
+ */
+
+ if (ctx->more) {
+ err = aead_wait_for_data(sk, flags);
+ if (err)
+ goto unlock;
+ }
+
+ used = ctx->used;
+
+ /*
+ * Make sure sufficient data is present -- note, the same check is
+ * is also present in sendmsg/sendpage. The checks in sendpage/sendmsg
+ * shall provide an information to the data sender that something is
+ * wrong, but they are irrelevant to maintain the kernel integrity.
+ * We need this check here too in case user space decides to not honor
+ * the error message in sendmsg/sendpage and still call recvmsg. This
+ * check here protects the kernel integrity.
+ */
+ if (!aead_sufficient_data(ctx))
+ goto unlock;
+
+ /*
+ * The cipher operation input data is reduced by the associated data
+ * length as this data is processed separately later on.
+ */
+ used -= ctx->aead_assoclen;
+
+ if (ctx->enc) {
+ /* round up output buffer to multiple of block size */
+ outlen = ((used + bs - 1) / bs * bs);
+ /* add the size needed for the auth tag to be created */
+ outlen += as;
+ } else {
+ /* output data size is input without the authentication tag */
+ outlen = used - as;
+ /* round up output buffer to multiple of block size */
+ outlen = ((outlen + bs - 1) / bs * bs);
+ }
+
+ /* ensure output buffer is sufficiently large */
+ if (msg->msg_iter.iov->iov_len < outlen)
+ goto unlock;
+
+ outlen = af_alg_make_sg(&ctx->rsgl, msg->msg_iter.iov->iov_base,
+ outlen, 1);
+ err = outlen;
+ if (err < 0)
+ goto unlock;
+
+ err = -EINVAL;
+ sg_init_table(assoc, ALG_MAX_PAGES);
+ assoclen = ctx->aead_assoclen;
+ /*
+ * Split scatterlist into two: first part becomes AD, second part
+ * is plaintext / ciphertext. The first part is assigned to assoc
+ * scatterlist. When this loop finishes, sg points to the start of the
+ * plaintext / ciphertext.
+ */
+ for (i = 0; i < ctx->tsgl.cur; i++) {
+ sg = sgl->sg + i;
+ if (sg->length <= assoclen) {
+ /* AD is larger than one page */
+ sg_set_page(assoc + i, sg_page(sg),
+ sg->length, sg->offset);
+ assoclen -= sg->length;
+ if (i >= ctx->tsgl.cur)
+ goto unlock;
+ } else if (!assoclen) {
+ /* current page is to start of plaintext / ciphertext */
+ if (i)
+ /* AD terminates at page boundary */
+ sg_mark_end(assoc + i - 1);
+ else
+ /* AD size is zero */
+ sg_mark_end(assoc);
+ break;
+ } else {
+ /* AD does not terminate at page boundary */
+ sg_set_page(assoc + i, sg_page(sg),
+ assoclen, sg->offset);
+ sg_mark_end(assoc + i);
+ /* plaintext / ciphertext starts after AD */
+ sg->length -= assoclen;
+ sg->offset += assoclen;
+ break;
+ }
+ }
+
+ aead_request_set_assoc(&ctx->aead_req, assoc, ctx->aead_assoclen);
+ aead_request_set_crypt(&ctx->aead_req, sg, ctx->rsgl.sg, used, ctx->iv);
+
+ err = af_alg_wait_for_completion(ctx->enc ?
+ crypto_aead_encrypt(&ctx->aead_req) :
+ crypto_aead_decrypt(&ctx->aead_req),
+ &ctx->completion);
+
+ af_alg_free_sg(&ctx->rsgl);
+
+ if (err) {
+ /* EBADMSG implies a valid cipher operation took place */
+ if (err == -EBADMSG)
+ aead_put_sgl(sk);
+ goto unlock;
+ }
+
+ aead_put_sgl(sk);
+
+ err = 0;
+
+unlock:
+ aead_wmem_wakeup(sk);
+ release_sock(sk);
+
+ return err ? err : outlen;
+}
+
+static unsigned int aead_poll(struct file *file, struct socket *sock,
+ poll_table *wait)
+{
+ struct sock *sk = sock->sk;
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ unsigned int mask;
+
+ sock_poll_wait(file, sk_sleep(sk), wait);
+ mask = 0;
+
+ if (!ctx->more)
+ mask |= POLLIN | POLLRDNORM;
+
+ if (aead_writable(sk))
+ mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
+
+ return mask;
+}
+
+static struct proto_ops algif_aead_ops = {
+ .family = PF_ALG,
+
+ .connect = sock_no_connect,
+ .socketpair = sock_no_socketpair,
+ .getname = sock_no_getname,
+ .ioctl = sock_no_ioctl,
+ .listen = sock_no_listen,
+ .shutdown = sock_no_shutdown,
+ .getsockopt = sock_no_getsockopt,
+ .mmap = sock_no_mmap,
+ .bind = sock_no_bind,
+ .accept = sock_no_accept,
+ .setsockopt = sock_no_setsockopt,
+
+ .release = af_alg_release,
+ .sendmsg = aead_sendmsg,
+ .sendpage = aead_sendpage,
+ .recvmsg = aead_recvmsg,
+ .poll = aead_poll,
+};
+
+static void *aead_bind(const char *name, u32 type, u32 mask)
+{
+ return crypto_alloc_aead(name, type, mask);
+}
+
+static void aead_release(void *private)
+{
+ crypto_free_aead(private);
+}
+
+static int aead_setauthsize(void *private, unsigned int authsize)
+{
+ return crypto_aead_setauthsize(private, authsize);
+}
+
+static int aead_setkey(void *private, const u8 *key, unsigned int keylen)
+{
+ return crypto_aead_setkey(private, key, keylen);
+}
+
+static void aead_sock_destruct(struct sock *sk)
+{
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ unsigned int ivlen = crypto_aead_ivsize(
+ crypto_aead_reqtfm(&ctx->aead_req));
+
+ aead_put_sgl(sk);
+ sock_kzfree_s(sk, ctx->iv, ivlen);
+ sock_kfree_s(sk, ctx, ctx->len);
+ af_alg_release_parent(sk);
+}
+
+static int aead_accept_parent(void *private, struct sock *sk)
+{
+ struct aead_ctx *ctx;
+ struct alg_sock *ask = alg_sk(sk);
+ unsigned int len = sizeof(*ctx) + crypto_aead_reqsize(private);
+ unsigned int ivlen = crypto_aead_ivsize(private);
+
+ ctx = sock_kmalloc(sk, len, GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+ memset(ctx, 0, len);
+
+ ctx->iv = sock_kmalloc(sk, ivlen, GFP_KERNEL);
+ if (!ctx->iv) {
+ sock_kfree_s(sk, ctx, len);
+ return -ENOMEM;
+ }
+ memset(ctx->iv, 0, ivlen);
+
+ ctx->len = len;
+ ctx->used = 0;
+ ctx->more = 0;
+ ctx->merge = 0;
+ ctx->enc = 0;
+ ctx->tsgl.cur = 0;
+ ctx->aead_assoclen = 0;
+ af_alg_init_completion(&ctx->completion);
+ sg_init_table(ctx->tsgl.sg, ALG_MAX_PAGES);
+
+ ask->private = ctx;
+
+ aead_request_set_tfm(&ctx->aead_req, private);
+ aead_request_set_callback(&ctx->aead_req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+ af_alg_complete, &ctx->completion);
+
+ sk->sk_destruct = aead_sock_destruct;
+
+ return 0;
+}
+
+static const struct af_alg_type algif_type_aead = {
+ .bind = aead_bind,
+ .release = aead_release,
+ .setkey = aead_setkey,
+ .setauthsize = aead_setauthsize,
+ .accept = aead_accept_parent,
+ .ops = &algif_aead_ops,
+ .name = "aead",
+ .owner = THIS_MODULE
+};
+
+static int __init algif_aead_init(void)
+{
+ return af_alg_register_type(&algif_type_aead);
+}
+
+static void __exit algif_aead_exit(void)
+{
+ int err = af_alg_unregister_type(&algif_type_aead);
+ BUG_ON(err);
+}
+
+module_init(algif_aead_init);
+module_exit(algif_aead_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Stephan Mueller <smueller-T9tCv8IpfcWELgA04lAiVw@public.gmane.org>");
+MODULE_DESCRIPTION("AEAD kernel crypto API user space interface");
--
2.1.0
^ permalink raw reply related
* [PATCH v11 2/2] crypto: AF_ALG: enable AEAD interface compilation
From: Stephan Mueller @ 2015-01-21 1:19 UTC (permalink / raw)
To: 'Herbert Xu'
Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
linux-crypto, linux-api, Neil Horman
In-Reply-To: <1923793.K38mGRD6eo@tachyon.chronox.de>
Enable compilation of the AEAD AF_ALG support and provide a Kconfig
option to compile the AEAD AF_ALG support.
Signed-off-by: Stephan Mueller <smueller@chronox.de>
---
crypto/Kconfig | 9 +++++++++
crypto/Makefile | 1 +
2 files changed, 10 insertions(+)
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 50f4da4..41a3fc5 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1523,6 +1523,15 @@ config CRYPTO_USER_API_RNG
This option enables the user-spaces interface for random
number generator algorithms.
+config CRYPTO_USER_API_AEAD
+ tristate "User-space interface for AEAD cipher algorithms"
+ depends on NET
+ select CRYPTO_AEAD
+ select CRYPTO_USER_API
+ help
+ This option enables the user-spaces interface for AEAD
+ cipher algorithms.
+
config CRYPTO_HASH_INFO
bool
diff --git a/crypto/Makefile b/crypto/Makefile
index ba19465..97b7d3a 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
obj-$(CONFIG_CRYPTO_USER_API_RNG) += algif_rng.o
+obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o
#
# generic algorithms and the async_tx api
--
2.1.0
^ permalink raw reply related
* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Fam Zheng @ 2015-01-21 4:59 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
Peter Zijlstra <peter>
In-Reply-To: <54BE4F1D.7090807@gmail.com>
On Tue, 01/20 13:50, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
>
> On 01/20/2015 10:57 AM, Fam Zheng wrote:
> > This syscall is a sequence of
> >
> > 1) a number of epoll_ctl calls
> > 2) a epoll_pwait, with timeout enhancement.
> >
> > The epoll_ctl operations are embeded so that application doesn't have to use
> > separate syscalls to insert/delete/update the fds before poll. It is more
> > efficient if the set of fds varies from one poll to another, which is the
> > common pattern for certain applications.
>
> Which applications? Could we have some specific examples? This is a
> complex API, and it needs good justification.
OK, I'll explain more in v2.
>
> > For example, depending on the input
> > buffer status, a data reading program may decide to temporarily not polling an
> > fd.
> >
> > Because the enablement of batching in this interface, even that regular
> > epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> > single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
> ^^^^^^^^^^^^^^ should be epoll_mod_wait
>
> I think you mean to say:
>
> The ability to batch multiple "epoll_ctl" operations into a single call
> means that even when no wait events are requested (i.e., spec == NULL),
> poll_mod_wait() provides a performance optimization over using multiple
> epoll_ctl() calls.
>
> Right? If yes, please amend the commit message, and this text should
> also make its way into the revised man page under a heading "NOTES".
OK.
>
> > The only complexity is returning the result of each operation. For each
> > epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> > the return code *iff* the command is executed (0 for success and -errno of the
> > equivalent epoll_ctl call), and will be left unchanged if the command is not
> > executed because some earlier error, for example due to failure of
> > copy_from_user to copy the array.
> >
> > Applications can utilize this fact to do error handling: they could initialize
> > all the epoll_mod_wait.error to a positive value, which is by definition not a
> > possible output value from epoll_mod_wait. Then when the syscall returned, they
> > know whether or not the command is executed by comparing each error with the
> > init value, if they're different, they have the result of the command.
> > More roughly, they can put any non-zero and not distinguish "not run" from
> > failure.
>
> The "cmds' are not executed in a specified order plus the need to
> initialize the 'errors' fields to a positive value feels a bit ugly.
> And indeed the whole "command list was only partially run" case
> is not pretty. Am I correct to understand that if an error is found
> during execution of one of the "epoll_ctl" commands in 'cmds' then
> the system call will return -1 with errno set, indicating an error,
> even though the epoll interest list may have changed because some
> of the earlier 'cmds' executed successfully? This all seems a bit of
> a headache for user space.
This is the trade-off for batching. The best we can do is probably make this
transactional: none or all of the commands succeeds. It will require a much
more complex implementation, though. But even with that, the error reporting on
which command failed is a complication.
>
> I have a couple of questions:
>
> Q1. I can see that batching "epoll_ctl" commands might be useful,
> since it results in fewer systems calls. But, does it really
> need to be bound together with the "epoll_pwait" functionality?
> (Perhaps this point was covered in previous discussions, but
> neither the message accompanying this patch nor the 0/6 man page
> provide a compelling rationale for the need to bind these two
> operations together.)
>
> Yes, I realize you might save a system call, but it makes for a
> cumbersome API that has the above headache, and also forces the
> need for double pointer indirection in the 'spec' argument (i.e.,
> spec is a pointer to an array of structures where each element
> in turn includes an 'events' pointer that points to another array).
>
> Why not a simpler API with two syscalls such as:
>
> epoll_ctl_batch(int epfd, int flags,
> int ncmds, struct epoll_mod_cmd *cmds);
>
> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents,
> struct timespec *timeout, int clock_id,
> const sigset_t *sigmask, size_t sigsetsize);
The problem is that there is no room for flags field in epoll_pwait1, which is
asked for, in previous discussion thread [1].
I don't see epoll_mod_wait as a *significantly more* complicated interface
compared to epoll_ctl_batch and epoll_pwait1 above. In epoll_mod_wait, if you
leave out ncmds and cmds, it is effectively a poll without batch; and if
leaving out spec, it is effectively a batch without poll.
The most important change here is the timeout. IMO I wouldn't mind leaving out
batching. Integrating it is requested by Andy:
[1]: http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591
which also made sense to me; I do have a patch in QEMU to call epoll_ctl for a
number of times right before epoll_wait.
[Sorry for not putting anything into cover letter changelog, but it is also
interesting to see people's reaction on the patch itself without bias of
others' opinions. This indeed brings in more points. :]
>
> This gives us much of the benefit of reducing system calls, but
> with greater simplicity. And epoll_ctl_batch() could simply return
> the number of 'cmds' that were successfully executed.)
>
> Q2. In the man page in 0/6 you said that the 'cmds' were not
> guaranteed to be executed in order. Why not? If you did provide
> such a guarantee, then, when using your current epoll_mod_wait(),
> user space could do the following:
I guess we can make a guarentee on that.
>
> 1. Initialize the cmd.errors fields to zero.
> 2. Call epoll_ctl_mod()
> 3. Iterate through cmd.errors looking for the first nonzero
> field.
It's close, but zero is not good enough, if copy_from_user of cmds failed in
the first place. Impossible value, or error value, will be safer.
>
> > Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> > scalar. This provides higher precision.
>
> Yes, that change seemed inevitable. It slightly puzzled me at the time when
> Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
> though pselect() already had demonstrated the need for higher precision.
> I should have called it out way back then :-{.
>
> > The parameter field in struct
> > epoll_wait_spec, "clockid", also makes it possible for users to use a different
> > clock than the default when it makes more sense.
> >
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> > fs/eventpoll.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/syscalls.h | 5 ++++
> > 2 files changed, 65 insertions(+)
> >
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index e7a116d..2cc22c9 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
> > sigmask ? &ksigmask : NULL);
> > }
> >
> > +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> > + int, ncmds, struct epoll_mod_cmd __user *, cmds,
> > + struct epoll_wait_spec __user *, spec)
> > +{
> > + struct epoll_mod_cmd *kcmds = NULL;
> > + int i, ret = 0;
> > + int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> > +
> > + if (flags)
> > + return -EINVAL;
> > + if (ncmds) {
> > + if (!cmds)
> > + return -EINVAL;
> > + kcmds = kmalloc(cmd_size, GFP_KERNEL);
> > + if (!kcmds)
> > + return -ENOMEM;
> > + if (copy_from_user(kcmds, cmds, cmd_size)) {
> > + ret = -EFAULT;
> > + goto out;
> > + }
> > + }
> > + for (i = 0; i < ncmds; i++) {
> > + struct epoll_event ev = (struct epoll_event) {
> > + .events = kcmds[i].events,
> > + .data = kcmds[i].data,
> > + };
> > + if (kcmds[i].flags) {
> > + kcmds[i].error = ret = -EINVAL;
>
> To make the 'ret' change a little more obvious, maybe it's better to write
>
> ret = kcmds[i].error = -EINVAL;
>
> > + goto out;
> > + }
> > + kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
>
> Likewise:
> ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
>
> > + if (ret)
> > + goto out;
> > + }
> > + if (spec) {
> > + sigset_t ksigmask;
> > + struct epoll_wait_spec kspec;
> > + ktime_t timeout;
> > +
> > + if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
>
> Cosmetic point: s/if(/if (/
>
> > + return -EFAULT;
> > + if (kspec.sigmask) {
> > + if (kspec.sigsetsize != sizeof(sigset_t))
> > + return -EINVAL;
> > + if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> > + return -EFAULT;
> > + }
> > + timeout = timespec_to_ktime(kspec.timeout);
> > + ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> > + kspec.clockid, timeout,
> > + kspec.sigmask ? &ksigmask : NULL);
>
> If I understand correctly, the implementation means that the
> 'size_t sigsetsize' field will probably need to be exposed to
> applications. In the existing epoll_pwait() call (as in ppoll()
> and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
> However, unless we expect glibc to do some structure copying to/from
> a structure that hides this field, then we're going end up exposing
> 'size_t sigsetsize' to applications. (This could be avoided, if we
> split the API as I suggest above. glibc would do the same thing
> in epoll_pwait1() that it currently does in epoll_pwait().)
>
> Thanks,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply
* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Michael Kerrisk (man-pages) @ 2015-01-21 5:55 UTC (permalink / raw)
To: Andy Lutomirski, Fam Zheng
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers
In-Reply-To: <CALCETrU4TeG1ShVLkQgqQ6usFm8pg_t0D8K=Mi_UJGSfxUwXtA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On 01/20/2015 11:40 PM, Andy Lutomirski wrote:
> On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> This adds a new system call, epoll_mod_wait. It's described as below:
[...]
>> There is no guartantee that all the commands are executed in order. Only
>> if all the commands are successfully executed (all the error fields are
>> set to 0), events are polled.
>
> If this doesn't happen, what error is returned?
If I read the code correctly: the error of the first epoll_ctl op that fails.
[...]
>> RETURN VALUE
>>
>> When any error occurs, epoll_mod_wait() returns -1 and errno is set
>> appropriately. All the "error" fields in cmds are unchanged before they
>> are executed, and if any cmds are executed, the "error" fields are set
>> to a return code accordingly. See also epoll_ctl for more details of the
>> return code.
>
> Does this mean that callers should initialize the error fields to an
> impossible value first so they can tell which commands were executed?
Yes. (Ugly!)
[...]
>> ERRORS
>>
>> These errors apply on either the return value of epoll_mod_wait or error
>> status for each command, respectively.
>
> Please clarify which errors are returned overall and which are per-command.
Yes, I think this would be valuable as well.
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox