Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v3 00/13] Add kdbus implementation
From: Michael Kerrisk (man-pages) @ 2015-01-20 14:12 UTC (permalink / raw)
  To: Johannes Stezenbach, Greg Kroah-Hartman
  Cc: mtk.manpages, arnd, ebiederm, gnomes, teg, jkosina, luto,
	linux-api, linux-kernel, daniel, dh.herrmann, tixxdz
In-Reply-To: <20150120132437.GB7545@sig21.net>

On 01/20/2015 02:24 PM, Johannes Stezenbach wrote:
> On Tue, Jan 20, 2015 at 07:26:09PM +0800, Greg Kroah-Hartman wrote:
>> On Tue, Jan 20, 2015 at 11:57:12AM +0100, Johannes Stezenbach wrote:

>>> My guess is that the people porting from QNX were just confused
>>> and their use of D-Bus was in error.  Maybe they should've used
>>> plain sockets, capnproto, ZeroMQ or whatever.
>>
>> I tend to trust that they knew what they were doing, they wouldn't have
>> picked D-Bus for no good reason.
> 
> The automotive developers I had the pleasure to work with would
> use anything which is available via a mouse click in the
> commercial Embedded Linux SDK IDE of their choice :)
> Let's face it: QNX has a single IPC solution while Linux has
> a confusing multitude of possibilities.

Greg, from my spell in IVI, I too have to say your faith in the
wisdom of IVI developers' choices is touching. I think D-Bus was 
in the main picked because it had some nice features, but then 
people realized it had no bandwidth, and the solution has been 
"make D-Bus faster", rather than "maybe we should explore 
other (mixed model) solutions". This isn't to say that I'm
against adding kdbus, but I don't think there's much strength to
the argument you make above.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH v3 00/13] Add kdbus implementation
From: Michael Kerrisk (man-pages) @ 2015-01-20 14:15 UTC (permalink / raw)
  To: Greg Kroah-Hartman, arnd, ebiederm, gnomes, teg, jkosina, luto,
	linux-api, linux-kernel
  Cc: mtk.manpages, Daniel Mack, dh.herrmann, tixxdz
In-Reply-To: <1421435777-25306-1-git-send-email-gregkh@linuxfoundation.org>

[Bother. Futzed Daniel Mack's email address. Resending]

On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> kdbus is a kernel-level IPC implementation that aims for resemblance to
> the the protocol layer with the existing userspace D-Bus daemon while
> enabling some features that couldn't be implemented before in userspace.
> 
> The documentation in the first patch in this series explains the
> protocol and the API details.
> 
> Full details on what has changed from the v2 submission are at the
> bottom of this email.
> 
> Reasons why this should be done in the kernel, instead of userspace as
> it is currently done today include the following:
> 
> - performance: fewer process context switches, fewer copies, fewer
>   syscalls, larger memory chunks via memfd.  This is really important
>   for a whole class of userspace programs that are ported from other
>   operating systems that are run on tiny ARM systems that rely on
>   hundreds of thousands of messages passed at boot time, and at
>   "critical" times in their user interaction loops.
> - security: the peers which communicate do not have to trust each other,
>   as the only trustworthy compoenent in the game is the kernel which
>   adds metadata and ensures that all data passed as payload is either
>   copied or sealed, so that the receiver can parse the data without
>   having to protect against changing memory while parsing buffers.  Also,
>   all the data transfer is controlled by the kernel, so that LSMs can
>   track and control what is going on, without involving userspace.
>   Because of the LSM issue, security people are much happier with this
>   model than the current scheme of having to hook into dbus to mediate
>   things.
> - more metadata can be attached to messages than in userspace
> - semantics for apps with heavy data payloads (media apps, for instance)
>   with optinal priority message dequeuing, and global message ordering.
>   Some "crazy" people are playing with using kdbus for audio data in the
>   system.  I'm not saying that this is the best model for this, but
>   until now, there wasn't any other way to do this without having to
>   create custom "busses", one for each application library.
> - being in the kernle closes a lot of races which can't be fixed with
>   the current userspace solutions.  For example, with kdbus, there is a
>   way a client can disconnect from a bus, but do so only if no further
>   messages present in its queue, which is crucial for implementing
>   race-free "exit-on-idle" services
> - eavesdropping on the kernel level, so privileged users can hook into
>   the message stream without hacking support for that into their
>   userspace processes
> - a number of smaller benefits: for example kdbus learned a way to peek
>   full messages without dequeing them, which is really useful for
>   logging metadata when handling bus-activation requests.
> 
> Of course, some of the bits above could be implemented in userspace
> alone, for example with more sophisticated memory management APIs, but
> this is usually done by losing out on the other details.  For example,
> for many of the memory management APIs, it's hard to not require the
> communicating peers to fully trust each other.  And we _really_ don't
> want peers to have to trust each other.
> 
> Another benefit of having this in the kernel, rather than as a userspace
> daemon, is that you can now easily use the bus from the initrd, or up to
> the very end when the system shuts down.  On current userspace D-Bus,
> this is not really possible, as this requires passing the bus instance
> around between initrd and the "real" system.  Such a transition of all
> fds also requires keeping full state of what has already been read from
> the connection fds.  kdbus makes this much simpler, as we can change the
> ownership of the bus, just by passing one fd over from one part to the
> other.

I tend to think that much of the above should also be part of the 
documentation file (patch 01/13).

Cheers,

Michael


 
> Regarding binder: binder and kdbus follow very different design
> concepts.  Binder implies the use of thread-pools to dispatch incoming
> method calls.  This is a very efficient scheme, and completely natural
> in programming languages like Java.  On most Linux programs, however,
> there's a much stronger focus on central poll() loops that dispatch all
> sources a program cares about.  kdbus is much more usable in such
> environments, as it doesn't enforce a threading model, and it is happy
> with serialized dispatching.  In fact, this major difference had an
> effect on much of the design decisions: binder does not guarantee global
> message ordering due to the parallel dispatching in the thread-pools,
> but  kdbus does.  Moreover, there's also a difference in the way message
> handling.  In kdbus, every message is basically taken and dispatched as
> one blob, while in binder, continious connections to other peers are
> created, which are then used to send messages on.  Hence, the models are
> quite different, and they serve different needs.  I believe that the
> D-Bus/kdbus model is more compatible and friendly with how Linux
> programs are usually implemented.
> 
> This can also be found in a git tree, the kdbus branch of char-misc.git at:
>         https://git.kernel.org/cgit/linux/kernel/git/gregkh/char-misc.git/
> 
> Changes since v2:
> 
>   * Add FS_USERNS_MOUNT to the file system flags, so users can mount
>     their own kdbusfs instances without being root in the parent
>     user-ns. Spotted by Andy Lutomirski.
> 
>   * Rewrite major parts of the metadata implementation to allow for
>     per-recipient namespace translations. For this, namespaces are
>     now not pinned by domains anymore. Instead, metadata is recorded
>     in kernel scope, and exported into the currently active namespaces
>     at the time of message installing.
> 
>   * Split PID and TID from KDBUS_ITEM_CREDS into KDBUS_ITEM_PIDS.
>     The starttime is there to detect re-used PIDs, so move it to that
>     new item type as well. Consequently, introduce struct kdbus_pids
>     to accommodate the information. Requested by Andy Lutomirski.
> 
>   * Add {e,s,fs}{u,g}id to KDBUS_ITEM_CREDS, so users have a way to
>     get more fine-grained credential information.
> 
>   * Removed KDBUS_CMD_CANCEL. The interface was not usable from
>     threaded userspace implementation due to inherent races. Instead,
>     add an item type CANCEL_FD which can be used to pass a file
>     descriptor to the CMD_SEND ioctl. When the SEND is done
>     synchronously, it will get cancelled as soon as the passed
>     FD signals POLLIN.
> 
>   * Dropped startttime from KDBUS_ITEM_PIDS
> 
>   * Restrict names of custom endpoints to names with a "<uid>-" prefix,
>     just like we do for buses.
> 
>   * Provide module-parameter "kdbus.attach_flags_mask" to specify the
>     a mask of metadata items that is applied on all exported items.
> 
>   * Monitors are now entirely invisible (IOW, there won't be any
>     notification when they are created) and they don't need to install
>     filters for broadcast messages anymore.
> 
>   * All information exposed via a connection's pool now also reports
>     the length in addition to the offset. That way, userspace
>     applications can mmap() only parts of the pool on demand.
> 
>   * Due to the metadata rework, KDBUS_ITEM_PAYLOAD_OFF items now
>     describe the offset relative to the pool, where they used to be
>     relative to the message header.
> 
>   * Added return_flags bitmask to all kdbus_cmd_* structs, so the
>     kernel can report details of the command processing. This is
>     mostly reserved for future extensions.
> 
>   * Some fixes in kdbus.txt and tests, spotted by Harald Hoyer, Andy
>     Lutomirski, Michele Curti, Sergei Zviagintsev, Sheng Yong, Torstein
>     Husebø and Hristo Venev.
> 
>   * Fixed compiler warnings in test-message by Michele Curti
> 
>   * Unexpected items are now rejected with -EINVAL
> 
>   * Split signal and broadcast handling. Unicast signals are now
>     supported, and messages have a new KDBUS_MSG_SIGNAL flag.
> 
>   * KDBUS_CMD_MSG_SEND was renamed to KDBUS_CMD_SEND, and now takes
>     a struct kdbus_cmd_send instead of a kdbus_msg.
> 
>   * KDBUS_CMD_MSG_RECV was renamed to KDBUS_CMD_RECV.
> 
>   * Test case memory leak plugged, and various other cleanups and
>     fixes, by Rui Miguel Silva.
> 
>   * Build fix for s390
> 
>   * Test case fix for 32bit archs
> 
>   * The test framework now supports mount, pid and user namespaces.
> 
>   * The test framework learned a --tap command line parameter to
>     format its output in the "Test Anything Protocol". This format
>     is chosen by default when "make kselftest" is invoked.
> 
>   * Fixed buses and custom endpoints name validation, reported by
>     Andy Lutomirski.
> 
>   * copy_from_user() return code issue fixed, reported by
>     Dan Carpenter.
> 
>   * Avoid signed int overflow on archs without atomic_sub
> 
>   * Avoid variable size stack items. Fixes a sparse warning in queue.c.
> 
>   * New test case for kernel notification quota
> 
>   * Switched back to enums for the list of ioctls. This has advantages
>     for userspace code as gdb, for instance, is able to resolve the
>     numbers into names. Added features can easily be detected with
>     autotools, and new iotcls can get #defines as well. Having #defines
>     for the initial set of ioctls is uncecessary.
> 
> Daniel Mack (13):
>   kdbus: add documentation
>   kdbus: add header file
>   kdbus: add driver skeleton, ioctl entry points and utility functions
>   kdbus: add connection pool implementation
>   kdbus: add connection, queue handling and message validation code
>   kdbus: add node and filesystem implementation
>   kdbus: add code to gather metadata
>   kdbus: add code for notifications and matches
>   kdbus: add code for buses, domains and endpoints
>   kdbus: add name registry implementation
>   kdbus: add policy database implementation
>   kdbus: add Makefile, Kconfig and MAINTAINERS entry
>   kdbus: add selftests
> 
>  Documentation/ioctl/ioctl-number.txt              |    1 +
>  Documentation/kdbus.txt                           | 2107 +++++++++++++++++++++
>  MAINTAINERS                                       |   12 +
>  include/uapi/linux/Kbuild                         |    1 +
>  include/uapi/linux/kdbus.h                        | 1049 ++++++++++
>  include/uapi/linux/magic.h                        |    2 +
>  init/Kconfig                                      |   12 +
>  ipc/Makefile                                      |    2 +-
>  ipc/kdbus/Makefile                                |   22 +
>  ipc/kdbus/bus.c                                   |  553 ++++++
>  ipc/kdbus/bus.h                                   |  103 +
>  ipc/kdbus/connection.c                            | 2004 ++++++++++++++++++++
>  ipc/kdbus/connection.h                            |  262 +++
>  ipc/kdbus/domain.c                                |  350 ++++
>  ipc/kdbus/domain.h                                |   84 +
>  ipc/kdbus/endpoint.c                              |  232 +++
>  ipc/kdbus/endpoint.h                              |   68 +
>  ipc/kdbus/fs.c                                    |  519 +++++
>  ipc/kdbus/fs.h                                    |   25 +
>  ipc/kdbus/handle.c                                | 1134 +++++++++++
>  ipc/kdbus/handle.h                                |   20 +
>  ipc/kdbus/item.c                                  |  309 +++
>  ipc/kdbus/item.h                                  |   57 +
>  ipc/kdbus/limits.h                                |   95 +
>  ipc/kdbus/main.c                                  |   72 +
>  ipc/kdbus/match.c                                 |  535 ++++++
>  ipc/kdbus/match.h                                 |   32 +
>  ipc/kdbus/message.c                               |  598 ++++++
>  ipc/kdbus/message.h                               |  133 ++
>  ipc/kdbus/metadata.c                              | 1066 +++++++++++
>  ipc/kdbus/metadata.h                              |   52 +
>  ipc/kdbus/names.c                                 |  891 +++++++++
>  ipc/kdbus/names.h                                 |   82 +
>  ipc/kdbus/node.c                                  |  910 +++++++++
>  ipc/kdbus/node.h                                  |   87 +
>  ipc/kdbus/notify.c                                |  244 +++
>  ipc/kdbus/notify.h                                |   30 +
>  ipc/kdbus/policy.c                                |  481 +++++
>  ipc/kdbus/policy.h                                |   51 +
>  ipc/kdbus/pool.c                                  |  784 ++++++++
>  ipc/kdbus/pool.h                                  |   47 +
>  ipc/kdbus/queue.c                                 |  505 +++++
>  ipc/kdbus/queue.h                                 |  108 ++
>  ipc/kdbus/reply.c                                 |  262 +++
>  ipc/kdbus/reply.h                                 |   68 +
>  ipc/kdbus/util.c                                  |  317 ++++
>  ipc/kdbus/util.h                                  |  133 ++
>  tools/testing/selftests/Makefile                  |    1 +
>  tools/testing/selftests/kdbus/.gitignore          |   11 +
>  tools/testing/selftests/kdbus/Makefile            |   46 +
>  tools/testing/selftests/kdbus/kdbus-enum.c        |   95 +
>  tools/testing/selftests/kdbus/kdbus-enum.h        |   14 +
>  tools/testing/selftests/kdbus/kdbus-test.c        |  920 +++++++++
>  tools/testing/selftests/kdbus/kdbus-test.h        |   85 +
>  tools/testing/selftests/kdbus/kdbus-util.c        | 1646 ++++++++++++++++
>  tools/testing/selftests/kdbus/kdbus-util.h        |  216 +++
>  tools/testing/selftests/kdbus/test-activator.c    |  319 ++++
>  tools/testing/selftests/kdbus/test-attach-flags.c |  751 ++++++++
>  tools/testing/selftests/kdbus/test-benchmark.c    |  427 +++++
>  tools/testing/selftests/kdbus/test-bus.c          |  174 ++
>  tools/testing/selftests/kdbus/test-chat.c         |  123 ++
>  tools/testing/selftests/kdbus/test-connection.c   |  611 ++++++
>  tools/testing/selftests/kdbus/test-daemon.c       |   66 +
>  tools/testing/selftests/kdbus/test-endpoint.c     |  344 ++++
>  tools/testing/selftests/kdbus/test-fd.c           |  710 +++++++
>  tools/testing/selftests/kdbus/test-free.c         |   36 +
>  tools/testing/selftests/kdbus/test-match.c        |  442 +++++
>  tools/testing/selftests/kdbus/test-message.c      |  658 +++++++
>  tools/testing/selftests/kdbus/test-metadata-ns.c  |  507 +++++
>  tools/testing/selftests/kdbus/test-monitor.c      |  158 ++
>  tools/testing/selftests/kdbus/test-names.c        |  184 ++
>  tools/testing/selftests/kdbus/test-policy-ns.c    |  633 +++++++
>  tools/testing/selftests/kdbus/test-policy-priv.c  | 1270 +++++++++++++
>  tools/testing/selftests/kdbus/test-policy.c       |   81 +
>  tools/testing/selftests/kdbus/test-race.c         |  313 +++
>  tools/testing/selftests/kdbus/test-sync.c         |  368 ++++
>  tools/testing/selftests/kdbus/test-timeout.c      |   99 +
>  77 files changed, 27818 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/kdbus.txt
>  create mode 100644 include/uapi/linux/kdbus.h
>  create mode 100644 ipc/kdbus/Makefile
>  create mode 100644 ipc/kdbus/bus.c
>  create mode 100644 ipc/kdbus/bus.h
>  create mode 100644 ipc/kdbus/connection.c
>  create mode 100644 ipc/kdbus/connection.h
>  create mode 100644 ipc/kdbus/domain.c
>  create mode 100644 ipc/kdbus/domain.h
>  create mode 100644 ipc/kdbus/endpoint.c
>  create mode 100644 ipc/kdbus/endpoint.h
>  create mode 100644 ipc/kdbus/fs.c
>  create mode 100644 ipc/kdbus/fs.h
>  create mode 100644 ipc/kdbus/handle.c
>  create mode 100644 ipc/kdbus/handle.h
>  create mode 100644 ipc/kdbus/item.c
>  create mode 100644 ipc/kdbus/item.h
>  create mode 100644 ipc/kdbus/limits.h
>  create mode 100644 ipc/kdbus/main.c
>  create mode 100644 ipc/kdbus/match.c
>  create mode 100644 ipc/kdbus/match.h
>  create mode 100644 ipc/kdbus/message.c
>  create mode 100644 ipc/kdbus/message.h
>  create mode 100644 ipc/kdbus/metadata.c
>  create mode 100644 ipc/kdbus/metadata.h
>  create mode 100644 ipc/kdbus/names.c
>  create mode 100644 ipc/kdbus/names.h
>  create mode 100644 ipc/kdbus/node.c
>  create mode 100644 ipc/kdbus/node.h
>  create mode 100644 ipc/kdbus/notify.c
>  create mode 100644 ipc/kdbus/notify.h
>  create mode 100644 ipc/kdbus/policy.c
>  create mode 100644 ipc/kdbus/policy.h
>  create mode 100644 ipc/kdbus/pool.c
>  create mode 100644 ipc/kdbus/pool.h
>  create mode 100644 ipc/kdbus/queue.c
>  create mode 100644 ipc/kdbus/queue.h
>  create mode 100644 ipc/kdbus/reply.c
>  create mode 100644 ipc/kdbus/reply.h
>  create mode 100644 ipc/kdbus/util.c
>  create mode 100644 ipc/kdbus/util.h
>  create mode 100644 tools/testing/selftests/kdbus/.gitignore
>  create mode 100644 tools/testing/selftests/kdbus/Makefile
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-enum.c
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-enum.h
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-test.c
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-test.h
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-util.c
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-util.h
>  create mode 100644 tools/testing/selftests/kdbus/test-activator.c
>  create mode 100644 tools/testing/selftests/kdbus/test-attach-flags.c
>  create mode 100644 tools/testing/selftests/kdbus/test-benchmark.c
>  create mode 100644 tools/testing/selftests/kdbus/test-bus.c
>  create mode 100644 tools/testing/selftests/kdbus/test-chat.c
>  create mode 100644 tools/testing/selftests/kdbus/test-connection.c
>  create mode 100644 tools/testing/selftests/kdbus/test-daemon.c
>  create mode 100644 tools/testing/selftests/kdbus/test-endpoint.c
>  create mode 100644 tools/testing/selftests/kdbus/test-fd.c
>  create mode 100644 tools/testing/selftests/kdbus/test-free.c
>  create mode 100644 tools/testing/selftests/kdbus/test-match.c
>  create mode 100644 tools/testing/selftests/kdbus/test-message.c
>  create mode 100644 tools/testing/selftests/kdbus/test-metadata-ns.c
>  create mode 100644 tools/testing/selftests/kdbus/test-monitor.c
>  create mode 100644 tools/testing/selftests/kdbus/test-names.c
>  create mode 100644 tools/testing/selftests/kdbus/test-policy-ns.c
>  create mode 100644 tools/testing/selftests/kdbus/test-policy-priv.c
>  create mode 100644 tools/testing/selftests/kdbus/test-policy.c
>  create mode 100644 tools/testing/selftests/kdbus/test-race.c
>  create mode 100644 tools/testing/selftests/kdbus/test-sync.c
>  create mode 100644 tools/testing/selftests/kdbus/test-timeout.c
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: David Herrmann @ 2015-01-20 14:31 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
	Linux API, linux-kernel, Daniel Mack, Djalal Harouni, Daniel Mack,
	Johannes Stezenbach
In-Reply-To: <54BE5DC8.70706-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Hi Michael

On Tue, Jan 20, 2015 at 2:53 PM, Michael Kerrisk (man-pages)
<mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
>> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>>
>> kdbus is a system for low-latency, low-overhead, easy to use
>> interprocess communication (IPC).
>>
>> The interface to all functions in this driver is implemented via ioctls
>> on files exposed through a filesystem called 'kdbusfs'. The default
>> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
>> documentation about the kernel level API design.
>
> I have some details feedback on the contents of this file, and some
> bigger questions. I'll split them out into separate mails.
>
> So here, the bigger, general questions to start with. I've arrived late
> to this, so sorry if they've already been discussed, but the answers to
> some of the questions should actually be in this file, I would have
> expected.
>
> This is an enormous and complex API. Why is the API ioctl() based,
> rather than system-call-based? Have we learned nothing from the hydra
> that the futex() multiplexing syscall became? (And kdbus is an order
> of magnitude more complex, by the look of things.) At the very least,
> a *good* justification of why the API is ioctl()-based should be part
> of this documentation file.
>
> An observation: The documentation below is substantial, but this API is
> enormous, so the documentation still feels rather thin. What would
> really help would be some example code in the doc.
>
> And on the subject of code examples... Are there any (prototype)
> working user-space applications that exercise the current kdbus
> implementation? That is, can I install these kdbus patches, and
> then find a simple example application somewhere that does
> something to exercise kdbus?

If you run a 3.18 kernel, you can install kdbus.ko from our repository
and boot a full Fedora system running Gnome3 with kdbus, given that
you compiled systemd with --enable-kdbus (which is still
experimental). No legacy dbus1 daemon is running. Instead, we have a
bus-proxy that converts classic dbus1 to kdbus, so all
bus-communication runs on kdbus.

> And then: is there any substantial real-world application (e.g., a
> full D-Bus port) that is being developed in tandem with this kernel
> side patch? (I don't mean a user-space library; I mean a seriously
> large application.) This is an incredibly complex API whose
> failings are only going to become evident through real-world use.
> Solidifying an API in the kernel and then discovering the API
> problems later when writing real-world applications would make for
> a sad story. A story something like that of inotify, an API which
> is an order of magnitude less complex than kdbus. (I can't help but
> feel that many of inotify problems that I discuss at
> https://lwn.net/Articles/605128/ might have been fixed or mitigated
> if a few real-world applications had been implemented before the
> API  was set in stone.)

I think running a whole Gnome3 stack counts as "substantial real-world
application", right? Sure, it's a dbus1-to-kdbus layer, but all the
systemd tools use kdbus natively and it works just fine. In fact, we
all run kdbus on our main-systems every day.

We've spent over a year fixing races and API misdesigns, we've talked
to other toolkit developers (glib, qt, ..) and made sure we're
backwards compatible to dbus1. I don't think the API is perfect,
everyone makes mistakes. But with bus-proxy and systemd we have two
huge users of kdbus that put a lot of pressure on API design.

>> +For a kdbus specific userspace library implementation please refer to:
>> +  http://cgit.freedesktop.org/systemd/systemd/tree/src/systemd/sd-bus.h
>
> Is this library intended just for systemd? More generally, is there an
> intention to provide a general purpose library API for kdbus? Or is the
> intention that each application will roll a library suitable to its
> needs? I think an answer to that question would be useful in this
> Documentation file.

kdbus is in no way bound to systemd. There are ongoing efforts to port
glib and qt to kdbus natively. The API is pretty simple and I don't
see how a libkdbus would simplify things. In fact, even our tests only
have slim wrappers around the ioctls to simplify error-handling in
test-scenarios.

Note that most of the toolkit work is on the marshaling level, which
is independent of kdbus. kdbus just provides the transport level. DBus
is just one, yet significant, application-layer on top of kdbus. Our
test-cases use kdbus exclusively to transport raw byte streams.

Thanks
David

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Josh Boyer @ 2015-01-20 14:42 UTC (permalink / raw)
  To: David Herrmann
  Cc: Michael Kerrisk (man-pages), Greg Kroah-Hartman, Arnd Bergmann,
	Eric W. Biederman, One Thousand Gnomes, Tom Gundersen,
	Jiri Kosina, Andy Lutomirski, Linux API, linux-kernel,
	Djalal Harouni, Daniel Mack, Johannes Stezenbach
In-Reply-To: <CANq1E4SjfZOKqTsdkt519vKc1Poeah5McVJBb_spdHjbKv4=7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Jan 20, 2015 at 9:31 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi Michael
>
> On Tue, Jan 20, 2015 at 2:53 PM, Michael Kerrisk (man-pages)
> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
>>> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>>>
>>> kdbus is a system for low-latency, low-overhead, easy to use
>>> interprocess communication (IPC).
>>>
>>> The interface to all functions in this driver is implemented via ioctls
>>> on files exposed through a filesystem called 'kdbusfs'. The default
>>> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
>>> documentation about the kernel level API design.
>>
>> I have some details feedback on the contents of this file, and some
>> bigger questions. I'll split them out into separate mails.
>>
>> So here, the bigger, general questions to start with. I've arrived late
>> to this, so sorry if they've already been discussed, but the answers to
>> some of the questions should actually be in this file, I would have
>> expected.
>>
>> This is an enormous and complex API. Why is the API ioctl() based,
>> rather than system-call-based? Have we learned nothing from the hydra
>> that the futex() multiplexing syscall became? (And kdbus is an order
>> of magnitude more complex, by the look of things.) At the very least,
>> a *good* justification of why the API is ioctl()-based should be part
>> of this documentation file.
>>
>> An observation: The documentation below is substantial, but this API is
>> enormous, so the documentation still feels rather thin. What would
>> really help would be some example code in the doc.
>>
>> And on the subject of code examples... Are there any (prototype)
>> working user-space applications that exercise the current kdbus
>> implementation? That is, can I install these kdbus patches, and
>> then find a simple example application somewhere that does
>> something to exercise kdbus?
>
> If you run a 3.18 kernel, you can install kdbus.ko from our repository
> and boot a full Fedora system running Gnome3 with kdbus, given that
> you compiled systemd with --enable-kdbus (which is still
> experimental). No legacy dbus1 daemon is running. Instead, we have a
> bus-proxy that converts classic dbus1 to kdbus, so all
> bus-communication runs on kdbus.

FWIW, we've been building a "playground" repository for the kernel
that contains this already for Fedora.  If you have a stock Fedora 21
or rawhide install, you can use:

https://copr.fedoraproject.org/coprs/jwboyer/kernel-playground/

which has the kernel+kdbus and systemd built with --enable-kdbus
already.  Easy enough to throw in a VM for testing.

josh

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Djalal Harouni @ 2015-01-20 14:53 UTC (permalink / raw)
  To: Josh Boyer
  Cc: David Herrmann, Michael Kerrisk (man-pages), Greg Kroah-Hartman,
	Arnd Bergmann, Eric W. Biederman, One Thousand Gnomes,
	Tom Gundersen, Jiri Kosina, Andy Lutomirski, Linux API,
	linux-kernel, Daniel Mack, Johannes Stezenbach
In-Reply-To: <CA+5PVA46wVa-L9K1zgHzt=Fwn6YFBYhxJUjiozSH4XBY03WMVw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi,

On Tue, Jan 20, 2015 at 09:42:59AM -0500, Josh Boyer wrote:
> On Tue, Jan 20, 2015 at 9:31 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > Hi Michael
> >
> > On Tue, Jan 20, 2015 at 2:53 PM, Michael Kerrisk (man-pages)
> > <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> >>> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> >>>
> >>> kdbus is a system for low-latency, low-overhead, easy to use
> >>> interprocess communication (IPC).
> >>>
> >>> The interface to all functions in this driver is implemented via ioctls
> >>> on files exposed through a filesystem called 'kdbusfs'. The default
> >>> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
> >>> documentation about the kernel level API design.
> >>
> >> I have some details feedback on the contents of this file, and some
> >> bigger questions. I'll split them out into separate mails.
> >>
> >> So here, the bigger, general questions to start with. I've arrived late
> >> to this, so sorry if they've already been discussed, but the answers to
> >> some of the questions should actually be in this file, I would have
> >> expected.
> >>
> >> This is an enormous and complex API. Why is the API ioctl() based,
> >> rather than system-call-based? Have we learned nothing from the hydra
> >> that the futex() multiplexing syscall became? (And kdbus is an order
> >> of magnitude more complex, by the look of things.) At the very least,
> >> a *good* justification of why the API is ioctl()-based should be part
> >> of this documentation file.
> >>
> >> An observation: The documentation below is substantial, but this API is
> >> enormous, so the documentation still feels rather thin. What would
> >> really help would be some example code in the doc.
> >>
> >> And on the subject of code examples... Are there any (prototype)
> >> working user-space applications that exercise the current kdbus
> >> implementation? That is, can I install these kdbus patches, and
> >> then find a simple example application somewhere that does
> >> something to exercise kdbus?
> >
> > If you run a 3.18 kernel, you can install kdbus.ko from our repository
> > and boot a full Fedora system running Gnome3 with kdbus, given that
> > you compiled systemd with --enable-kdbus (which is still
> > experimental). No legacy dbus1 daemon is running. Instead, we have a
> > bus-proxy that converts classic dbus1 to kdbus, so all
> > bus-communication runs on kdbus.
> 
> FWIW, we've been building a "playground" repository for the kernel
> that contains this already for Fedora.  If you have a stock Fedora 21
> or rawhide install, you can use:
> 
> https://copr.fedoraproject.org/coprs/jwboyer/kernel-playground/
> 
> which has the kernel+kdbus and systemd built with --enable-kdbus
> already.  Easy enough to throw in a VM for testing.
> 
> josh
Yes thanks josh!

Another addition, if kdbus is installed and loaded, you could also use
systemd-nspawn to boot a full system (systemd compiled with
--enable-kdbus) in a container [1], kdbusfs will be mounted in the
container.

There is also the busctl tool to query kdbus...

http://www.freedesktop.org/wiki/Software/systemd/VirtualizedTesting/

-- 
Djalal Harouni
http://opendz.org

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Johannes Stezenbach @ 2015-01-20 16:08 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Josh Boyer, David Herrmann, Michael Kerrisk (man-pages),
	Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
	Linux API, linux-kernel, Daniel Mack
In-Reply-To: <20150120145353.GA8124@dztty>

Hi all,

On Tue, Jan 20, 2015 at 03:53:53PM +0100, Djalal Harouni wrote:
> On Tue, Jan 20, 2015 at 09:42:59AM -0500, Josh Boyer wrote:
> > On Tue, Jan 20, 2015 at 9:31 AM, David Herrmann <dh.herrmann@gmail.com> wrote:
> > >
> > > If you run a 3.18 kernel, you can install kdbus.ko from our repository
> > > and boot a full Fedora system running Gnome3 with kdbus, given that
> > > you compiled systemd with --enable-kdbus (which is still
> > > experimental). No legacy dbus1 daemon is running. Instead, we have a
> > > bus-proxy that converts classic dbus1 to kdbus, so all
> > > bus-communication runs on kdbus.
> > 
> > FWIW, we've been building a "playground" repository for the kernel
> > that contains this already for Fedora.  If you have a stock Fedora 21
> > or rawhide install, you can use:
> > 
> > https://copr.fedoraproject.org/coprs/jwboyer/kernel-playground/
> > 
> > which has the kernel+kdbus and systemd built with --enable-kdbus
> > already.  Easy enough to throw in a VM for testing.
> 
> Another addition, if kdbus is installed and loaded, you could also use
> systemd-nspawn to boot a full system (systemd compiled with
> --enable-kdbus) in a container [1], kdbusfs will be mounted in the
> container.
> 
> There is also the busctl tool to query kdbus...
> 
> http://www.freedesktop.org/wiki/Software/systemd/VirtualizedTesting/

It is reassuring that kdbus actually works :)

However, let me repeat and rephrase my previous questions:
Is there a noticable or measurable improvement from using kdbus?
IOW, is the added complexity of kdbus worth the result?

I have stated my believe that current usage of D-Bus is not
performance sensitive and the number of messages exchanged
is low.  I would love it if you would prove me wrong.
Or if you could show that any D-Bus related bug in Gnome3
is fixed by kdbus.

I would sooo love it if someone would finally post some
data that proves kdbus is useful beyond systemd.


Thanks,
Johannes

^ permalink raw reply

* [PATCH 01/05] fixup! virtio_pci: modern driver
From: Michael S. Tsirkin @ 2015-01-20 16:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-api, virtualization
In-Reply-To: <1421771093-1589-1-git-send-email-mst@redhat.com>

virtio_pci_modern: fix up vendor capability

Gerd Hoffmann noticed that we implemented
capability layout from an old draft.
Unfortunately the code was copied to host as well,
so we didn't notice.

Luckily we caught this in time.

This fixes commit "virtio_pci: modern driver"
and should be smashed with it.

Reported-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/virtio_pci.h    |  7 ++++---
 drivers/virtio/virtio_pci_modern.c | 23 +++++++++--------------
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 4e05423..a2b2e13 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -117,10 +117,11 @@ struct virtio_pci_cap {
 	__u8 cap_vndr;		/* Generic PCI field: PCI_CAP_ID_VNDR */
 	__u8 cap_next;		/* Generic PCI field: next ptr. */
 	__u8 cap_len;		/* Generic PCI field: capability length */
-	__u8 type_and_bar;	/* Upper 3 bits: bar.
-				 * Lower 3 is VIRTIO_PCI_CAP_*_CFG. */
+	__u8 cfg_type;		/* Identifies the structure. */
+	__u8 bar;		/* Where to find it. */
+	__u8 padding[3];	/* Pad to full dword. */
 	__le32 offset;		/* Offset within bar. */
-	__le32 length;		/* Length. */
+	__le32 length;		/* Length of the structure, in bytes. */
 };
 
 #define VIRTIO_PCI_CAP_BAR_SHIFT	5
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index e2f41c9..a3d8101 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -26,13 +26,13 @@ static void __iomem *map_capability(struct pci_dev *dev, int off,
 				    u32 start, u32 size,
 				    size_t *len)
 {
-	u8 type_and_bar, bar;
+	u8 bar;
 	u32 offset, length;
 	void __iomem *p;
 
 	pci_read_config_byte(dev, off + offsetof(struct virtio_pci_cap,
-						 type_and_bar),
-			     &type_and_bar);
+						 bar),
+			     &bar);
 	pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap, offset),
 			     &offset);
 	pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap, length),
@@ -76,9 +76,6 @@ static void __iomem *map_capability(struct pci_dev *dev, int off,
 	if (len)
 		*len = length;
 
-	bar = (type_and_bar >> VIRTIO_PCI_CAP_BAR_SHIFT) &
-		VIRTIO_PCI_CAP_BAR_MASK;
-
 	if (minlen + offset < minlen ||
 	    minlen + offset > pci_resource_len(dev, bar)) {
 		dev_err(&dev->dev,
@@ -438,15 +435,13 @@ static inline int virtio_pci_find_capability(struct pci_dev *dev, u8 cfg_type,
 	for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
 	     pos > 0;
 	     pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
-		u8 type_and_bar, type, bar;
+		u8 type, bar;
 		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
-							 type_and_bar),
-				     &type_and_bar);
-
-		type = (type_and_bar >> VIRTIO_PCI_CAP_TYPE_SHIFT) &
-			VIRTIO_PCI_CAP_TYPE_MASK;
-		bar = (type_and_bar >> VIRTIO_PCI_CAP_BAR_SHIFT) &
-			VIRTIO_PCI_CAP_BAR_MASK;
+							 cfg_type),
+				     &type);
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+							 bar),
+				     &bar);
 
 		/* Ignore structures with reserved BAR values */
 		if (bar > 0x5)
-- 
MST

^ permalink raw reply related

* [PATCH 03/05] fixup! virtio_pci: macros for PCI layout offsets
From: Michael S. Tsirkin @ 2015-01-20 16:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-api, virtualization
In-Reply-To: <1421771093-1589-1-git-send-email-mst@redhat.com>

virtio_pci_modern: fix up vendor capability macros

Gerd Hoffmann noticed that we implemented
capability layout from an old draft.
Unfortunately the code was copied to host as well,
so we didn't notice.

Luckily we caught this in time.

This fixes commit "virtio_pci: macros for PCI layout offsets"
and should be smashed with it.

Reported-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/virtio_pci.h    | 14 +++++---------
 drivers/virtio/virtio_pci_modern.c |  6 ++++--
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 0911c62..3b7e4d2 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -124,11 +124,6 @@ struct virtio_pci_cap {
 	__le32 length;		/* Length of the structure, in bytes. */
 };
 
-#define VIRTIO_PCI_CAP_BAR_SHIFT	5
-#define VIRTIO_PCI_CAP_BAR_MASK		0x7
-#define VIRTIO_PCI_CAP_TYPE_SHIFT	0
-#define VIRTIO_PCI_CAP_TYPE_MASK	0x7
-
 struct virtio_pci_notify_cap {
 	struct virtio_pci_cap cap;
 	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
@@ -164,11 +159,12 @@ struct virtio_pci_common_cfg {
 #define VIRTIO_PCI_CAP_VNDR		0
 #define VIRTIO_PCI_CAP_NEXT		1
 #define VIRTIO_PCI_CAP_LEN		2
-#define VIRTIO_PCI_CAP_TYPE_AND_BAR	3
-#define VIRTIO_PCI_CAP_OFFSET		4
-#define VIRTIO_PCI_CAP_LENGTH		8
+#define VIRTIO_PCI_CAP_CFG_TYPE		3
+#define VIRTIO_PCI_CAP_BAR		4
+#define VIRTIO_PCI_CAP_OFFSET		8
+#define VIRTIO_PCI_CAP_LENGTH		12
 
-#define VIRTIO_PCI_NOTIFY_CAP_MULT	12
+#define VIRTIO_PCI_NOTIFY_CAP_MULT	16
 
 #define VIRTIO_PCI_COMMON_DFSELECT	0
 #define VIRTIO_PCI_COMMON_DF		4
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index c86594e..b2e707ad 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -474,8 +474,10 @@ static inline void check_offsets(void)
 		     offsetof(struct virtio_pci_cap, cap_next));
 	BUILD_BUG_ON(VIRTIO_PCI_CAP_LEN !=
 		     offsetof(struct virtio_pci_cap, cap_len));
-	BUILD_BUG_ON(VIRTIO_PCI_CAP_TYPE_AND_BAR !=
-		     offsetof(struct virtio_pci_cap, type_and_bar));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_CFG_TYPE !=
+		     offsetof(struct virtio_pci_cap, cfg_type));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_BAR !=
+		     offsetof(struct virtio_pci_cap, bar));
 	BUILD_BUG_ON(VIRTIO_PCI_CAP_OFFSET !=
 		     offsetof(struct virtio_pci_cap, offset));
 	BUILD_BUG_ON(VIRTIO_PCI_CAP_LENGTH !=
-- 
MST

^ permalink raw reply related

* Re: [PATCH v5 0/5] power: Add max77693 charger driver
From: Sebastian Reichel @ 2015-01-20 16:33 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Dmitry Eremin-Solenikov, David Woodhouse, Samuel Ortiz, Lee Jones,
	linux-kernel, linux-pm, devicetree, linux-api, Kyungmin Park,
	Bartlomiej Zolnierkiewicz
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski@samsung.com>

[-- Attachment #1: Type: text/plain, Size: 406 bytes --]

Hi,

On Tue, Jan 20, 2015 at 11:00:51AM +0100, Krzysztof Kozlowski wrote:
> You already acked some earlier version of this patch. In that time
> there were more dependencies on MFD tree. Now the MFD patch (2/5)
> is acked by Lee Jones so could you pick up everything?

Thanks for the patchset, queued:

http://git.infradead.org/battery-2.6.git/commit/f8f847b51a0e1cb0c96e706d5da0ac507d96c605

-- Sebastian

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: David Herrmann @ 2015-01-20 17:00 UTC (permalink / raw)
  To: Johannes Stezenbach
  Cc: Djalal Harouni, Josh Boyer, Michael Kerrisk (man-pages),
	Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
	Linux API, linux-kernel, Daniel Mack
In-Reply-To: <20150120160842.GA9474-FF7aIK3TAVNeoWH0uzbU5w@public.gmane.org>

Hi

On Tue, Jan 20, 2015 at 5:08 PM, Johannes Stezenbach <js-FF7aIK3TAVNeoWH0uzbU5w@public.gmane.org> wrote:
> However, let me repeat and rephrase my previous questions:
> Is there a noticable or measurable improvement from using kdbus?
> IOW, is the added complexity of kdbus worth the result?
>
> I have stated my believe that current usage of D-Bus is not
> performance sensitive and the number of messages exchanged
> is low.  I would love it if you would prove me wrong.
> Or if you could show that any D-Bus related bug in Gnome3
> is fixed by kdbus.

DBus is not used for performance sensitive applications because DBus
is slow. We want to make it fast so we can finally use it for
low-latency, high-throughput applications. A simple DBus
method-call+reply takes 200us here, with kdbus it takes 8us (with UDS
about 2us). If I increase the packet size from 8k to 128k, kdbus even
tops UDS thanks to single-copy transfers.
The fact that there is no performance-critical application using DBus
is, imho, an argument *pro* kdbus. People haven't been capable of
making classic dbus1 fast enough for low-latency audio, thus, we
present kdbus.

Starting up 'gdm' sends ~5k dbus messages on my machine. It takes >1s
to transmit the messages alone. Each dbus1 message has to be copied 4
times for each direction. With kdbus, each message is copied only once
for each transmission (or not at all, if you use memfds, though that
doesn't mean it's necessarily faster). No intermediate context-switch
is needed. This makes kdbus capable to transmit low-latency audio data
*inline*.

DBus marshaling is the de-facto standard in all major(!) linux desktop
systems. It is well established and accepted by many DEs. It also
solves many other problems, including: policy,
authentication/authorization, well-known name registry, efficient
broadcasts/multicasts, peer discovery, bus discovery, metadata
transmission, and more.
It is a shame that we cannot use this well-established protocol for
low-latency applications. We, effectively, have to duplicate all this
code on custom UDS and other transports just because DBus is too slow.

kdbus tries to unify those efforts, so that we don't need multiple
policy implementations, name registries and peer discovery mechanisms.
Furthermore, kdbus implements comprehensive, yet optional, metadata
transmission that allows to identify and authenticate peers in a
race-free manner (which is *not* possible with UDS).
Also, kdbus provides a single transport bus with sequential message
numbering. If you use multiple channels, you cannot give any ordering
guarantees across peers (for instance, regarding parallel
name-registry changes).


Given these theoretical advantages, here're some examples:

*) The Tizen developers have been complaining about the high latency
of DBus for polkit'ish policy queries. That's why their authentication
framework uses custom UDS sockets (called 'Cynara'). If a
UI-interaction needs multiple authentication-queries, you don't want
it to take multiple milliseconds, given that you usually want to
render the result in the same frame.

*) PulseAudio doesn't use DBus for data transmission. They had to
implement their own marshaling code, transport layer and so on, just
because DBus1-latency is horrible. With kdbus, we can basically drop
this code-duplication and unify the IPC layer. Same is true for
Wayland, btw.

*) By moving broadcast-transmission into the kernel, we can use the
time-slices of the sender to perform heavy operations. This is also
true for policy decisions, etc. With a userspace daemon, we cannot
perform operations in a time-slice of the caller. This makes DoS
attacks much harder.

*) With priority-inheritance, we can do synchronous calls into trusted
peers and let them optionally use our time-slice to perform the
action. This allows syscall-like/binder-like method-calls into other
processes. Without priority-inheritance, this is not possible in a
secure manner (see 'priority-inheritance').

*) Logging-daemons often want to attach metadata to log-messages so
debugging/filtering gets easier. If short-lived programs send
log-messages, the destination peer might not be able to read such
metadata from /proc, as the process might no longer be available at
that time. Same is true for policy-decisions like polkit does. You
cannot send off method-calls and exit. You have to wait for a reply,
even though you might not even care for it. If you don't wait, the
other side might not be able to verify your identity and as such
reject the request.

*) Even though the dbus traffic on idle-systems might be low, this
doesn't mean it's not significant at boot-times or under high-load. If
you run a dbus-monitor of your choice, you will see there is an
significant number of messages exchanged during VT-switches, startup,
shutdown, suspend, wakeup, hotplugging and similar situations where
lots of control-messages are exchanged. We don't want to spend
hundreds of ms just to transmit those messages.

*) dbus-daemon is not available during early-boot or shutdown.


These are just examples off the top of my head, but I think they're
already pretty convincing.
David

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Daniel Mack @ 2015-01-20 17:50 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Greg Kroah-Hartman, arnd, ebiederm,
	gnomes, teg, jkosina, luto, linux-api, linux-kernel
  Cc: daniel, dh.herrmann, tixxdz
In-Reply-To: <54BE5F1F.8070703@gmail.com>

Hi Michael,

Thanks a lot for for intense review of the documentation. Much appreciated.

I've addressed all but the below issues, following your suggestions.


On 01/20/2015 02:58 PM, Michael Kerrisk (man-pages) wrote:
>> +    and KDBUS_CMD_ENDPOINT_MAKE (see above).
>> +
>> +    Following items are expected for KDBUS_CMD_BUS_MAKE:
>> +    KDBUS_ITEM_MAKE_NAME
>> +      Contains a string to identify the bus name.
> 
> So, up to here, I've seen no definition of 'kdbus_item', which leaves me 
> asking questions like: what subfield is KDBUS_ITEM_MAKE_NAME stored in?
> which subfield holds the pointer to the string?
> 
> Somewhere earlier,  kdbus_item needs to be exaplained in more detail, 
> I think.

Hmm, you're quoting text from section 5, and section 4 actually
describes the concept of items quite well I believe?

>> +  __s64 priority;
>> +    With KDBUS_RECV_USE_PRIORITY set in flags, receive the next message in
>> +    the queue with at least the given priority. If no such message is waiting
>> +    in the queue, -ENOMSG is returned.
> 
> ###
> How do I simply select the highest priority message, without knowing what 
> its priority is?

The wording is indeed unclear here. KDBUS_RECV_USE_PRIORITY causes the
messages to be dequeued by their priority. The 'priority' field is
simply a filter that request a minimum priority. By setting this field
to the highest possible value, you effectively bypass the filter. I've
added a better description now.

>> +  -ENOMEM	The kernel memory is exhausted
>> +  -ENOTTY	Illegal ioctl command issued for the file descriptor
> 
> Why ENOTTY here, rather than EINVAL? The latter is, I beleive, the usual 
> ioctl() error for invalid commands, I believe (If you keep ENOTTY, add an
> explanation in this document.)

Hmm, no. -ENOTTY is commonly used as return code when calling ioctls
that can't be handled by the FDs they're called on. 'man errno(3)' even
states: "ENOTTY   Inappropriate I/O control operation (POSIX.1)".

>> +  -EINVAL	Unsupported item attached to command
>> +
>> +For all ioctls that carry a struct as payload:
>> +
>> +  -EFAULT	The supplied data pointer was not 64-bit aligned, or was
>> +		inaccessible from the kernel side.
>> +  -EINVAL	The size inside the supplied struct was smaller than expected
>> +  -EMSGSIZE	The size inside the supplied struct was bigger than expected
> 
> Why two different errors for smaller and larger than expected? (If you keep things this
> way, pelase explain the reason in this document.)

Providing a struct that is smaller than the minimum doesn't give the
ioctl a valid set of information to process the request. Hence, I think
-EINVAL is appropriate. However, -EMSGSIZE is something that users might
hit when they make message payloads too big, and I think it's good to
have a change to distinguish the two cases in error handling. I added
something in the document now.


Again, thanks a lot for reading the documentation so accurately!

Daniel

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Daniel Mack @ 2015-01-20 18:23 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: daniel-cYrQPVfZooxQFI55V6+gNQ, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A, Johannes Stezenbach
In-Reply-To: <54BE5DC8.70706-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On 01/20/2015 02:53 PM, Michael Kerrisk (man-pages) wrote:
> This is an enormous and complex API. Why is the API ioctl() based,
> rather than system-call-based? Have we learned nothing from the hydra
> that the futex() multiplexing syscall became? (And kdbus is an order
> of magnitude more complex, by the look of things.) At the very least,
> a *good* justification of why the API is ioctl()-based should be part
> of this documentation file.

I think the simplest reason is because we want to be able to build kdbus
as a module. It's rather an optional driver than a core kernel feature.
IMO, kernel primitives should be syscalls, but kdbus is not a primitive
but an elaborate subsystem.

Also, the context the kdbus commands operate on originate from a
mountable special-purpose file system. Hence, we decided not to use a
global kernel interface but specific ioctls on the nodes exposed by kdbusfs.


Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH v5 5/5] tty/serial: Add Spreadtrum sc9836-uart driver support
From: Rob Herring @ 2015-01-20 20:17 UTC (permalink / raw)
  To: Lyra Zhang
  Cc: Chunyan Zhang, Greg Kroah-Hartman, Mark Rutland, Arnd Bergmann,
	One Thousand Gnomes, Mark Brown, Rob Herring, Pawel Moll,
	Ian Campbell, Kumar Gala, Will Deacon, Catalin Marinas,
	Jiri Slaby, Jason Cooper, Heiko Stübner, Florian Vaussard,
	Andrew Lunn, Robert Richter, Hayato Suzuki, Grant Likely,
	Antony Pavlov, Joel Schopp
In-Reply-To: <CAAfSe-tAwURc_P+-0+m22ao9r+Fud6Ae89JF8FGsWgg_49Mdhg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Jan 20, 2015 at 1:37 AM, Lyra Zhang <zhang.lyra-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi, Rob
>
> I still have a question to be conform, specific describes below:
>
> On Mon, Jan 19, 2015 at 10:11 PM, Rob Herring <robherring2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Mon, Jan 19, 2015 at 3:55 AM, Lyra Zhang <zhang.lyra-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> On Sat, Jan 17, 2015 at 12:41 AM, Rob Herring <robherring2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> On Fri, Jan 16, 2015 at 4:00 AM, Chunyan Zhang
>>>> <chunyan.zhang-lxIno14LUO0EEoCn2XhGlw@public.gmane.org> wrote:
>>>>> Add a full sc9836-uart driver for SC9836 SoC which is based on the
>>>>> spreadtrum sharkl64 platform.
>>>>> This driver also support earlycon.
>>>>> This patch also replaced the spaces between the macros and their
>>>>> values with the tabs in serial_core.h
>>
>> [...]
>>
>>>>> +static int __init sprd_serial_init(void)
>>>>> +{
>>>>> +       int ret = 0;
>>>>> +
>>>>> +       ret = uart_register_driver(&sprd_uart_driver);
>>>>
>>>> This can be done in probe now. Then you can use module_platform_driver().
>>>>
>>>
>>> Question:
>>> 1. there are 4 uart ports configured in dt for sprd_serial, so probe
>>> will be called 4 times, but uart_register_driver only needs to be
>>> called one time, so can we use uart_driver.state to check if
>>> uart_register_driver has already been called ?
>>
>> Yes.
>>
>>> 2. if I use module_platform_driver() instead of
>>> module_init(sprd_serial_init)  and  module_exit(sprd_serial_exit) , I
>>> must move uart_unregister_driver() which is now processed in
>>> sprd_serial_exit() to sprd_remove(), there is a similar problem with
>>> probe(), sprd_remove() will also be called 4 times, and actually it
>>> should be called only one time. How can we deal with this case?
>>
>> Look at pl01x or Samsung UART drivers which have done this conversion.
>
> Samsung UART does use module_platform_driver, but pl010/pl011 doesn't.

That is because pl011 is an amba_device rather than a platform_device
and there is not an equivalent macro for this boilerplate (although I
think one has been posted recently).

> In the Samsung UART driver, uart_unregister_driver is processed in
> remove(), like below:
>
> static int s3c24xx_serial_remove(struct platform_device *dev)
> {
>     struct uart_port *port = s3c24xx_dev_to_port(&dev->dev);
>
>     if (port) {
>         s3c24xx_serial_cpufreq_deregister(to_ourport(port));
>         uart_remove_one_port(&s3c24xx_uart_drv, port);
>     }
>
>     uart_unregister_driver(&s3c24xx_uart_drv);
> }
>
> if this serial has more than one ports, uart_unregister_driver() must
> be called multiple times when the device need to be removed.
> I think there may be a problem because that uart_unregister_driver()
> will do kfree(drv->state) every time when it's called.

Yes. Looks like a bug still in the Samsung driver. So follow what the
pl01x driver does. It had the same problem, but there is a follow-up
commit to fix it.

Rob

^ permalink raw reply

* Re: Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
From: Alexei Starovoitov @ 2015-01-20 20:33 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, Linux API,
	Network Development, LKML, yrl.pp-manager.tt@hitachi.com,
	Jovi Zhangwei

On Tue, Jan 20, 2015 at 3:57 AM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
>
> Ok, BTW, would you think is it possible to use a reusable small scratchpad
> memory for passing arguments? (just a thought)

sure. doable, but what's the use case?

>> It's not usable for high frequency events which
>> need this in-kernel aggregation.
>> If events are rare, then just dumping everything
>> into trace buffer is just fine. No in-kernel program is needed.
>
> Hmm, let me ensure your point, the performance number is the reason why
> we need to do it in the kernel, right? Not mainly for the flexibility but speed.

if user space can do X at the same speed as kernel,
then user space is a better choice and more flexible.
In case of bpf programs two things user space cannot do:
- fast aggregation without adding penalty to things being traced
- access to in-kernel data structures
And often both used together.
Say, we want to monitor amount of network traffic per user.
So we'd use trace_net_dev_xmit() tracepoint and do
map[current_uid()] += skb_len
as part of the program.
Overhead will be tiny and users won't notice any slowdown.
Trying to do the same in user space by enabling
this tracepoint has two problems:
high overhead and events are hard to aggregate
per user, since trace has 'pid', but short lived
processes will have dead pids in trace output.

> - perf probe and kprobe-event gives us a complete understandable
>  interface for what will be recorded at where.
>  (we can see the event definitions via kprobe_events interface,
>   without any tools)
> - kprobe-event gives a completely same interface as other tracepoint
>   events.
> - it also doesn't require any build-binary parts :) nor special tools.
>   We can play with ftrace on just a small busybox.

yeah, when debugging in busybox is the goal
and 'cat' and 'echo' are your only tools, then
debugfs interface is the only choice :)

> However, this does NOT interfere your patch upstreaming. I just said current
> ftrace method is also meaningful for some reasons :)

of course :)
To emphasize the point I was trying to make with tracex1:
The program is a filter/aggregator. The bpf maps
are not suitable for streaming the events. That's the job
of ring buffer/trace_pipe. The program may choose
to aggregate some events and discard them (by
returning 0 from the program), and the rest of
the events will be streamed to user space via
ring buffer in the format statically defined by tracepoint
or by kprobe arguments.
The tracex1 example loads the program and then
reads /sys/kernel/debugfs/tracing/trace_pipe...

That part I was trying to improve with bpf_trace_printk:
to give ability to programs to stream data in a format
different from the one statically defined by tracepoints.
But trace_printk has its disadvantages, so probably
something cleaner is needed.
Like in my earlier example of trace_net_dev_xmit,
if the program could add printing of uid to arguments
already printed, it would have helped user space.

> By the way, I concern about that bpf compiler can become another systemtap,
> especially if you build it on llvm.
> Would you plan to develop it on kernel
> tree? or apart from the kernel-side development?

I'm not sure I completely understand the concern.
perf is using a bunch of out-of-tree libraries.
mcjit of llvm or libgccjit are another libraries.
Or may be eventually eBPF can be generated
by something like libpcap.

Ideally I would like to see 'perf run script.txt'
where script.txt is a program in a language suited
for tracing. The tracing language not necessary
will fit networking use cases. Currently I'm
using C for both and it's the most convenient,
but some folks complained that 'restricted'
nature of this C is hard to grasp, so I can only
encourage Jovi to do ktap language to bpf
translator. If it generates bpf directly that's great,
if it uses gcc or llvm backend that's fine too.

> I think it is hard to sync the development if you do it out-of-tree.

I think some pieces would have to be out of tree.
I've kept standalone llvm backend across 3.2, 3.3 and 3.4
but it gets polluted with ifdefs and not really a long term
solution, so now I'm working on upstreaming it
and feedback/codereviews I got, definitely improved
the quality of the bpf backend.
In case of backends the only bit to sync is instruction
set itself, which is stable. New instructions may be
added, but that's not a concern.
llvm backend doesn't care what language is
used in front-end or how programs are attached
to tracepoints or what set of bpf helper
functions is available.
All such bits and the main interface for
dynamic tracer, imo, should be in perf binary.
What it does underneath and how
many times it calls into llvm/gcc lib, won't be visible.
In case of systemtap compile time, for whatever
reason, is slow to the point of being annoying,
but here it should be instant.

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Johannes Stezenbach @ 2015-01-20 22:00 UTC (permalink / raw)
  To: David Herrmann
  Cc: Djalal Harouni, Josh Boyer, Michael Kerrisk (man-pages),
	Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
	Linux API, linux-kernel, Daniel Mack
In-Reply-To: <CANq1E4Q9m7JRXNYFY4okh3brM-kHVoVGOdF_SWjNoNURX930HQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi David,

On Tue, Jan 20, 2015 at 06:00:28PM +0100, David Herrmann wrote:
[big snip]
> These are just examples off the top of my head, but I think they're
> already pretty convincing.

Thank you for writing this up.  This is the information I was
looking for which puts kdbus into context and explains
the motivation for its development.  Naturally I don't agree
with all of it, but I'm content with what I learned so far.

Daniel informed me off-list that he (and probably others) does
not understand what my questions were aiming at.  I'm sorry
about that, I thought it was clear I was just lacking
the background information to understand what kdbus is and
what it is not, and why it exists -- information I couldn't find
in some hours of googling.


Thanks,
Johannes

^ permalink raw reply

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Andy Lutomirski @ 2015-01-20 22:40 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra <pe>
In-Reply-To: <1421747878-30744-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> This adds a new system call, epoll_mod_wait. It's described as below:
>
> NAME
>        epoll_mod_wait - modify and wait for I/O events on an epoll file
>                         descriptor
>
> SYNOPSIS
>
>        int epoll_mod_wait(int epfd, int flags,
>                           int ncmds, struct epoll_mod_cmd *cmds,
>                           struct epoll_wait_spec *spec);
>
> DESCRIPTION
>
>        The epoll_mod_wait() system call can be seen as an enhanced combination
>        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
>        call. It is superior in two cases:
>
>        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
>        will save context switches between user mode and kernel mode;
>
>        2) When you need higher precision than microsecond for wait timeout.
>
>        The epoll_ctl(2) operations are embedded into this call by with ncmds
>        and cmds. The latter is an array of command structs:
>
>            struct epoll_mod_cmd {
>
>                   /* Reserved flags for future extension, must be 0 for now. */
>                   int flags;
>
>                   /* The same as epoll_ctl() op parameter. */
>                   int op;
>
>                   /* The same as epoll_ctl() fd parameter. */
>                   int fd;
>
>                   /* The same as the "events" field in struct epoll_event. */
>                   uint32_t events;
>
>                   /* The same as the "data" field in struct epoll_event. */
>                   uint64_t data;
>
>                   /* Output field, will be set to the return code once this
>                    * command is executed by kernel */
>                   int error;
>            };

I would add an extra u32 at the end so that the structure size will be
a multiple of 8 bytes on all platforms.

>
>        There is no guartantee that all the commands are executed in order. Only
>        if all the commands are successfully executed (all the error fields are
>        set to 0), events are polled.

If this doesn't happen, what error is returned?

>            struct epoll_wait_spec {
>
>                   /* The same as "maxevents" in epoll_pwait() */
>                   int maxevents;
>
>                   /* The same as "events" in epoll_pwait() */
>                   struct epoll_event *events;
>
>                   /* Which clock to use for timeout */
>                   int clockid;
>
>                   /* Maximum time to wait if there is no event */
>                   struct timespec timeout;
>
>                   /* The same as "sigmask" in epoll_pwait() */
>                   sigset_t *sigmask;
>
>                   /* The same as "sigsetsize" in epoll_pwait() */
>                   size_t sigsetsize;
>            } EPOLL_PACKED;

I think the convention is to align the structure's fields manually
rather than declaring it to be packed.

>
> RETURN VALUE
>
>        When any error occurs, epoll_mod_wait() returns -1 and errno is set
>        appropriately. All the "error" fields in cmds are unchanged before they
>        are executed, and if any cmds are executed, the "error" fields are set
>        to a return code accordingly. See also epoll_ctl for more details of the
>        return code.

Does this mean that callers should initialize the error fields to an
impossible value first so they can tell which commands were executed?

>
>        When successful, epoll_mod_wait() returns the number of file
>        descriptors ready for the requested I/O, or zero if no file descriptor
>        became ready during the requested timeout milliseconds.
>
>        If spec is NULL, it returns 0 if all the commands are successful, and -1
>        if an error occured.
>
> ERRORS
>
>        These errors apply on either the return value of epoll_mod_wait or error
>        status for each command, respectively.

Please clarify which errors are returned overall and which are per-command.

Thanks,
Andy

^ permalink raw reply

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: josh-iaAMLnmF4UmaiuxdJuQwMA @ 2015-01-20 23:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Fam Zheng, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers
In-Reply-To: <CALCETrU4TeG1ShVLkQgqQ6usFm8pg_t0D8K=Mi_UJGSfxUwXtA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Jan 20, 2015 at 02:40:32PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > This adds a new system call, epoll_mod_wait. It's described as below:
> >
> > NAME
> >        epoll_mod_wait - modify and wait for I/O events on an epoll file
> >                         descriptor
> >
> > SYNOPSIS
> >
> >        int epoll_mod_wait(int epfd, int flags,
> >                           int ncmds, struct epoll_mod_cmd *cmds,
> >                           struct epoll_wait_spec *spec);
> >
> > DESCRIPTION
> >
> >        The epoll_mod_wait() system call can be seen as an enhanced combination
> >        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> >        call. It is superior in two cases:
> >
> >        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> >        will save context switches between user mode and kernel mode;
> >
> >        2) When you need higher precision than microsecond for wait timeout.
> >
> >        The epoll_ctl(2) operations are embedded into this call by with ncmds
> >        and cmds. The latter is an array of command structs:
> >
> >            struct epoll_mod_cmd {
> >
> >                   /* Reserved flags for future extension, must be 0 for now. */
> >                   int flags;
> >
> >                   /* The same as epoll_ctl() op parameter. */
> >                   int op;
> >
> >                   /* The same as epoll_ctl() fd parameter. */
> >                   int fd;
> >
> >                   /* The same as the "events" field in struct epoll_event. */
> >                   uint32_t events;
> >
> >                   /* The same as the "data" field in struct epoll_event. */
> >                   uint64_t data;
> >
> >                   /* Output field, will be set to the return code once this
> >                    * command is executed by kernel */
> >                   int error;
> >            };
> 
> I would add an extra u32 at the end so that the structure size will be
> a multiple of 8 bytes on all platforms.

*shrug*, but if you do so, enforce that it has a value of 0 or return
-EINVAL, just like a flags field.  Alternatively, move the last field
earlier and make flags a uint64_t.

- Josh Triplett

^ permalink raw reply

* [PATCH v11 0/2] crypto: AF_ALG: add AEAD and RNG support
From: Stephan Mueller @ 2015-01-21  1:18 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto, linux-api, Neil Horman

Hi,

This patch set adds AEAD and RNG support to the AF_ALG interface
exported by the kernel crypto API. By extending AF_ALG with AEAD and RNG
support, all cipher types the kernel crypto API allows access to are
now accessible from userspace.

Both, AEAD and RNG implementations are stand-alone and do not depend
other AF_ALG interfaces (like hash or skcipher).

The AEAD implementation uses the same approach as provided with
skcipher by offering the following interfaces:

	* sendmsg and recvmsg interfaces allowing multiple
	  invocations supporting a threaded user space. To support
	  multi-threaded user space, kernel-side buffering
	  is implemented similarly to skcipher.

	* splice / vmsplice interfaces allowing a zero-copy
	  invocation

The RNG interface only implements the recvmsg interface as
zero-copy is not applicable.

The new AEAD and RNG interfaces are fully tested with the test application
provided at [1]. That test application exercises all newly added user space
interfaces. The testing covers:

	* use of the sendmsg/recvmsg interface

	* use of the splice / vmsplice interface

	* invocation of all AF_ALG types (aead, rng, skcipher, hash)

	* using all types of operation (encryption, decryption, keyed MD,
	  MD, random numbers, AEAD decryption with positive and negative
	  authentication verification)

	* stress testing by running all tests for 30 minutes in an
	  endless loop

	* test execution on 64 bit and 32 bit

[1] http://www.chronox.de/libkcapi.html

Changes v2:
* rebase to current cryptodev-2.6 tree
* use memzero_explicit to zeroize AEAD associated data
* use sizeof for determining length of AEAD associated data
* update algif_rng.c covering all suggestions from Daniel Borkmann
  <dborkman@redhat.com>
* addition of patch 9: add digestsize interface for hashes
* addition of patch to update documentation covering the userspace interface
* change numbers of getsockopt options: separate them from sendmsg interface
  definitions

Changes v3:
* remove getsockopt interface
* AEAD: associated data is set prepended to the plain/ciphertext
* AEAD: allowing arbitrary associated data lengths
* remove setkey patch as protection was already in the existing code

Changes v4:
* stand-alone implementation of AEAD
* testing of all interfaces offered by AEAD
* stress testing of AEAD and RNG

Changes v5:
* AEAD: add outer while(size) loop in aead_sendmsg to ensure all data is
  copied into the kernel (reporter Herbert Xu)
* AEAD: aead_sendmsg bug fix: change size -= len; to size -= plen;
* AF_ALG / AEAD: add aead_setauthsize and associated extension to
  struct af_alg_type as well as alg_setsockopt (reporter Herbert Xu)
* RNG: rng_recvmsg: use 128 byte stack variable for output of RNG instead
  of ctx->result (reporter Herbert Xu)
* RNG / AF_ALG: allow user space to seed RNG via setsockopt
* RNG: rng_recvmsg bug fix: use genlen as result variable for
  crypto_rng_get_bytes as previously no negative errors were obtained
* AF_ALG: alg_setop: zeroize buffer before free

Changes v6:
* AEAD/RNG: port to 3.19-rc1 with the iov_iter handling
* RNG: use the setkey interface to obtain the seed and drop the patch adding
  a separate reseeding interface
* extract the zeroization patch for alg_setkey into a stand-alone patch
  submission
* fix bug in aead_sufficient_data (reporter Herbert Xu)
* testing of all interfaces with test application provided with libkcapi version
  0.6.2

Changes v7:
* AEAD: aead_recvmsg: change error code from ENOMEM to EINVAL
* AEAD: drop aead_readable/aead_sufficient_data and only use ctx->more to decide
  whether the read side shall become active. This change requires that the
  patch for crypto_aead_decrypt ensuring that the ciphertext contains the
  authentication tag was added -- see https://lkml.org/lkml/2014/12/30/200.
  Otherwise, user space can trigger a kernel crash.
* RNG: patch dropped as it was applied
* AEAD: port Kconfig/Makefile patch forward to current code base

Changes v8:
* Removed check for aead_assoclen in aead_sendmsg
* Fix endless loop bug in aead_sendmsg (check for sgl->cur > ALG_MAX_PAGES in
  while condition removed -- this condition is checked within the loop already)
* Resurrect aead_sufficient_data and call it in aead_sendmsg, aead_sendpage to
  notify caller about wrong invocation
* Re-add aead_sufficient_data to aead_recvmsg to verify user input data before
  using them to ensure the kernel protects against malicious parameters
* Allow arbitrary size of AD (i.e. up to the maximum buffer size of
  ALG_MAX_PAGES)
* When aead_recvmsg receives an error from decryption, release all pages if the
  error is EBADMSG -- this error implies that a proper decryption was performed
  but the integrity of the message is lost. This error is considered to be a
  valid decryption result.
* Add test cases for sendmsg and splice interface to test large AD sizes (in
  case of sendmsg, use 65504 bytes AD and 32 bytes plaintext; in case of splice
  use 15 pages AD and 32 bytes in the 16th page for plaintext). See [1] for
  updated test case.

Changes v9:
* if socket is not writable during sendmsg/sendpage due to insufficient memory
  and a recvmsg operation is forced, inform userspace about truncated operation
  via MSG_TRUNC
* use -EMSGSIZE in case insufficient data was provided in sendmsg/sendpage
* release all buffers in case insufficient data was provided in sendmsg/sendpage
* bug fix in sendmsg: when a new page is allocated, reset sg->offset to 0 --
  the error is visible with the new tests in [1] when using the -d flag
  with the test application

Changes v10:
* initialize ctx->trunc in aead_accept_parent to zero
* fix one line with code formatting problems

Changes v11:
* return an error if user space sends too much data instead of waiting until
  reader side catches up with operation (suggested by Herbert Xu)
* remove now unneeded aead_wait_for_wmem service function
* remove now unneeded ctx->trunc and MSG_TRUNC error return

Stephan Mueller (2):
  crypto: AF_ALG: add AEAD support
  crypto: AF_ALG: enable AEAD interface compilation

 crypto/Kconfig      |   9 +
 crypto/Makefile     |   1 +
 crypto/algif_aead.c | 638 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 648 insertions(+)
 create mode 100644 crypto/algif_aead.c

-- 
2.1.0

^ permalink raw reply

* [PATCH v11 1/2] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2015-01-21  1:19 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Neil Horman
In-Reply-To: <1923793.K38mGRD6eo-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>

This patch adds the AEAD support for AF_ALG.

The implementation is based on algif_skcipher, but contains heavy
modifications to streamline the interface for AEAD uses.

To use AEAD, the user space consumer has to use the salg_type named
"aead".

The AEAD implementation includes some overhead to calculate the size of
the ciphertext, because the AEAD implementation of the kernel crypto API
makes implied assumption on the location of the authentication tag. When
performing an encryption, the tag will be added to the created
ciphertext (note, the tag is placed adjacent to the ciphertext). For
decryption, the caller must hand in the ciphertext with the tag appended
to the ciphertext. Therefore, the selection of the used memory
needs to add/subtract the tag size from the source/destination buffers
depending on the encryption type. The code is provided with comments
explaining when and how that operation is performed.

A fully working example using all aspects of AEAD is provided at
http://www.chronox.de/libkcapi.html

Signed-off-by: Stephan Mueller <smueller-T9tCv8IpfcWELgA04lAiVw@public.gmane.org>
---
 crypto/algif_aead.c | 638 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 638 insertions(+)
 create mode 100644 crypto/algif_aead.c

diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
new file mode 100644
index 0000000..5a667fb
--- /dev/null
+++ b/crypto/algif_aead.c
@@ -0,0 +1,638 @@
+/*
+ * algif_aead: User-space interface for AEAD algorithms
+ *
+ * Copyright (C) 2014, Stephan Mueller <smueller-T9tCv8IpfcWELgA04lAiVw@public.gmane.org>
+ *
+ * This file provides the user-space API for AEAD ciphers.
+ *
+ * This file is derived from algif_skcipher.c.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <crypto/scatterwalk.h>
+#include <crypto/if_alg.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <net/sock.h>
+
+struct aead_sg_list {
+	unsigned int cur;
+	struct scatterlist sg[ALG_MAX_PAGES];
+};
+
+struct aead_ctx {
+	struct aead_sg_list tsgl;
+	struct af_alg_sgl rsgl;
+
+	void *iv;
+
+	struct af_alg_completion completion;
+
+	unsigned long used;
+
+	unsigned int len;
+	bool more;
+	bool merge;
+	bool enc;
+
+	size_t aead_assoclen;
+	struct aead_request aead_req;
+};
+
+static inline int aead_sndbuf(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+
+	return max_t(int, max_t(int, sk->sk_sndbuf & PAGE_MASK, PAGE_SIZE) -
+			  ctx->used, 0);
+}
+
+static inline bool aead_writable(struct sock *sk)
+{
+	return PAGE_SIZE <= aead_sndbuf(sk);
+}
+
+static inline bool aead_sufficient_data(struct aead_ctx *ctx)
+{
+	unsigned as = crypto_aead_authsize(crypto_aead_reqtfm(&ctx->aead_req));
+
+	return (ctx->used >= (ctx->aead_assoclen + (ctx->enc ? 0 : as)));
+}
+
+static void aead_put_sgl(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	struct scatterlist *sg = sgl->sg;
+	unsigned int i;
+
+	for (i = 0; i < sgl->cur; i++) {
+		if (!sg_page(sg + i))
+			continue;
+
+		put_page(sg_page(sg + i));
+		sg_assign_page(sg + i, NULL);
+	}
+	sgl->cur = 0;
+	ctx->used = 0;
+	ctx->more = 0;
+	ctx->merge = 0;
+}
+
+static void aead_wmem_wakeup(struct sock *sk)
+{
+	struct socket_wq *wq;
+
+	if (!aead_writable(sk))
+		return;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
+							   POLLRDNORM |
+							   POLLRDBAND);
+	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	rcu_read_unlock();
+}
+
+static int aead_wait_for_data(struct sock *sk, unsigned flags)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	long timeout;
+	DEFINE_WAIT(wait);
+	int err = -ERESTARTSYS;
+
+	if (flags & MSG_DONTWAIT) {
+		return -EAGAIN;
+	}
+
+	set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+	for (;;) {
+		if (signal_pending(current))
+			break;
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		timeout = MAX_SCHEDULE_TIMEOUT;
+		if (sk_wait_event(sk, &timeout, !ctx->more)) {
+			err = 0;
+			break;
+		}
+	}
+	finish_wait(sk_sleep(sk), &wait);
+
+	clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+	return err;
+}
+
+static void aead_data_wakeup(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	struct socket_wq *wq;
+
+	if (ctx->more)
+		return;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
+							   POLLRDNORM |
+							   POLLRDBAND);
+	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+	rcu_read_unlock();
+}
+
+static int aead_sendmsg(struct kiocb *unused, struct socket *sock,
+		        struct msghdr *msg, size_t size)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned ivsize =
+		crypto_aead_ivsize(crypto_aead_reqtfm(&ctx->aead_req));
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	struct af_alg_control con = {};
+	long copied = 0;
+	bool enc = 0;
+	bool init = 0;
+	int err = -EINVAL;
+
+	if (msg->msg_controllen) {
+		err = af_alg_cmsg_send(msg, &con);
+		if (err)
+			return err;
+
+		init = 1;
+		switch (con.op) {
+		case ALG_OP_ENCRYPT:
+			enc = 1;
+			break;
+		case ALG_OP_DECRYPT:
+			enc = 0;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		if (con.iv && con.iv->ivlen != ivsize)
+			return -EINVAL;
+	}
+
+	lock_sock(sk);
+	if (!ctx->more && ctx->used)
+		goto unlock;
+
+	if (init) {
+		ctx->enc = enc;
+		if (con.iv)
+			memcpy(ctx->iv, con.iv->iv, ivsize);
+
+		ctx->aead_assoclen = con.aead_assoclen;
+	}
+
+	while (size) {
+		unsigned long len = size;
+		struct scatterlist *sg = NULL;
+
+		/* use the existing memory in an allocated page */
+		if (ctx->merge) {
+			sg = sgl->sg + sgl->cur - 1;
+			len = min_t(unsigned long, len,
+				    PAGE_SIZE - sg->offset - sg->length);
+			err = memcpy_from_msg(page_address(sg_page(sg)) +
+					      sg->offset + sg->length,
+					      msg, len);
+			if (err)
+				goto unlock;
+
+			sg->length += len;
+			ctx->merge = (sg->offset + sg->length) &
+				     (PAGE_SIZE - 1);
+
+			ctx->used += len;
+			copied += len;
+			size -= len;
+		}
+
+		if (!aead_writable(sk)) {
+			/* user space sent too much data */
+			aead_put_sgl(sk);
+			err = -EMSGSIZE;
+			goto unlock;
+		}
+
+		/* allocate a new page */
+		len = min_t(unsigned long, size, aead_sndbuf(sk));
+		while (len) {
+			int plen = 0;
+
+			if (sgl->cur >= ALG_MAX_PAGES) {
+				err = -E2BIG;
+				goto unlock;
+			}
+
+			sg = sgl->sg + sgl->cur;
+			plen = min_t(int, len, PAGE_SIZE);
+
+			sg_assign_page(sg, alloc_page(GFP_KERNEL));
+			err = -ENOMEM;
+			if (!sg_page(sg))
+				goto unlock;
+
+			err = memcpy_from_msg(page_address(sg_page(sg)),
+					      msg, plen);
+			if (err) {
+				__free_page(sg_page(sg));
+				sg_assign_page(sg, NULL);
+				goto unlock;
+			}
+
+			sg->offset = 0;
+			sg->length = plen;
+			len -= plen;
+			ctx->used += plen;
+			copied += plen;
+			sgl->cur++;
+			size -= plen;
+			ctx->merge = plen & (PAGE_SIZE - 1);
+		}
+	}
+
+	err = 0;
+
+	ctx->more = msg->msg_flags & MSG_MORE;
+	if (!ctx->more && !aead_sufficient_data(ctx)) {
+		aead_put_sgl(sk);
+		err = -EMSGSIZE;
+	}
+
+unlock:
+	aead_data_wakeup(sk);
+	release_sock(sk);
+
+	return err ?: copied;
+}
+
+static ssize_t aead_sendpage(struct socket *sock, struct page *page,
+			     int offset, size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	int err = -EINVAL;
+
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		flags |= MSG_MORE;
+
+	if (sgl->cur >= ALG_MAX_PAGES)
+		return -E2BIG;
+
+	lock_sock(sk);
+	if (!ctx->more && ctx->used)
+		goto unlock;
+
+	if (!size)
+		goto done;
+
+	if (!aead_writable(sk)) {
+		/* user space sent too much data */
+		aead_put_sgl(sk);
+		err = -EMSGSIZE;
+		goto unlock;
+	}
+
+	ctx->merge = 0;
+
+	get_page(page);
+	sg_set_page(sgl->sg + sgl->cur, page, size, offset);
+	sgl->cur++;
+	ctx->used += size;
+
+	err = 0;
+
+done:
+	ctx->more = flags & MSG_MORE;
+	if (!ctx->more && !aead_sufficient_data(ctx)) {
+		aead_put_sgl(sk);
+		err = -EMSGSIZE;
+	}
+
+unlock:
+	aead_data_wakeup(sk);
+	release_sock(sk);
+
+	return err ?: size;
+}
+
+static int aead_recvmsg(struct kiocb *unused, struct socket *sock,
+			struct msghdr *msg, size_t ignored, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned bs = crypto_aead_blocksize(crypto_aead_reqtfm(&ctx->aead_req));
+	unsigned as = crypto_aead_authsize(crypto_aead_reqtfm(&ctx->aead_req));
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	struct scatterlist *sg = sgl->sg;
+	struct scatterlist assoc[ALG_MAX_PAGES];
+	size_t assoclen = 0;
+	unsigned int i = 0;
+	int err = -EINVAL;
+	unsigned long used = 0;
+	unsigned long outlen = 0;
+
+	/*
+	 * Require exactly one IOV block as the AEAD operation is a one shot
+	 * due to the authentication tag.
+	 */
+	if (msg->msg_iter.nr_segs != 1)
+		return -ENOMSG;
+
+	lock_sock(sk);
+	/*
+	* AEAD memory structure: For encryption, the tag is appended to the
+	* ciphertext which implies that the memory allocated for the ciphertext
+	* must be increased by the tag length. For decryption, the tag
+	* is expected to be concatenated to the ciphertext. The plaintext
+	* therefore has a memory size of the ciphertext minus the tag length.
+	*
+	* The memory structure for cipher operation has the following
+	* structure:
+	*	AEAD encryption input:  assoc data || plaintext
+	*	AEAD encryption output: cipherntext || auth tag
+	*	AEAD decryption input:  assoc data || ciphertext || auth tag
+	*	AEAD decryption output: plaintext
+	*/
+
+	if (ctx->more) {
+		err = aead_wait_for_data(sk, flags);
+		if (err)
+			goto unlock;
+	}
+
+	used = ctx->used;
+
+	/*
+	 * Make sure sufficient data is present -- note, the same check is
+	 * is also present in sendmsg/sendpage. The checks in sendpage/sendmsg
+	 * shall provide an information to the data sender that something is
+	 * wrong, but they are irrelevant to maintain the kernel integrity.
+	 * We need this check here too in case user space decides to not honor
+	 * the error message in sendmsg/sendpage and still call recvmsg. This
+	 * check here protects the kernel integrity.
+	 */
+	if (!aead_sufficient_data(ctx))
+		goto unlock;
+
+	/*
+	 * The cipher operation input data is reduced by the associated data
+	 * length as this data is processed separately later on.
+	 */
+	used -= ctx->aead_assoclen;
+
+	if (ctx->enc) {
+		/* round up output buffer to multiple of block size */
+		outlen = ((used + bs - 1) / bs * bs);
+		/* add the size needed for the auth tag to be created */
+		outlen += as;
+	} else {
+		/* output data size is input without the authentication tag */
+		outlen = used - as;
+		/* round up output buffer to multiple of block size */
+		outlen = ((outlen + bs - 1) / bs * bs);
+	}
+
+	/* ensure output buffer is sufficiently large */
+	if (msg->msg_iter.iov->iov_len < outlen)
+		goto unlock;
+
+	outlen = af_alg_make_sg(&ctx->rsgl, msg->msg_iter.iov->iov_base,
+				outlen, 1);
+	err = outlen;
+	if (err < 0)
+		goto unlock;
+
+	err = -EINVAL;
+	sg_init_table(assoc, ALG_MAX_PAGES);
+	assoclen = ctx->aead_assoclen;
+	/*
+	 * Split scatterlist into two: first part becomes AD, second part
+	 * is plaintext / ciphertext. The first part is assigned to assoc
+	 * scatterlist. When this loop finishes, sg points to the start of the
+	 * plaintext / ciphertext.
+	 */
+	for (i = 0; i < ctx->tsgl.cur; i++) {
+		sg = sgl->sg + i;
+		if (sg->length <= assoclen) {
+			/* AD is larger than one page */
+			sg_set_page(assoc + i, sg_page(sg),
+				    sg->length, sg->offset);
+			assoclen -= sg->length;
+			if (i >= ctx->tsgl.cur)
+				goto unlock;
+		} else if (!assoclen) {
+			/* current page is to start of plaintext / ciphertext */
+			if (i)
+				/* AD terminates at page boundary */
+				sg_mark_end(assoc + i - 1);
+			else
+				/* AD size is zero */
+				sg_mark_end(assoc);
+			break;
+		} else {
+			/* AD does not terminate at page boundary */
+			sg_set_page(assoc + i, sg_page(sg),
+				    assoclen, sg->offset);
+			sg_mark_end(assoc + i);
+			/* plaintext / ciphertext starts after AD */
+			sg->length -= assoclen;
+			sg->offset += assoclen;
+			break;
+		}
+	}
+
+	aead_request_set_assoc(&ctx->aead_req, assoc, ctx->aead_assoclen);
+	aead_request_set_crypt(&ctx->aead_req, sg, ctx->rsgl.sg, used, ctx->iv);
+
+	err = af_alg_wait_for_completion(ctx->enc ?
+					 crypto_aead_encrypt(&ctx->aead_req) :
+					 crypto_aead_decrypt(&ctx->aead_req),
+					 &ctx->completion);
+
+	af_alg_free_sg(&ctx->rsgl);
+
+	if (err) {
+		/* EBADMSG implies a valid cipher operation took place */
+		if (err == -EBADMSG)
+			aead_put_sgl(sk);
+		goto unlock;
+	}
+
+	aead_put_sgl(sk);
+
+	err = 0;
+
+unlock:
+	aead_wmem_wakeup(sk);
+	release_sock(sk);
+
+	return err ? err : outlen;
+}
+
+static unsigned int aead_poll(struct file *file, struct socket *sock,
+			      poll_table *wait)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned int mask;
+
+	sock_poll_wait(file, sk_sleep(sk), wait);
+	mask = 0;
+
+	if (!ctx->more)
+		mask |= POLLIN | POLLRDNORM;
+
+	if (aead_writable(sk))
+		mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
+
+	return mask;
+}
+
+static struct proto_ops algif_aead_ops = {
+	.family		=	PF_ALG,
+
+	.connect	=	sock_no_connect,
+	.socketpair	=	sock_no_socketpair,
+	.getname	=	sock_no_getname,
+	.ioctl		=	sock_no_ioctl,
+	.listen		=	sock_no_listen,
+	.shutdown	=	sock_no_shutdown,
+	.getsockopt	=	sock_no_getsockopt,
+	.mmap		=	sock_no_mmap,
+	.bind		=	sock_no_bind,
+	.accept		=	sock_no_accept,
+	.setsockopt	=	sock_no_setsockopt,
+
+	.release	=	af_alg_release,
+	.sendmsg	=	aead_sendmsg,
+	.sendpage	=	aead_sendpage,
+	.recvmsg	=	aead_recvmsg,
+	.poll		=	aead_poll,
+};
+
+static void *aead_bind(const char *name, u32 type, u32 mask)
+{
+	return crypto_alloc_aead(name, type, mask);
+}
+
+static void aead_release(void *private)
+{
+	crypto_free_aead(private);
+}
+
+static int aead_setauthsize(void *private, unsigned int authsize)
+{
+	return crypto_aead_setauthsize(private, authsize);
+}
+
+static int aead_setkey(void *private, const u8 *key, unsigned int keylen)
+{
+	return crypto_aead_setkey(private, key, keylen);
+}
+
+static void aead_sock_destruct(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned int ivlen = crypto_aead_ivsize(
+				crypto_aead_reqtfm(&ctx->aead_req));
+
+	aead_put_sgl(sk);
+	sock_kzfree_s(sk, ctx->iv, ivlen);
+	sock_kfree_s(sk, ctx, ctx->len);
+	af_alg_release_parent(sk);
+}
+
+static int aead_accept_parent(void *private, struct sock *sk)
+{
+	struct aead_ctx *ctx;
+	struct alg_sock *ask = alg_sk(sk);
+	unsigned int len = sizeof(*ctx) + crypto_aead_reqsize(private);
+	unsigned int ivlen = crypto_aead_ivsize(private);
+
+	ctx = sock_kmalloc(sk, len, GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+	memset(ctx, 0, len);
+
+	ctx->iv = sock_kmalloc(sk, ivlen, GFP_KERNEL);
+	if (!ctx->iv) {
+		sock_kfree_s(sk, ctx, len);
+		return -ENOMEM;
+	}
+	memset(ctx->iv, 0, ivlen);
+
+	ctx->len = len;
+	ctx->used = 0;
+	ctx->more = 0;
+	ctx->merge = 0;
+	ctx->enc = 0;
+	ctx->tsgl.cur = 0;
+	ctx->aead_assoclen = 0;
+	af_alg_init_completion(&ctx->completion);
+	sg_init_table(ctx->tsgl.sg, ALG_MAX_PAGES);
+
+	ask->private = ctx;
+
+	aead_request_set_tfm(&ctx->aead_req, private);
+	aead_request_set_callback(&ctx->aead_req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				  af_alg_complete, &ctx->completion);
+
+	sk->sk_destruct = aead_sock_destruct;
+
+	return 0;
+}
+
+static const struct af_alg_type algif_type_aead = {
+	.bind		=	aead_bind,
+	.release	=	aead_release,
+	.setkey		=	aead_setkey,
+	.setauthsize	=	aead_setauthsize,
+	.accept		=	aead_accept_parent,
+	.ops		=	&algif_aead_ops,
+	.name		=	"aead",
+	.owner		=	THIS_MODULE
+};
+
+static int __init algif_aead_init(void)
+{
+	return af_alg_register_type(&algif_type_aead);
+}
+
+static void __exit algif_aead_exit(void)
+{
+	int err = af_alg_unregister_type(&algif_type_aead);
+	BUG_ON(err);
+}
+
+module_init(algif_aead_init);
+module_exit(algif_aead_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Stephan Mueller <smueller-T9tCv8IpfcWELgA04lAiVw@public.gmane.org>");
+MODULE_DESCRIPTION("AEAD kernel crypto API user space interface");
-- 
2.1.0

^ permalink raw reply related

* [PATCH v11 2/2] crypto: AF_ALG: enable AEAD interface compilation
From: Stephan Mueller @ 2015-01-21  1:19 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto, linux-api, Neil Horman
In-Reply-To: <1923793.K38mGRD6eo@tachyon.chronox.de>

Enable compilation of the AEAD AF_ALG support and provide a Kconfig
option to compile the AEAD AF_ALG support.

Signed-off-by: Stephan Mueller <smueller@chronox.de>
---
 crypto/Kconfig  | 9 +++++++++
 crypto/Makefile | 1 +
 2 files changed, 10 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 50f4da4..41a3fc5 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1523,6 +1523,15 @@ config CRYPTO_USER_API_RNG
 	  This option enables the user-spaces interface for random
 	  number generator algorithms.
 
+config CRYPTO_USER_API_AEAD
+	tristate "User-space interface for AEAD cipher algorithms"
+	depends on NET
+	select CRYPTO_AEAD
+	select CRYPTO_USER_API
+	help
+	  This option enables the user-spaces interface for AEAD
+	  cipher algorithms.
+
 config CRYPTO_HASH_INFO
 	bool
 
diff --git a/crypto/Makefile b/crypto/Makefile
index ba19465..97b7d3a 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
 obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
 obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
 obj-$(CONFIG_CRYPTO_USER_API_RNG) += algif_rng.o
+obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o
 
 #
 # generic algorithms and the async_tx api
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Fam Zheng @ 2015-01-21  4:59 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
	Peter Zijlstra <peter>
In-Reply-To: <54BE4F1D.7090807@gmail.com>

On Tue, 01/20 13:50, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
> 
> On 01/20/2015 10:57 AM, Fam Zheng wrote:
> > This syscall is a sequence of
> > 
> > 1) a number of epoll_ctl calls
> > 2) a epoll_pwait, with timeout enhancement.
> > 
> > The epoll_ctl operations are embeded so that application doesn't have to use
> > separate syscalls to insert/delete/update the fds before poll. It is more
> > efficient if the set of fds varies from one poll to another, which is the
> > common pattern for certain applications. 
> 
> Which applications? Could we have some specific examples? This is a 
> complex API, and it needs good justification.

OK, I'll explain more in v2.

> 
> > For example, depending on the input
> > buffer status, a data reading program may decide to temporarily not polling an
> > fd.
> > 
> > Because the enablement of batching in this interface, even that regular
> > epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> > single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
>          ^^^^^^^^^^^^^^ should be epoll_mod_wait
> 
> I think you mean to say:
> 
>     The ability to batch multiple "epoll_ctl" operations into a single call
>     means that even when no wait events are requested (i.e., spec == NULL),
>     poll_mod_wait() provides a performance optimization over using multiple
>     epoll_ctl() calls.
> 
> Right? If yes, please amend the commit message, and this text should
> also make its way into the revised man page under a heading "NOTES".

OK.

> 
> > The only complexity is returning the result of each operation.  For each
> > epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> > the return code *iff* the command is executed (0 for success and -errno of the
> > equivalent epoll_ctl call), and will be left unchanged if the command is not
> > executed because some earlier error, for example due to failure of
> > copy_from_user to copy the array.
> > 
> > Applications can utilize this fact to do error handling: they could initialize
> > all the epoll_mod_wait.error to a positive value, which is by definition not a
> > possible output value from epoll_mod_wait. Then when the syscall returned, they
> > know whether or not the command is executed by comparing each error with the
> > init value, if they're different, they have the result of the command.
> > More roughly, they can put any non-zero and not distinguish "not run" from
> > failure.
> 
> The "cmds' are not executed in a specified order plus the need to
> initialize the 'errors' fields to a positive value feels a bit ugly.
> And indeed the whole "command list was only partially run" case
> is not pretty. Am I correct to understand that if an error is found
> during execution of one of the "epoll_ctl" commands in 'cmds' then
> the system call will return -1 with errno set, indicating an error,
> even though the epoll interest list may have changed because some
> of the earlier 'cmds' executed successfully? This all seems a bit of
> a headache for user space.

This is the trade-off for batching. The best we can do is probably make this
transactional: none or all of the commands succeeds. It will require a much
more complex implementation, though. But even with that, the error reporting on
which command failed is a complication.

> 
> I have a couple of questions:
> 
> Q1. I can see that batching "epoll_ctl" commands might be useful,
> since it results in fewer systems calls. But, does it really
> need to be bound together with the "epoll_pwait" functionality?
> (Perhaps this point was covered in previous discussions, but
> neither the message accompanying this patch nor the 0/6 man page
> provide a compelling rationale for the need to bind these two
> operations together.)
> 
> Yes, I realize you might save a system call, but it makes for a
> cumbersome API that has the above headache, and also forces the 
> need for double pointer indirection in the 'spec' argument (i.e., 
> spec is a pointer to an array of structures where each element
> in turn includes an 'events' pointer that points to another array).
> 
> Why not a simpler API with two syscalls such as:
> 
> epoll_ctl_batch(int epfd, int flags,
>                 int ncmds, struct epoll_mod_cmd *cmds);
> 
> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
>              struct timespec *timeout, int clock_id, 
>              const sigset_t *sigmask, size_t sigsetsize);

The problem is that there is no room for flags field in epoll_pwait1, which is
asked for, in previous discussion thread [1].

I don't see epoll_mod_wait as a *significantly more* complicated interface
compared to epoll_ctl_batch and epoll_pwait1 above. In epoll_mod_wait, if you
leave out ncmds and cmds, it is effectively a poll without batch; and if
leaving out spec, it is effectively a batch without poll.

The most important change here is the timeout. IMO I wouldn't mind leaving out
batching. Integrating it is requested by Andy:

[1]: http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591

which also made sense to me; I do have a patch in QEMU to call epoll_ctl for a
number of times right before epoll_wait.

[Sorry for not putting anything into cover letter changelog, but it is also
interesting to see people's reaction on the patch itself without bias of
others' opinions. This indeed brings in more points. :]

> 
> This gives us much of the benefit of reducing system calls, but 
> with greater simplicity. And epoll_ctl_batch() could simply return
> the number of 'cmds' that were successfully executed.)
> 
> Q2. In the man page in 0/6 you said that the 'cmds' were not 
> guaranteed to be executed in order. Why not? If you did provide
> such a guarantee, then, when using your current epoll_mod_wait(),
> user space could do the following:

I guess we can make a guarentee on that.

> 
> 1. Initialize the cmd.errors fields to zero.
> 2. Call epoll_ctl_mod()
> 3. Iterate through cmd.errors looking for the first nonzero 
>    field.

It's close, but zero is not good enough, if copy_from_user of cmds failed in
the first place. Impossible value, or error value, will be safer.

> 
> > Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> > scalar. This provides higher precision. 
> 
> Yes, that change seemed inevitable. It slightly puzzled me at the time when
> Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
> though pselect() already had demonstrated the need for higher precision.
> I should have called it out way back then :-{.
> 
> > The parameter field in struct
> > epoll_wait_spec, "clockid", also makes it possible for users to use a different
> > clock than the default when it makes more sense.
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/syscalls.h |  5 ++++
> >  2 files changed, 65 insertions(+)
> > 
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index e7a116d..2cc22c9 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
> >  			      sigmask ? &ksigmask : NULL);
> >  }
> >  
> > +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> > +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> > +		struct epoll_wait_spec __user *, spec)
> > +{
> > +	struct epoll_mod_cmd *kcmds = NULL;
> > +	int i, ret = 0;
> > +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +	if (ncmds) {
> > +		if (!cmds)
> > +			return -EINVAL;
> > +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> > +		if (!kcmds)
> > +			return -ENOMEM;
> > +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +	}
> > +	for (i = 0; i < ncmds; i++) {
> > +		struct epoll_event ev = (struct epoll_event) {
> > +			.events = kcmds[i].events,
> > +			.data = kcmds[i].data,
> > +		};
> > +		if (kcmds[i].flags) {
> > +			kcmds[i].error = ret = -EINVAL;
> 
> To make the 'ret' change a little more obvious, maybe it's better to write
> 
> 			ret = kcmds[i].error = -EINVAL;
> 
> > +			goto out;
> > +		}
> > +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> 
> Likewise:
> 		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> 
> > +		if (ret)
> > +			goto out;
> > +	}
> > +	if (spec) {
> > +		sigset_t ksigmask;
> > +		struct epoll_wait_spec kspec;
> > +		ktime_t timeout;
> > +
> > +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> 
> Cosmetic point: s/if(/if (/
> 
> > +			return -EFAULT;
> > +		if (kspec.sigmask) {
> > +			if (kspec.sigsetsize != sizeof(sigset_t))
> > +				return -EINVAL;
> > +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> > +				return -EFAULT;
> > +		}
> > +		timeout = timespec_to_ktime(kspec.timeout);
> > +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> > +				     kspec.clockid, timeout,
> > +				     kspec.sigmask ? &ksigmask : NULL);
> 
> If I understand correctly, the implementation means that the
> 'size_t sigsetsize' field will probably need to be exposed to 
> applications. In the existing epoll_pwait() call (as in  ppoll()
> and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
> However, unless we expect glibc to do some structure copying to/from
> a structure that hides this field, then we're going end up exposing
> 'size_t sigsetsize' to applications. (This could be avoided, if we
> split the API as I suggest above. glibc would do the same thing 
> in epoll_pwait1() that it currently does in epoll_pwait().)
> 
> Thanks,
> 
> Michael
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Michael Kerrisk (man-pages) @ 2015-01-21  5:55 UTC (permalink / raw)
  To: Andy Lutomirski, Fam Zheng
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers
In-Reply-To: <CALCETrU4TeG1ShVLkQgqQ6usFm8pg_t0D8K=Mi_UJGSfxUwXtA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 01/20/2015 11:40 PM, Andy Lutomirski wrote:
> On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> This adds a new system call, epoll_mod_wait. It's described as below:

[...]

>>        There is no guartantee that all the commands are executed in order. Only
>>        if all the commands are successfully executed (all the error fields are
>>        set to 0), events are polled.
> 
> If this doesn't happen, what error is returned?

If I read the code correctly: the error of the first epoll_ctl op that fails.

[...]

>> RETURN VALUE
>>
>>        When any error occurs, epoll_mod_wait() returns -1 and errno is set
>>        appropriately. All the "error" fields in cmds are unchanged before they
>>        are executed, and if any cmds are executed, the "error" fields are set
>>        to a return code accordingly. See also epoll_ctl for more details of the
>>        return code.
> 
> Does this mean that callers should initialize the error fields to an
> impossible value first so they can tell which commands were executed?

Yes. (Ugly!)

[...]

>> ERRORS
>>
>>        These errors apply on either the return value of epoll_mod_wait or error
>>        status for each command, respectively.
> 
> Please clarify which errors are returned overall and which are per-command.

Yes, I think this would be valuable as well.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* [PATCH] selftests/exec: Check if the syscall exists and bail if not
From: Michael Ellerman @ 2015-01-21  7:41 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, geert-Td1EMuHUCqxL1ZNQvxDV9g,
	drysdale-hpIqsD4AKlfQT0dZR+AlfA, shuahkh-JPH+aEBZ4P+UEJcrhfAQsw,
	davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL

On systems which don't implement sys_execveat(), this test produces a
lot of output.

Add a check at the beginning to see if the syscall is present, and if
not just note one error and return.

Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
---
 tools/testing/selftests/exec/execveat.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
index e238c9559caf..b87e4a843bea 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -234,6 +234,14 @@ static int run_tests(void)
 	int fd_cloexec = open_or_die("execveat", O_RDONLY|O_CLOEXEC);
 	int fd_script_cloexec = open_or_die("script", O_RDONLY|O_CLOEXEC);
 
+	/* Check if we have execveat at all, and bail early if not */
+	errno = 0;
+	execveat_(-1, NULL, NULL, NULL, 0);
+	if (errno == -ENOSYS) {
+		printf("[FAIL] ENOSYS calling execveat - no kernel support?\n");
+		return 1;
+	}
+
 	/* Change file position to confirm it doesn't affect anything */
 	lseek(fd, 10, SEEK_SET);
 
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Michael Kerrisk (man-pages) @ 2015-01-21  7:52 UTC (permalink / raw)
  To: Fam Zheng
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers
In-Reply-To: <20150121045903.GA2858-+wGkCoP0yD+sDdueE5tM26fLeoKvNuZc@public.gmane.org>

Hello Fam Zheng,

On 01/21/2015 05:59 AM, Fam Zheng wrote:
> On Tue, 01/20 13:50, Michael Kerrisk (man-pages) wrote:
>> Hello Fam Zheng,
>>
>> On 01/20/2015 10:57 AM, Fam Zheng wrote:
>>> This syscall is a sequence of
>>>
>>> 1) a number of epoll_ctl calls
>>> 2) a epoll_pwait, with timeout enhancement.
>>>
>>> The epoll_ctl operations are embeded so that application doesn't have to use
>>> separate syscalls to insert/delete/update the fds before poll. It is more
>>> efficient if the set of fds varies from one poll to another, which is the
>>> common pattern for certain applications. 
>>
>> Which applications? Could we have some specific examples? This is a 
>> complex API, and it needs good justification.
> 
> OK, I'll explain more in v2.

Okay.

[...]

>>> The only complexity is returning the result of each operation.  For each
>>> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
>>> the return code *iff* the command is executed (0 for success and -errno of the
>>> equivalent epoll_ctl call), and will be left unchanged if the command is not
>>> executed because some earlier error, for example due to failure of
>>> copy_from_user to copy the array.
>>>
>>> Applications can utilize this fact to do error handling: they could initialize
>>> all the epoll_mod_wait.error to a positive value, which is by definition not a
>>> possible output value from epoll_mod_wait. Then when the syscall returned, they
>>> know whether or not the command is executed by comparing each error with the
>>> init value, if they're different, they have the result of the command.
>>> More roughly, they can put any non-zero and not distinguish "not run" from
>>> failure.
>>
>> The "cmds' are not executed in a specified order plus the need to
>> initialize the 'errors' fields to a positive value feels a bit ugly.
>> And indeed the whole "command list was only partially run" case
>> is not pretty. Am I correct to understand that if an error is found
>> during execution of one of the "epoll_ctl" commands in 'cmds' then
>> the system call will return -1 with errno set, indicating an error,
>> even though the epoll interest list may have changed because some
>> of the earlier 'cmds' executed successfully? This all seems a bit of
>> a headache for user space.
> 
> This is the trade-off for batching. The best we can do is probably make this
> transactional: none or all of the commands succeeds. It will require a much
> more complex implementation, though. 

Transactional would be more comfortable for user-space, and while I 
can see that it would be complex to implement, perhaps the greater
point might be that the implementation is CPU expensive.

> But even with that, the error reporting on
> which command failed is a complication.

My suggestions below could make the error reporting much simpler...

>> I have a couple of questions:
>>
>> Q1. I can see that batching "epoll_ctl" commands might be useful,
>> since it results in fewer systems calls. But, does it really
>> need to be bound together with the "epoll_pwait" functionality?
>> (Perhaps this point was covered in previous discussions, but
>> neither the message accompanying this patch nor the 0/6 man page
>> provide a compelling rationale for the need to bind these two
>> operations together.)
>>
>> Yes, I realize you might save a system call, but it makes for a
>> cumbersome API that has the above headache, and also forces the 
>> need for double pointer indirection in the 'spec' argument (i.e., 
>> spec is a pointer to an array of structures where each element
>> in turn includes an 'events' pointer that points to another array).
>>
>> Why not a simpler API with two syscalls such as:
>>
>> epoll_ctl_batch(int epfd, int flags,
>>                 int ncmds, struct epoll_mod_cmd *cmds);
>>
>> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
>>              struct timespec *timeout, int clock_id, 
>>              const sigset_t *sigmask, size_t sigsetsize);
> 
> The problem is that there is no room for flags field in epoll_pwait1, which is
> asked for, in previous discussion thread [1].

Ahh yes, I certainly should not have forgotten that. But that's easily solved.
Do as for pselect6():

strcut sigargs {
    const sigset_t *ss;
    size_t          ss_len; /* Size (in bytes) of object pointed
                               to by 'ss' */
}

epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
             struct timespec *timeout, int clock_id, int flags,
             int timeout,
             const struct sigargs *sargs);

> I don't see epoll_mod_wait as a *significantly more* complicated interface
> compared to epoll_ctl_batch and epoll_pwait1 above. 

My biggest problem with epoll_ctl_wait() is the complexity of error 
handling. epoll_ctl_batch and epoll_pwait1 would simplify that, as I 
note below.

Aside from that, I do think that epoll_ctl_wait() passes a certain
threshold of complexity that warrants good justification, which
I haven't really seen yet.

> In epoll_mod_wait, if you
> leave out ncmds and cmds, it is effectively a poll without batch; and if
> leaving out spec, it is effectively a batch without poll.
> 
> The most important change here is the timeout. IMO I wouldn't mind leaving out
> batching. Integrating it is requested by Andy:
> 
> [1]: http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591
> 
> which also made sense to me; I do have a patch in QEMU to call epoll_ctl for a
> number of times right before epoll_wait.
> 
> [Sorry for not putting anything into cover letter changelog, but it is also
> interesting to see people's reaction on the patch itself without bias of
> others' opinions. This indeed brings in more points. :]

But it also has the downside that the same discussions
may be repeated.

>> This gives us much of the benefit of reducing system calls, but 
>> with greater simplicity. And epoll_ctl_batch() could simply return
>> the number of 'cmds' that were successfully executed.)
>>
>> Q2. In the man page in 0/6 you said that the 'cmds' were not 
>> guaranteed to be executed in order. Why not? If you did provide
>> such a guarantee, then, when using your current epoll_mod_wait(),
>> user space could do the following:
> 
> I guess we can make a guarentee on that.

I'm puzzled by that response. Surely you *must* guarantee it.
If there's no defined order, then if the batch includes multiple 
commands that operate on the same FD, the result is undefined 
unless you provide that guarantee. (Unless, of course, you want to
explicitly specify that using the same FD multiple times in the
batch gives undefined behavior.)

>> 1. Initialize the cmd.errors fields to zero.
>> 2. Call epoll_ctl_mod()
>> 3. Iterate through cmd.errors looking for the first nonzero 
>>    field.
> 
> It's close, but zero is not good enough, if copy_from_user of cmds failed in
> the first place. Impossible value, or error value, will be safer.

See my comment in the earlier mail. If you split this into two 
APIs, and epoll_ctl_batch() is guaranteed to execute 'cmds' in order, 
then the return value of epoll_ctl_batch() could be used to tell
user space how many commands succeeded. Much simpler!

>>> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
>>> scalar. This provides higher precision. 
>>
>> Yes, that change seemed inevitable. It slightly puzzled me at the time when
>> Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
>> though pselect() already had demonstrated the need for higher precision.
>> I should have called it out way back then :-{.
>>
>>> The parameter field in struct
>>> epoll_wait_spec, "clockid", also makes it possible for users to use a different
>>> clock than the default when it makes more sense.
>>>
>>> Signed-off-by: Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/syscalls.h |  5 ++++
>>>  2 files changed, 65 insertions(+)
>>>
>>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>>> index e7a116d..2cc22c9 100644
>>> --- a/fs/eventpoll.c
>>> +++ b/fs/eventpoll.c
>>> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>>>  			      sigmask ? &ksigmask : NULL);
>>>  }
>>>  
>>> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
>>> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
>>> +		struct epoll_wait_spec __user *, spec)
>>> +{
>>> +	struct epoll_mod_cmd *kcmds = NULL;
>>> +	int i, ret = 0;
>>> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
>>> +
>>> +	if (flags)
>>> +		return -EINVAL;
>>> +	if (ncmds) {
>>> +		if (!cmds)
>>> +			return -EINVAL;
>>> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
>>> +		if (!kcmds)
>>> +			return -ENOMEM;
>>> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
>>> +			ret = -EFAULT;
>>> +			goto out;
>>> +		}
>>> +	}
>>> +	for (i = 0; i < ncmds; i++) {
>>> +		struct epoll_event ev = (struct epoll_event) {
>>> +			.events = kcmds[i].events,
>>> +			.data = kcmds[i].data,
>>> +		};
>>> +		if (kcmds[i].flags) {
>>> +			kcmds[i].error = ret = -EINVAL;
>>
>> To make the 'ret' change a little more obvious, maybe it's better to write
>>
>> 			ret = kcmds[i].error = -EINVAL;
>>
>>> +			goto out;
>>> +		}
>>> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
>>
>> Likewise:
>> 		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
>>
>>> +		if (ret)
>>> +			goto out;
>>> +	}
>>> +	if (spec) {
>>> +		sigset_t ksigmask;
>>> +		struct epoll_wait_spec kspec;
>>> +		ktime_t timeout;
>>> +
>>> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
>>
>> Cosmetic point: s/if(/if (/
>>
>>> +			return -EFAULT;
>>> +		if (kspec.sigmask) {
>>> +			if (kspec.sigsetsize != sizeof(sigset_t))
>>> +				return -EINVAL;
>>> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
>>> +				return -EFAULT;
>>> +		}
>>> +		timeout = timespec_to_ktime(kspec.timeout);
>>> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
>>> +				     kspec.clockid, timeout,
>>> +				     kspec.sigmask ? &ksigmask : NULL);
>>
>> If I understand correctly, the implementation means that the
>> 'size_t sigsetsize' field will probably need to be exposed to 
>> applications. In the existing epoll_pwait() call (as in  ppoll()
>> and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
>> However, unless we expect glibc to do some structure copying to/from
>> a structure that hides this field, then we're going end up exposing
>> 'size_t sigsetsize' to applications. (This could be avoided, if we
>> split the API as I suggest above. glibc would do the same thing 
>> in epoll_pwait1() that it currently does in epoll_pwait().)

You missed responding to this point; I think it matters.
(There were also some other points to consider in my reply 
to your 0/6 mail.)

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Omar Sandoval @ 2015-01-21  7:56 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra <peterz@
In-Reply-To: <1421747878-30744-6-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tue, Jan 20, 2015 at 05:57:57PM +0800, Fam Zheng wrote:
> This syscall is a sequence of
> 
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
> 
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications. For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
> 
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
> 
> The only complexity is returning the result of each operation.  For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
> 
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait. Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.
> 
> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision. The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
> 
> Signed-off-by: Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  			      sigmask ? &ksigmask : NULL);
>  }
>  
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;
> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> +			return -EFAULT;
This should probably be goto out, or you'll leak kcmds.

> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
Same here...

> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
and here.

> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);
> +	}
> +
> +out:
> +	if (ncmds && copy_to_user(cmds, kcmds, cmd_size))
> +		return -EFAULT;
This will also leak kcmds, it should be ret = -EFAULT. This case, however, seems
to lead to a weird corner case: if cmds is read-only, we'll end up executing
every command but fail to copy out the return values, so when userspace gets the
EFAULT, it won't know whether anything was executed. But, getting an EFAULT here
means you're probably doing something wrong anyways, so maybe not the biggest
concern.

> +	kfree(kcmds);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_COMPAT
>  COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
>  		       struct epoll_event __user *, events,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 85893d7..7156c80 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -12,6 +12,8 @@
>  #define _LINUX_SYSCALLS_H
>  
>  struct epoll_event;
> +struct epoll_mod_cmd;
> +struct epoll_wait_spec;
>  struct iattr;
>  struct inode;
>  struct iocb;
> @@ -630,6 +632,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
>  				int maxevents, int timeout,
>  				const sigset_t __user *sigmask,
>  				size_t sigsetsize);
> +asmlinkage long sys_epoll_mod_wait(int epfd, int flags,
> +				   int ncmds, struct epoll_mod_cmd __user * cmds,
> +				   struct epoll_wait_spec __user * spec);
>  asmlinkage long sys_gethostname(char __user *name, int len);
>  asmlinkage long sys_sethostname(char __user *name, int len);
>  asmlinkage long sys_setdomainname(char __user *name, int len);
> -- 
> 1.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Omar

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox