Linux userland API discussions
 help / color / mirror / Atom feed
* [PATCH RFC 3/6] epoll: Add definition for epoll_mod_wait structures
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1421747878-30744-1-git-send-email-famz@redhat.com>

Two structs involved in the coming syscall is defined. Flags in epoll_mod_cmd
are reserved, which makes better word alignment and may allow future extension.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 include/uapi/linux/eventpoll.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..e32a804 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
 #include <linux/fcntl.h>
 #include <linux/types.h>
 
+#include <linux/signal.h>
+
 /* Flags for epoll_create1.  */
 #define EPOLL_CLOEXEC O_CLOEXEC
 
@@ -61,6 +63,24 @@ struct epoll_event {
 	__u64 data;
 } EPOLL_PACKED;
 
+struct epoll_mod_cmd {
+	int flags;
+	int op;
+	int fd;
+	__u32 events;
+	__u64 data;
+	int error;
+} EPOLL_PACKED;
+
+struct epoll_wait_spec {
+	int maxevents;
+	struct epoll_event *events;
+	int clockid;
+	struct timespec timeout;
+	sigset_t *sigmask;
+	size_t sigsetsize;
+} EPOLL_PACKED;
+
 #ifdef CONFIG_PM_SLEEP
 static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
 {
-- 
1.9.3

^ permalink raw reply related

* [PATCH RFC 4/6] epoll: Extract ep_ctl_do
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1421747878-30744-1-git-send-email-famz@redhat.com>

This is a common part from epoll_ctl implementation which will be shared with
the coming epoll_mod_wait.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6da143f..e7a116d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1808,22 +1808,15 @@ SYSCALL_DEFINE1(epoll_create, int, size)
  * the eventpoll file that enables the insertion/removal/change of
  * file descriptors inside the interest set.
  */
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
-		struct epoll_event __user *, event)
+int ep_ctl_do(int epfd, int op, int fd, struct epoll_event epds)
 {
 	int error;
 	int full_check = 0;
 	struct fd f, tf;
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct epoll_event epds;
 	struct eventpoll *tep = NULL;
 
-	error = -EFAULT;
-	if (ep_op_has_event(op) &&
-	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
-		goto error_return;
-
 	error = -EBADF;
 	f = fdget(epfd);
 	if (!f.file)
@@ -1945,6 +1938,23 @@ error_return:
 	return error;
 }
 
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+		struct epoll_event __user *, event)
+{
+	struct epoll_event epds;
+
+	if (ep_op_has_event(op) &&
+	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
+		return -EFAULT;
+
+	return ep_ctl_do(epfd, op, fd, epds);
+}
+
 static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
 				int maxevents, int clockid, 
 				const ktime_t timeout)
-- 
1.9.3

^ permalink raw reply related

* [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1421747878-30744-1-git-send-email-famz@redhat.com>

This syscall is a sequence of

1) a number of epoll_ctl calls
2) a epoll_pwait, with timeout enhancement.

The epoll_ctl operations are embeded so that application doesn't have to use
separate syscalls to insert/delete/update the fds before poll. It is more
efficient if the set of fds varies from one poll to another, which is the
common pattern for certain applications. For example, depending on the input
buffer status, a data reading program may decide to temporarily not polling an
fd.

Because the enablement of batching in this interface, even that regular
epoll_ctl call sequence, which manipulates several fds, can be optimized to one
single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).

The only complexity is returning the result of each operation.  For each
epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
the return code *iff* the command is executed (0 for success and -errno of the
equivalent epoll_ctl call), and will be left unchanged if the command is not
executed because some earlier error, for example due to failure of
copy_from_user to copy the array.

Applications can utilize this fact to do error handling: they could initialize
all the epoll_mod_wait.error to a positive value, which is by definition not a
possible output value from epoll_mod_wait. Then when the syscall returned, they
know whether or not the command is executed by comparing each error with the
init value, if they're different, they have the result of the command.
More roughly, they can put any non-zero and not distinguish "not run" from
failure.

Also, timeout parameter is enhanced: timespec is used, compared to the old ms
scalar. This provides higher precision. The parameter field in struct
epoll_wait_spec, "clockid", also makes it possible for users to use a different
clock than the default when it makes more sense.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h |  5 ++++
 2 files changed, 65 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index e7a116d..2cc22c9 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 			      sigmask ? &ksigmask : NULL);
 }
 
+SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
+		int, ncmds, struct epoll_mod_cmd __user *, cmds,
+		struct epoll_wait_spec __user *, spec)
+{
+	struct epoll_mod_cmd *kcmds = NULL;
+	int i, ret = 0;
+	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
+
+	if (flags)
+		return -EINVAL;
+	if (ncmds) {
+		if (!cmds)
+			return -EINVAL;
+		kcmds = kmalloc(cmd_size, GFP_KERNEL);
+		if (!kcmds)
+			return -ENOMEM;
+		if (copy_from_user(kcmds, cmds, cmd_size)) {
+			ret = -EFAULT;
+			goto out;
+		}
+	}
+	for (i = 0; i < ncmds; i++) {
+		struct epoll_event ev = (struct epoll_event) {
+			.events = kcmds[i].events,
+			.data = kcmds[i].data,
+		};
+		if (kcmds[i].flags) {
+			kcmds[i].error = ret = -EINVAL;
+			goto out;
+		}
+		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
+		if (ret)
+			goto out;
+	}
+	if (spec) {
+		sigset_t ksigmask;
+		struct epoll_wait_spec kspec;
+		ktime_t timeout;
+
+		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
+			return -EFAULT;
+		if (kspec.sigmask) {
+			if (kspec.sigsetsize != sizeof(sigset_t))
+				return -EINVAL;
+			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
+				return -EFAULT;
+		}
+		timeout = timespec_to_ktime(kspec.timeout);
+		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
+				     kspec.clockid, timeout,
+				     kspec.sigmask ? &ksigmask : NULL);
+	}
+
+out:
+	if (ncmds && copy_to_user(cmds, kcmds, cmd_size))
+		return -EFAULT;
+	kfree(kcmds);
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		       struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..7156c80 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,8 @@
 #define _LINUX_SYSCALLS_H
 
 struct epoll_event;
+struct epoll_mod_cmd;
+struct epoll_wait_spec;
 struct iattr;
 struct inode;
 struct iocb;
@@ -630,6 +632,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
 				int maxevents, int timeout,
 				const sigset_t __user *sigmask,
 				size_t sigsetsize);
+asmlinkage long sys_epoll_mod_wait(int epfd, int flags,
+				   int ncmds, struct epoll_mod_cmd __user * cmds,
+				   struct epoll_wait_spec __user * spec);
 asmlinkage long sys_gethostname(char __user *name, int len);
 asmlinkage long sys_sethostname(char __user *name, int len);
 asmlinkage long sys_setdomainname(char __user *name, int len);
-- 
1.9.3

^ permalink raw reply related

* [PATCH RFC 6/6] x86: Hook up epoll_mod_wait syscall
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1421747878-30744-1-git-send-email-famz@redhat.com>

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..52aead3 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	epoll_mod_wait		sys_epoll_mod_wait
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..c3c203a 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	epoll_mod_wait		sys_epoll_mod_wait
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.3

^ permalink raw reply related

* [PATCH v5 0/5] power: Add max77693 charger driver
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel, linux-pm, devicetree,
	linux-api
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski

Dear Sebastian,


You already acked some earlier version of this patch. In that time
there were more dependencies on MFD tree. Now the MFD patch (2/5)
is acked by Lee Jones so could you pick up everything?


Changes since v4
================
1. Add patch 5/5: Update MAINTAINERS (with Sebastian's ack [2]).

Changes since v3
================
1. Patch 3/4: Don't create platform data structure for charger settings
   because driver won't be used on non-DT SoCs (its for Exynos based
   Trats2).
2. Patch 3/4: Drop Sebastian's ack [1] because of change above.

Changes since v2
================
1. Add ack from Sebastian Reichel (charger driver).
2. Drop patch "mfd: max77693: Map charger device to its own of_node"
   (applied by Lee Jones).


Changes since v1
================
1. Add patch 2/5: mfd: max77693: Map charger device to its own of_node
   (forgot to send it in v1)
2. Remove patches for DTS and configs. I'll send them later.
3. Add ack from Lee Jones (patch 3/5).


Description
===========
The patchset adds max77693 charger driver present on Trats2 board
(and Galaxy S III). The driver configures battery charger and exposes
power supply interface.

Driver is necessary to provide full charging stack on Trats2 device
(extcon, charger-manager etc.).

[1] https://lkml.org/lkml/2014/10/27/774
[2] https://lkml.org/lkml/2015/1/16/464


Best regards,
Krzysztof


Krzysztof Kozlowski (5):
  devicetree: power/mfd: max77693: Document new bindings for charger
  mfd: max77693: Add defines for MAX77693 charger driver
  power: max77693: Add charger driver for Maxim 77693
  Documentation: power: max77693-charger: Document exported sysfs entry
  MAINTAINERS: Add entry for Maxim chargers on Samsung boards

 Documentation/ABI/testing/sysfs-class-power        |  42 ++
 Documentation/devicetree/bindings/mfd/max77693.txt |  45 ++
 MAINTAINERS                                        |   7 +
 drivers/power/Kconfig                              |   6 +
 drivers/power/Makefile                             |   1 +
 drivers/power/max77693_charger.c                   | 758 +++++++++++++++++++++
 include/linux/mfd/max77693-private.h               | 108 +++
 7 files changed, 967 insertions(+)
 create mode 100644 drivers/power/max77693_charger.c

-- 
1.9.1

^ permalink raw reply

* [PATCH v5 1/5] devicetree: power/mfd: max77693: Document new bindings for charger
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel, linux-pm, devicetree,
	linux-api
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski@samsung.com>

Document new device tree bindings for Maxim 77693 charger driver.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
---
 Documentation/devicetree/bindings/mfd/max77693.txt | 45 ++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/Documentation/devicetree/bindings/mfd/max77693.txt b/Documentation/devicetree/bindings/mfd/max77693.txt
index 01e9f30fe678..38e64405e98d 100644
--- a/Documentation/devicetree/bindings/mfd/max77693.txt
+++ b/Documentation/devicetree/bindings/mfd/max77693.txt
@@ -41,6 +41,41 @@ Optional properties:
 	 To get more informations, please refer to documentaion.
 	[*] refer Documentation/devicetree/bindings/pwm/pwm.txt
 
+- charger : Node configuring the charger driver.
+  If present, required properties:
+  - compatible : Must be "maxim,max77693-charger".
+
+  Optional properties (if not set, defaults will be used):
+  - maxim,constant-microvolt : Battery constant voltage in uV. The charger
+    will operate in fast charge constant current mode till battery voltage
+    reaches this level. Then the charger will switch to fast charge constant
+    voltage mode. Also vsys (system voltage) will be set to this value when
+    DC power is supplied but charger is not enabled.
+    Valid values: 3650000 - 4400000, step by 25000 (rounded down)
+    Default: 4200000
+
+  - maxim,min-system-microvolt : Minimal system voltage in uV.
+    Valid values: 3000000 - 3700000, step by 100000 (rounded down)
+    Default: 3600000
+
+  - maxim,thermal-regulation-celsius : Temperature in Celsius for entering
+    high temperature charging mode. If die temperature exceeds this value
+    the charging current will be reduced by 105 mA/Celsius.
+    Valid values: 70, 85, 100, 115
+    Default: 100
+
+  - maxim,battery-overcurrent-microamp : Overcurrent protection threshold
+    in uA (current from battery to system).
+    Valid values: 2000000 - 3500000, step by 250000 (rounded down)
+    Default: 3500000
+
+  - maxim,charge-input-threshold-microvolt : Threshold voltage in uV for
+    triggering input voltage regulation loop. If input voltage decreases
+    below this value, the input current will be reduced to reach the
+    threshold voltage.
+    Valid values: 4300000, 4700000, 4800000, 4900000
+    Default: 4300000
+
 Example:
 	max77693@66 {
 		compatible = "maxim,max77693";
@@ -73,4 +108,14 @@ Example:
 			pwms = <&pwm 0 40000 0>;
 			pwm-names = "haptic";
 		};
+
+		charger {
+			compatible = "maxim,max77693-charger";
+
+			maxim,constant-microvolt = <4200000>;
+			maxim,min-system-microvolt = <3600000>;
+			maxim,thermal-regulation-celsius = <75>;
+			maxim,battery-overcurrent-microamp = <3000000>;
+			maxim,charge-input-threshold-microvolt = <4300000>;
+		};
 	};
-- 
1.9.1

^ permalink raw reply related

* [PATCH v5 2/5] mfd: max77693: Add defines for MAX77693 charger driver
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>

Prepare for adding support for Maxim 77693 charger by adding necessary
new defines.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
Acked-by: Lee Jones <lee.jones-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
---
 include/linux/mfd/max77693-private.h | 108 +++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/include/linux/mfd/max77693-private.h b/include/linux/mfd/max77693-private.h
index 08dae01258b9..955dd990beaf 100644
--- a/include/linux/mfd/max77693-private.h
+++ b/include/linux/mfd/max77693-private.h
@@ -143,10 +143,118 @@ enum max77693_pmic_reg {
 #define FLASH_INT_FLED1_SHORT	BIT(3)
 #define FLASH_INT_OVER_CURRENT	BIT(4)
 
+/* Fast charge timer in in hours */
+#define DEFAULT_FAST_CHARGE_TIMER		4
+/* microamps */
+#define DEFAULT_TOP_OFF_THRESHOLD_CURRENT	150000
+/* minutes */
+#define DEFAULT_TOP_OFF_TIMER			30
+/* microvolts */
+#define DEFAULT_CONSTANT_VOLT			4200000
+/* microvolts */
+#define DEFAULT_MIN_SYSTEM_VOLT			3600000
+/* celsius */
+#define DEFAULT_THERMAL_REGULATION_TEMP		100
+/* microamps */
+#define DEFAULT_BATTERY_OVERCURRENT		3500000
+/* microvolts */
+#define DEFAULT_CHARGER_INPUT_THRESHOLD_VOLT	4300000
+
+/* MAX77693_CHG_REG_CHG_INT_OK register */
+#define CHG_INT_OK_BYP_SHIFT		0
+#define CHG_INT_OK_BAT_SHIFT		3
+#define CHG_INT_OK_CHG_SHIFT		4
+#define CHG_INT_OK_CHGIN_SHIFT		6
+#define CHG_INT_OK_DETBAT_SHIFT		7
+#define CHG_INT_OK_BYP_MASK		BIT(CHG_INT_OK_BYP_SHIFT)
+#define CHG_INT_OK_BAT_MASK		BIT(CHG_INT_OK_BAT_SHIFT)
+#define CHG_INT_OK_CHG_MASK		BIT(CHG_INT_OK_CHG_SHIFT)
+#define CHG_INT_OK_CHGIN_MASK		BIT(CHG_INT_OK_CHGIN_SHIFT)
+#define CHG_INT_OK_DETBAT_MASK		BIT(CHG_INT_OK_DETBAT_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_DETAILS_00 register */
+#define CHG_DETAILS_00_CHGIN_SHIFT	5
+#define CHG_DETAILS_00_CHGIN_MASK	(0x3 << CHG_DETAILS_00_CHGIN_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_DETAILS_01 register */
+#define CHG_DETAILS_01_CHG_SHIFT	0
+#define CHG_DETAILS_01_BAT_SHIFT	4
+#define CHG_DETAILS_01_TREG_SHIFT	7
+#define CHG_DETAILS_01_CHG_MASK		(0xf << CHG_DETAILS_01_CHG_SHIFT)
+#define CHG_DETAILS_01_BAT_MASK		(0x7 << CHG_DETAILS_01_BAT_SHIFT)
+#define CHG_DETAILS_01_TREG_MASK	BIT(7)
+
+/* MAX77693_CHG_REG_CHG_DETAILS_01/CHG field */
+enum max77693_charger_charging_state {
+	MAX77693_CHARGING_PREQUALIFICATION	= 0x0,
+	MAX77693_CHARGING_FAST_CONST_CURRENT,
+	MAX77693_CHARGING_FAST_CONST_VOLTAGE,
+	MAX77693_CHARGING_TOP_OFF,
+	MAX77693_CHARGING_DONE,
+	MAX77693_CHARGING_HIGH_TEMP,
+	MAX77693_CHARGING_TIMER_EXPIRED,
+	MAX77693_CHARGING_THERMISTOR_SUSPEND,
+	MAX77693_CHARGING_OFF,
+	MAX77693_CHARGING_RESERVED,
+	MAX77693_CHARGING_OVER_TEMP,
+	MAX77693_CHARGING_WATCHDOG_EXPIRED,
+};
+
+/* MAX77693_CHG_REG_CHG_DETAILS_01/BAT field */
+enum max77693_charger_battery_state {
+	MAX77693_BATTERY_NOBAT			= 0x0,
+	/* Dead-battery or low-battery prequalification */
+	MAX77693_BATTERY_PREQUALIFICATION,
+	MAX77693_BATTERY_TIMER_EXPIRED,
+	MAX77693_BATTERY_GOOD,
+	MAX77693_BATTERY_LOWVOLTAGE,
+	MAX77693_BATTERY_OVERVOLTAGE,
+	MAX77693_BATTERY_OVERCURRENT,
+	MAX77693_BATTERY_RESERVED,
+};
+
+/* MAX77693_CHG_REG_CHG_DETAILS_02 register */
+#define CHG_DETAILS_02_BYP_SHIFT	0
+#define CHG_DETAILS_02_BYP_MASK		(0xf << CHG_DETAILS_02_BYP_SHIFT)
+
 /* MAX77693 CHG_CNFG_00 register */
 #define CHG_CNFG_00_CHG_MASK		0x1
 #define CHG_CNFG_00_BUCK_MASK		0x4
 
+/* MAX77693_CHG_REG_CHG_CNFG_01 register */
+#define CHG_CNFG_01_FCHGTIME_SHIFT	0
+#define CHG_CNFG_01_CHGRSTRT_SHIFT	4
+#define CHG_CNFG_01_PQEN_SHIFT		7
+#define CHG_CNFG_01_FCHGTIME_MASK	(0x7 << CHG_CNFG_01_FCHGTIME_SHIFT)
+#define CHG_CNFG_01_CHGRSTRT_MASK	(0x3 << CHG_CNFG_01_CHGRSTRT_SHIFT)
+#define CHG_CNFG_01_PQEN_MAKS		BIT(CHG_CNFG_01_PQEN_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_CNFG_03 register */
+#define CHG_CNFG_03_TOITH_SHIFT		0
+#define CHG_CNFG_03_TOTIME_SHIFT	3
+#define CHG_CNFG_03_TOITH_MASK		(0x7 << CHG_CNFG_03_TOITH_SHIFT)
+#define CHG_CNFG_03_TOTIME_MASK		(0x7 << CHG_CNFG_03_TOTIME_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_CNFG_04 register */
+#define CHG_CNFG_04_CHGCVPRM_SHIFT	0
+#define CHG_CNFG_04_MINVSYS_SHIFT	5
+#define CHG_CNFG_04_CHGCVPRM_MASK	(0x1f << CHG_CNFG_04_CHGCVPRM_SHIFT)
+#define CHG_CNFG_04_MINVSYS_MASK	(0x7 << CHG_CNFG_04_MINVSYS_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_CNFG_06 register */
+#define CHG_CNFG_06_CHGPROT_SHIFT	2
+#define CHG_CNFG_06_CHGPROT_MASK	(0x3 << CHG_CNFG_06_CHGPROT_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_CNFG_07 register */
+#define CHG_CNFG_07_REGTEMP_SHIFT	5
+#define CHG_CNFG_07_REGTEMP_MASK	(0x3 << CHG_CNFG_07_REGTEMP_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_CNFG_12 register */
+#define CHG_CNFG_12_B2SOVRC_SHIFT	0
+#define CHG_CNFG_12_VCHGINREG_SHIFT	3
+#define CHG_CNFG_12_B2SOVRC_MASK	(0x7 << CHG_CNFG_12_B2SOVRC_SHIFT)
+#define CHG_CNFG_12_VCHGINREG_MASK	(0x3 << CHG_CNFG_12_VCHGINREG_SHIFT)
+
 /* MAX77693 CHG_CNFG_09 Register */
 #define CHG_CNFG_09_CHGIN_ILIM_MASK	0x7F
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v5 3/5] power: max77693: Add charger driver for Maxim 77693
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel, linux-pm, devicetree,
	linux-api
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski@samsung.com>

Add new driver for Maxim 77693 switch-mode charger (part of max77693
MFD driver) providing power supply class information to userspace.

The charger has +20V tolerant input. Current input can be set from 0 to
2.58 A. The charger can deliver up to 2.1 A to the battery or 3.5 A to
the system (when supplying additional current from battery to system).

The driver is configured through DTS (battery and system related
settings) and sysfs entries (timers and top-off charging threshold).

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
---
 drivers/power/Kconfig            |   6 +
 drivers/power/Makefile           |   1 +
 drivers/power/max77693_charger.c | 758 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 765 insertions(+)
 create mode 100644 drivers/power/max77693_charger.c

diff --git a/drivers/power/Kconfig b/drivers/power/Kconfig
index da6981f92697..110d4bc03483 100644
--- a/drivers/power/Kconfig
+++ b/drivers/power/Kconfig
@@ -332,6 +332,12 @@ config CHARGER_MAX14577
 	  Say Y to enable support for the battery charger control sysfs and
 	  platform data of MAX14577/77836 MUICs.
 
+config CHARGER_MAX77693
+	tristate "Maxim MAX77693 battery charger driver"
+	depends on MFD_MAX77693
+	help
+	  Say Y to enable support for the Maxim MAX77693 battery charger.
+
 config CHARGER_MAX8997
 	tristate "Maxim MAX8997/MAX8966 PMIC battery charger driver"
 	depends on MFD_MAX8997 && REGULATOR_MAX8997
diff --git a/drivers/power/Makefile b/drivers/power/Makefile
index b83a0c749781..31216cb7e8a1 100644
--- a/drivers/power/Makefile
+++ b/drivers/power/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_CHARGER_LP8788)	+= lp8788-charger.o
 obj-$(CONFIG_CHARGER_GPIO)	+= gpio-charger.o
 obj-$(CONFIG_CHARGER_MANAGER)	+= charger-manager.o
 obj-$(CONFIG_CHARGER_MAX14577)	+= max14577_charger.o
+obj-$(CONFIG_CHARGER_MAX77693)	+= max77693_charger.o
 obj-$(CONFIG_CHARGER_MAX8997)	+= max8997_charger.o
 obj-$(CONFIG_CHARGER_MAX8998)	+= max8998_charger.o
 obj-$(CONFIG_CHARGER_BQ2415X)	+= bq2415x_charger.o
diff --git a/drivers/power/max77693_charger.c b/drivers/power/max77693_charger.c
new file mode 100644
index 000000000000..56cf2177aad4
--- /dev/null
+++ b/drivers/power/max77693_charger.c
@@ -0,0 +1,758 @@
+/*
+ * max77693_charger.c - Battery charger driver for the Maxim 77693
+ *
+ * Copyright (C) 2014 Samsung Electronics
+ * Krzysztof Kozlowski <k.kozlowski@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/power_supply.h>
+#include <linux/regmap.h>
+#include <linux/mfd/max77693.h>
+#include <linux/mfd/max77693-private.h>
+
+static const char *max77693_charger_name		= "max77693-charger";
+static const char *max77693_charger_model		= "MAX77693";
+static const char *max77693_charger_manufacturer	= "Maxim Integrated";
+
+struct max77693_charger {
+	struct device		*dev;
+	struct max77693_dev	*max77693;
+	struct power_supply	charger;
+
+	u32 constant_volt;
+	u32 min_system_volt;
+	u32 thermal_regulation_temp;
+	u32 batttery_overcurrent;
+	u32 charge_input_threshold_volt;
+};
+
+static int max77693_get_charger_state(struct regmap *regmap)
+{
+	int state;
+	unsigned int data;
+
+	if (regmap_read(regmap, MAX77693_CHG_REG_CHG_DETAILS_01, &data) < 0)
+		return POWER_SUPPLY_STATUS_UNKNOWN;
+
+	data &= CHG_DETAILS_01_CHG_MASK;
+	data >>= CHG_DETAILS_01_CHG_SHIFT;
+
+	switch (data) {
+	case MAX77693_CHARGING_PREQUALIFICATION:
+	case MAX77693_CHARGING_FAST_CONST_CURRENT:
+	case MAX77693_CHARGING_FAST_CONST_VOLTAGE:
+	case MAX77693_CHARGING_TOP_OFF:
+	/* In high temp the charging current is reduced, but still charging */
+	case MAX77693_CHARGING_HIGH_TEMP:
+		state = POWER_SUPPLY_STATUS_CHARGING;
+		break;
+	case MAX77693_CHARGING_DONE:
+		state = POWER_SUPPLY_STATUS_FULL;
+		break;
+	case MAX77693_CHARGING_TIMER_EXPIRED:
+	case MAX77693_CHARGING_THERMISTOR_SUSPEND:
+		state = POWER_SUPPLY_STATUS_NOT_CHARGING;
+		break;
+	case MAX77693_CHARGING_OFF:
+	case MAX77693_CHARGING_OVER_TEMP:
+	case MAX77693_CHARGING_WATCHDOG_EXPIRED:
+		state = POWER_SUPPLY_STATUS_DISCHARGING;
+		break;
+	case MAX77693_CHARGING_RESERVED:
+	default:
+		state = POWER_SUPPLY_STATUS_UNKNOWN;
+	}
+
+	return state;
+}
+
+static int max77693_get_charge_type(struct regmap *regmap)
+{
+	int state;
+	unsigned int data;
+
+	if (regmap_read(regmap, MAX77693_CHG_REG_CHG_DETAILS_01, &data) < 0)
+		return POWER_SUPPLY_CHARGE_TYPE_UNKNOWN;
+
+	data &= CHG_DETAILS_01_CHG_MASK;
+	data >>= CHG_DETAILS_01_CHG_SHIFT;
+
+	switch (data) {
+	case MAX77693_CHARGING_PREQUALIFICATION:
+	/*
+	 * Top-off: trickle or fast? In top-off the current varies between
+	 * 100 and 250 mA. It is higher than prequalification current.
+	 */
+	case MAX77693_CHARGING_TOP_OFF:
+		state = POWER_SUPPLY_CHARGE_TYPE_TRICKLE;
+		break;
+	case MAX77693_CHARGING_FAST_CONST_CURRENT:
+	case MAX77693_CHARGING_FAST_CONST_VOLTAGE:
+	/* In high temp the charging current is reduced, but still charging */
+	case MAX77693_CHARGING_HIGH_TEMP:
+		state = POWER_SUPPLY_CHARGE_TYPE_FAST;
+		break;
+	case MAX77693_CHARGING_DONE:
+	case MAX77693_CHARGING_TIMER_EXPIRED:
+	case MAX77693_CHARGING_THERMISTOR_SUSPEND:
+	case MAX77693_CHARGING_OFF:
+	case MAX77693_CHARGING_OVER_TEMP:
+	case MAX77693_CHARGING_WATCHDOG_EXPIRED:
+		state = POWER_SUPPLY_CHARGE_TYPE_NONE;
+		break;
+	case MAX77693_CHARGING_RESERVED:
+	default:
+		state = POWER_SUPPLY_CHARGE_TYPE_UNKNOWN;
+	}
+
+	return state;
+}
+
+/*
+ * Supported health statuses:
+ *  - POWER_SUPPLY_HEALTH_DEAD
+ *  - POWER_SUPPLY_HEALTH_GOOD
+ *  - POWER_SUPPLY_HEALTH_OVERVOLTAGE
+ *  - POWER_SUPPLY_HEALTH_SAFETY_TIMER_EXPIRE
+ *  - POWER_SUPPLY_HEALTH_UNKNOWN
+ *  - POWER_SUPPLY_HEALTH_UNSPEC_FAILURE
+ */
+static int max77693_get_battery_health(struct regmap *regmap)
+{
+	int state;
+	unsigned int data;
+
+	if (regmap_read(regmap, MAX77693_CHG_REG_CHG_DETAILS_01, &data) < 0)
+		return POWER_SUPPLY_HEALTH_UNKNOWN;
+
+	data &= CHG_DETAILS_01_BAT_MASK;
+	data >>= CHG_DETAILS_01_BAT_SHIFT;
+
+	switch (data) {
+	case MAX77693_BATTERY_NOBAT:
+		state = POWER_SUPPLY_HEALTH_DEAD;
+		break;
+	case MAX77693_BATTERY_PREQUALIFICATION:
+	case MAX77693_BATTERY_GOOD:
+	case MAX77693_BATTERY_LOWVOLTAGE:
+		state = POWER_SUPPLY_HEALTH_GOOD;
+		break;
+	case MAX77693_BATTERY_TIMER_EXPIRED:
+		/*
+		 * Took longer to charge than expected, charging suspended.
+		 * Damaged battery?
+		 */
+		state = POWER_SUPPLY_HEALTH_SAFETY_TIMER_EXPIRE;
+		break;
+	case MAX77693_BATTERY_OVERVOLTAGE:
+		state = POWER_SUPPLY_HEALTH_OVERVOLTAGE;
+		break;
+	case MAX77693_BATTERY_OVERCURRENT:
+		state = POWER_SUPPLY_HEALTH_UNSPEC_FAILURE;
+		break;
+	case MAX77693_BATTERY_RESERVED:
+	default:
+		state = POWER_SUPPLY_HEALTH_UNKNOWN;
+		break;
+	}
+
+	return state;
+}
+
+static int max77693_get_present(struct regmap *regmap)
+{
+	unsigned int data;
+
+	/*
+	 * Read CHG_INT_OK register. High DETBAT bit here should be
+	 * equal to value 0x0 in CHG_DETAILS_01/BAT field.
+	 */
+	regmap_read(regmap, MAX77693_CHG_REG_CHG_INT_OK, &data);
+	if (data & CHG_INT_OK_DETBAT_MASK)
+		return 0;
+	return 1;
+}
+
+static int max77693_get_online(struct regmap *regmap)
+{
+	unsigned int data;
+
+	regmap_read(regmap, MAX77693_CHG_REG_CHG_INT_OK, &data);
+	if (data & CHG_INT_OK_CHGIN_MASK)
+		return 1;
+	return 0;
+}
+
+static enum power_supply_property max77693_charger_props[] = {
+	POWER_SUPPLY_PROP_STATUS,
+	POWER_SUPPLY_PROP_CHARGE_TYPE,
+	POWER_SUPPLY_PROP_HEALTH,
+	POWER_SUPPLY_PROP_PRESENT,
+	POWER_SUPPLY_PROP_ONLINE,
+	POWER_SUPPLY_PROP_MODEL_NAME,
+	POWER_SUPPLY_PROP_MANUFACTURER,
+};
+
+static int max77693_charger_get_property(struct power_supply *psy,
+			    enum power_supply_property psp,
+			    union power_supply_propval *val)
+{
+	struct max77693_charger *chg = container_of(psy,
+						  struct max77693_charger,
+						  charger);
+	struct regmap *regmap = chg->max77693->regmap;
+	int ret = 0;
+
+	switch (psp) {
+	case POWER_SUPPLY_PROP_STATUS:
+		val->intval = max77693_get_charger_state(regmap);
+		break;
+	case POWER_SUPPLY_PROP_CHARGE_TYPE:
+		val->intval = max77693_get_charge_type(regmap);
+		break;
+	case POWER_SUPPLY_PROP_HEALTH:
+		val->intval = max77693_get_battery_health(regmap);
+		break;
+	case POWER_SUPPLY_PROP_PRESENT:
+		val->intval = max77693_get_present(regmap);
+		break;
+	case POWER_SUPPLY_PROP_ONLINE:
+		val->intval = max77693_get_online(regmap);
+		break;
+	case POWER_SUPPLY_PROP_MODEL_NAME:
+		val->strval = max77693_charger_model;
+		break;
+	case POWER_SUPPLY_PROP_MANUFACTURER:
+		val->strval = max77693_charger_manufacturer;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+static ssize_t device_attr_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count,
+		int (*fn)(struct max77693_charger *, unsigned long))
+{
+	struct max77693_charger *chg = dev_get_drvdata(dev);
+	unsigned long val;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	ret = fn(chg, val);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static ssize_t fast_charge_timer_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct max77693_charger *chg = dev_get_drvdata(dev);
+	unsigned int data, val;
+	int ret;
+
+	ret = regmap_read(chg->max77693->regmap, MAX77693_CHG_REG_CHG_CNFG_01,
+			&data);
+	if (ret < 0)
+		return ret;
+
+	data &= CHG_CNFG_01_FCHGTIME_MASK;
+	data >>= CHG_CNFG_01_FCHGTIME_SHIFT;
+	switch (data) {
+	case 0x1 ... 0x7:
+		/* Starting from 4 hours, step by 2 hours */
+		val = 4 + (data - 1) * 2;
+		break;
+	case 0x0:
+	default:
+		val = 0;
+		break;
+	}
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", val);
+}
+
+static int max77693_set_fast_charge_timer(struct max77693_charger *chg,
+		unsigned long hours)
+{
+	unsigned int data;
+
+	/*
+	 * 0x00 - disable
+	 * 0x01 - 4h
+	 * 0x02 - 6h
+	 * ...
+	 * 0x07 - 16h
+	 * Round down odd values.
+	 */
+	switch (hours) {
+	case 4 ... 16:
+		data = (hours - 4) / 2 + 1;
+		break;
+	case 0:
+		/* Disable */
+		data = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+	data <<= CHG_CNFG_01_FCHGTIME_SHIFT;
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_01,
+			CHG_CNFG_01_FCHGTIME_MASK, data);
+}
+
+static ssize_t fast_charge_timer_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	return device_attr_store(dev, attr, buf, count,
+			max77693_set_fast_charge_timer);
+}
+
+static ssize_t top_off_threshold_current_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct max77693_charger *chg = dev_get_drvdata(dev);
+	unsigned int data, val;
+	int ret;
+
+	ret = regmap_read(chg->max77693->regmap, MAX77693_CHG_REG_CHG_CNFG_03,
+			&data);
+	if (ret < 0)
+		return ret;
+
+	data &= CHG_CNFG_03_TOITH_MASK;
+	data >>= CHG_CNFG_03_TOITH_SHIFT;
+
+	if (data <= 0x04)
+		val = 100000 + data * 25000;
+	else
+		val = data * 50000;
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", val);
+}
+
+static int max77693_set_top_off_threshold_current(struct max77693_charger *chg,
+		unsigned long uamp)
+{
+	unsigned int data;
+
+	if (uamp < 100000 || uamp > 350000)
+		return -EINVAL;
+
+	if (uamp <= 200000)
+		data = (uamp - 100000) / 25000;
+	else
+		/* (200000, 350000> */
+		data = uamp / 50000;
+
+	data <<= CHG_CNFG_03_TOITH_SHIFT;
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_03,
+			CHG_CNFG_03_TOITH_MASK, data);
+}
+
+static ssize_t top_off_threshold_current_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	return device_attr_store(dev, attr, buf, count,
+			max77693_set_top_off_threshold_current);
+}
+
+static ssize_t top_off_timer_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct max77693_charger *chg = dev_get_drvdata(dev);
+	unsigned int data, val;
+	int ret;
+
+	ret = regmap_read(chg->max77693->regmap, MAX77693_CHG_REG_CHG_CNFG_03,
+			&data);
+	if (ret < 0)
+		return ret;
+
+	data &= CHG_CNFG_03_TOTIME_MASK;
+	data >>= CHG_CNFG_03_TOTIME_SHIFT;
+
+	val = data * 10;
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", val);
+}
+
+static int max77693_set_top_off_timer(struct max77693_charger *chg,
+		unsigned long minutes)
+{
+	unsigned int data;
+
+	if (minutes > 70)
+		return -EINVAL;
+
+	data = minutes / 10;
+	data <<= CHG_CNFG_03_TOTIME_SHIFT;
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_03,
+			CHG_CNFG_03_TOTIME_MASK, data);
+}
+
+static ssize_t top_off_timer_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	return device_attr_store(dev, attr, buf, count,
+			max77693_set_top_off_timer);
+}
+
+static DEVICE_ATTR_RW(fast_charge_timer);
+static DEVICE_ATTR_RW(top_off_threshold_current);
+static DEVICE_ATTR_RW(top_off_timer);
+
+static int max77693_set_constant_volt(struct max77693_charger *chg,
+		unsigned int uvolt)
+{
+	unsigned int data;
+
+	/*
+	 * 0x00 - 3.650 V
+	 * 0x01 - 3.675 V
+	 * ...
+	 * 0x1b - 4.325 V
+	 * 0x1c - 4.340 V
+	 * 0x1d - 4.350 V
+	 * 0x1e - 4.375 V
+	 * 0x1f - 4.400 V
+	 */
+	if (uvolt >= 3650000 && uvolt < 4340000)
+		data = (uvolt - 3650000) / 25000;
+	else if (uvolt >= 4340000 && uvolt < 4350000)
+		data = 0x1c;
+	else if (uvolt >= 4350000 && uvolt <= 4400000)
+		data = 0x1d + (uvolt - 4350000) / 25000;
+	else {
+		dev_err(chg->dev, "Wrong value for charging constant voltage\n");
+		return -EINVAL;
+	}
+
+	data <<= CHG_CNFG_04_CHGCVPRM_SHIFT;
+
+	dev_dbg(chg->dev, "Charging constant voltage: %u (0x%x)\n", uvolt,
+			data);
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_04,
+			CHG_CNFG_04_CHGCVPRM_MASK, data);
+}
+
+static int max77693_set_min_system_volt(struct max77693_charger *chg,
+		unsigned int uvolt)
+{
+	unsigned int data;
+
+	if (uvolt < 3000000 || uvolt > 3700000) {
+		dev_err(chg->dev, "Wrong value for minimum system regulation voltage\n");
+		return -EINVAL;
+	}
+
+	data = (uvolt - 3000000) / 100000;
+
+	data <<= CHG_CNFG_04_MINVSYS_SHIFT;
+
+	dev_dbg(chg->dev, "Minimum system regulation voltage: %u (0x%x)\n",
+			uvolt, data);
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_04,
+			CHG_CNFG_04_MINVSYS_MASK, data);
+}
+
+static int max77693_set_thermal_regulation_temp(struct max77693_charger *chg,
+		unsigned int cels)
+{
+	unsigned int data;
+
+	switch (cels) {
+	case 70:
+	case 85:
+	case 100:
+	case 115:
+		data = (cels - 70) / 15;
+		break;
+	default:
+		dev_err(chg->dev, "Wrong value for thermal regulation loop temperature\n");
+		return -EINVAL;
+	}
+
+	data <<= CHG_CNFG_07_REGTEMP_SHIFT;
+
+	dev_dbg(chg->dev, "Thermal regulation loop temperature: %u (0x%x)\n",
+			cels, data);
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_07,
+			CHG_CNFG_07_REGTEMP_MASK, data);
+}
+
+static int max77693_set_batttery_overcurrent(struct max77693_charger *chg,
+		unsigned int uamp)
+{
+	unsigned int data;
+
+	if (uamp && (uamp < 2000000 || uamp > 3500000)) {
+		dev_err(chg->dev, "Wrong value for battery overcurrent\n");
+		return -EINVAL;
+	}
+
+	if (uamp)
+		data = ((uamp - 2000000) / 250000) + 1;
+	else
+		data = 0; /* disable */
+
+	data <<= CHG_CNFG_12_B2SOVRC_SHIFT;
+
+	dev_dbg(chg->dev, "Battery overcurrent: %u (0x%x)\n", uamp, data);
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_12,
+			CHG_CNFG_12_B2SOVRC_MASK, data);
+}
+
+static int max77693_set_charge_input_threshold_volt(struct max77693_charger *chg,
+		unsigned int uvolt)
+{
+	unsigned int data;
+
+	switch (uvolt) {
+	case 4300000:
+		data = 0x0;
+		break;
+	case 4700000:
+	case 4800000:
+	case 4900000:
+		data = (uvolt - 4700000) / 100000;
+	default:
+		dev_err(chg->dev, "Wrong value for charge input voltage regulation threshold\n");
+		return -EINVAL;
+	}
+
+	data <<= CHG_CNFG_12_VCHGINREG_SHIFT;
+
+	dev_dbg(chg->dev, "Charge input voltage regulation threshold: %u (0x%x)\n",
+			uvolt, data);
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_12,
+			CHG_CNFG_12_VCHGINREG_MASK, data);
+}
+
+/*
+ * Sets charger registers to proper and safe default values.
+ */
+static int max77693_reg_init(struct max77693_charger *chg)
+{
+	int ret;
+	unsigned int data;
+
+	/* Unlock charger register protection */
+	data = (0x3 << CHG_CNFG_06_CHGPROT_SHIFT);
+	ret = regmap_update_bits(chg->max77693->regmap,
+				MAX77693_CHG_REG_CHG_CNFG_06,
+				CHG_CNFG_06_CHGPROT_MASK, data);
+	if (ret) {
+		dev_err(chg->dev, "Error unlocking registers: %d\n", ret);
+		return ret;
+	}
+
+	ret = max77693_set_fast_charge_timer(chg, DEFAULT_FAST_CHARGE_TIMER);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_top_off_threshold_current(chg,
+			DEFAULT_TOP_OFF_THRESHOLD_CURRENT);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_top_off_timer(chg, DEFAULT_TOP_OFF_TIMER);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_constant_volt(chg, chg->constant_volt);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_min_system_volt(chg, chg->min_system_volt);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_thermal_regulation_temp(chg,
+			chg->thermal_regulation_temp);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_batttery_overcurrent(chg, chg->batttery_overcurrent);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_charge_input_threshold_volt(chg,
+			chg->charge_input_threshold_volt);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+#ifdef CONFIG_OF
+static int max77693_dt_init(struct device *dev, struct max77693_charger *chg)
+{
+	struct device_node *np = dev->of_node;
+
+	if (!np) {
+		dev_err(dev, "no charger OF node\n");
+		return -EINVAL;
+	}
+
+	if (of_property_read_u32(np, "maxim,constant-microvolt",
+			&chg->constant_volt))
+		chg->constant_volt = DEFAULT_CONSTANT_VOLT;
+
+	if (of_property_read_u32(np, "maxim,min-system-microvolt",
+			&chg->min_system_volt))
+		chg->min_system_volt = DEFAULT_MIN_SYSTEM_VOLT;
+
+	if (of_property_read_u32(np, "maxim,thermal-regulation-celsius",
+			&chg->thermal_regulation_temp))
+		chg->thermal_regulation_temp = DEFAULT_THERMAL_REGULATION_TEMP;
+
+	if (of_property_read_u32(np, "maxim,battery-overcurrent-microamp",
+			&chg->batttery_overcurrent))
+		chg->batttery_overcurrent = DEFAULT_BATTERY_OVERCURRENT;
+
+	if (of_property_read_u32(np, "maxim,charge-input-threshold-microvolt",
+			&chg->charge_input_threshold_volt))
+		chg->charge_input_threshold_volt =
+			DEFAULT_CHARGER_INPUT_THRESHOLD_VOLT;
+
+	return 0;
+}
+#else /* CONFIG_OF */
+static int max77693_dt_init(struct device *dev, struct max77693_charger *chg)
+{
+	return 0;
+}
+#endif /* CONFIG_OF */
+
+static int max77693_charger_probe(struct platform_device *pdev)
+{
+	struct max77693_charger *chg;
+	struct max77693_dev *max77693 = dev_get_drvdata(pdev->dev.parent);
+	int ret;
+
+	chg = devm_kzalloc(&pdev->dev, sizeof(*chg), GFP_KERNEL);
+	if (!chg)
+		return -ENOMEM;
+
+	platform_set_drvdata(pdev, chg);
+	chg->dev = &pdev->dev;
+	chg->max77693 = max77693;
+
+	ret = max77693_dt_init(&pdev->dev, chg);
+	if (ret)
+		return ret;
+
+	ret = max77693_reg_init(chg);
+	if (ret)
+		return ret;
+
+	chg->charger.name = max77693_charger_name;
+	chg->charger.type = POWER_SUPPLY_TYPE_BATTERY;
+	chg->charger.properties = max77693_charger_props;
+	chg->charger.num_properties = ARRAY_SIZE(max77693_charger_props);
+	chg->charger.get_property = max77693_charger_get_property;
+
+	ret = device_create_file(&pdev->dev, &dev_attr_fast_charge_timer);
+	if (ret) {
+		dev_err(&pdev->dev, "failed: create fast charge timer sysfs entry\n");
+		goto err;
+	}
+
+	ret = device_create_file(&pdev->dev,
+			&dev_attr_top_off_threshold_current);
+	if (ret) {
+		dev_err(&pdev->dev, "failed: create top off current sysfs entry\n");
+		goto err;
+	}
+
+	ret = device_create_file(&pdev->dev, &dev_attr_top_off_timer);
+	if (ret) {
+		dev_err(&pdev->dev, "failed: create top off timer sysfs entry\n");
+		goto err;
+	}
+
+	ret = power_supply_register(&pdev->dev, &chg->charger);
+	if (ret) {
+		dev_err(&pdev->dev, "failed: power supply register\n");
+		goto err;
+	}
+
+	return 0;
+
+err:
+	device_remove_file(&pdev->dev, &dev_attr_top_off_timer);
+	device_remove_file(&pdev->dev, &dev_attr_top_off_threshold_current);
+	device_remove_file(&pdev->dev, &dev_attr_fast_charge_timer);
+
+	return ret;
+}
+
+static int max77693_charger_remove(struct platform_device *pdev)
+{
+	struct max77693_charger *chg = platform_get_drvdata(pdev);
+
+	device_remove_file(&pdev->dev, &dev_attr_top_off_timer);
+	device_remove_file(&pdev->dev, &dev_attr_top_off_threshold_current);
+	device_remove_file(&pdev->dev, &dev_attr_fast_charge_timer);
+
+	power_supply_unregister(&chg->charger);
+
+	return 0;
+}
+
+static const struct platform_device_id max77693_charger_id[] = {
+	{ "max77693-charger", 0, },
+	{ }
+};
+MODULE_DEVICE_TABLE(platform, max77693_charger_id);
+
+static struct platform_driver max77693_charger_driver = {
+	.driver = {
+		.owner	= THIS_MODULE,
+		.name	= "max77693-charger",
+	},
+	.probe		= max77693_charger_probe,
+	.remove		= max77693_charger_remove,
+	.id_table	= max77693_charger_id,
+};
+module_platform_driver(max77693_charger_driver);
+
+MODULE_AUTHOR("Krzysztof Kozlowski <k.kozlowski@samsung.com>");
+MODULE_DESCRIPTION("Maxim 77693 charger driver");
+MODULE_LICENSE("GPL");
-- 
1.9.1


^ permalink raw reply related

* [PATCH v5 4/5] Documentation: power: max77693-charger: Document exported sysfs entry
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel, linux-pm, devicetree,
	linux-api
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski@samsung.com>

Document the settings exported by max77693 charger driver through sysfs
entries:
 - fast_charge_timer
 - top_off_threshold_current
 - top_off_timer

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
---
 Documentation/ABI/testing/sysfs-class-power | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-power b/Documentation/ABI/testing/sysfs-class-power
index 909e7602c717..369d2a2d7d3e 100644
--- a/Documentation/ABI/testing/sysfs-class-power
+++ b/Documentation/ABI/testing/sysfs-class-power
@@ -32,3 +32,45 @@ Description:
 		Valid values:
 		- 5, 6 or 7 (hours),
 		- 0: disabled.
+
+What:		/sys/class/power_supply/max77693-charger/device/fast_charge_timer
+Date:		January 2015
+KernelVersion:	3.19.0
+Contact:	Krzysztof Kozlowski <k.kozlowski@samsung.com>
+Description:
+		This entry shows and sets the maximum time the max77693
+		charger operates in fast-charge mode. When the timer expires
+		the device will terminate fast-charge mode (charging current
+		will drop to 0 A) and will trigger interrupt.
+
+		Valid values:
+		- 4 - 16 (hours), step by 2 (rounded down)
+		- 0: disabled.
+
+What:		/sys/class/power_supply/max77693-charger/device/top_off_threshold_current
+Date:		January 2015
+KernelVersion:	3.19.0
+Contact:	Krzysztof Kozlowski <k.kozlowski@samsung.com>
+Description:
+		This entry shows and sets the charging current threshold for
+		entering top-off charging mode. When charging current in fast
+		charge mode drops below this value, the charger will trigger
+		interrupt and start top-off charging mode.
+
+		Valid values:
+		- 100000 - 200000 (microamps), step by 25000 (rounded down)
+		- 200000 - 350000 (microamps), step by 50000 (rounded down)
+		- 0: disabled.
+
+What:		/sys/class/power_supply/max77693-charger/device/top_off_timer
+Date:		January 2015
+KernelVersion:	3.19.0
+Contact:	Krzysztof Kozlowski <k.kozlowski@samsung.com>
+Description:
+		This entry shows and sets the maximum time the max77693
+		charger operates in top-off charge mode. When the timer expires
+		the device will terminate top-off charge mode (charging current
+		will drop to 0 A) and will trigger interrupt.
+
+		Valid values:
+		- 0 - 70 (minutes), step by 10 (rounded down)
-- 
1.9.1

^ permalink raw reply related

* [PATCH v5 5/5] MAINTAINERS: Add entry for Maxim chargers on Samsung boards
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>

Add myself as supporter to help in reviewing patches for Maxim 14577 and
77693 MUIC charger drivers. These are used on Exynos-based boards
(Trats 2, Gear 1 and Gear 2).

Signed-off-by: Krzysztof Kozlowski <k.kozlowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
Cc: Sebastian Reichel <sre-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Dmitry Eremin-Solenikov <dbaryshkov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: David Woodhouse <dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Acked-By: Sebastian Reichel <sre-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index dbc17b3b51d4..c6fbb8cb58de 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6152,6 +6152,13 @@ F:	Documentation/devicetree/bindings/i2c/max6697.txt
 F:	drivers/hwmon/max6697.c
 F:	include/linux/platform_data/max6697.h
 
+MAXIM MUIC CHARGER DRIVERS FOR EXYNOS BASED BOARDS
+M:	Krzysztof Kozlowski <k.kozlowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
+L:	linux-pm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+S:	Supported
+F:	drivers/power/max14577_charger.c
+F:	drivers/power/max77693_charger.c
+
 MAXIRADIO FM RADIO RECEIVER DRIVER
 M:	Hans Verkuil <hverkuil-qWit8jRvyhVmR6Xm/wNWPw@public.gmane.org>
 L:	linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH] tpm: fix format string error in tpm-chip.c
From: Jarkko Sakkinen @ 2015-01-20 10:03 UTC (permalink / raw)
  To: Peter Huewe, Ashley Lai, Marcel Selhorst
  Cc: tpmdd-devel, linux-kernel, josh, christophe.ricard,
	jason.gunthorpe, stefanb, linux-api, trousers-tech, jmorris,
	Jarkko Sakkinen

dev_set_name() takes three arguments where the second argument is
a format string. This patch fixes the call accordingly in tpm-chip.c

Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
 drivers/char/tpm/tpm-chip.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
index 6459af7..1d278cc 100644
--- a/drivers/char/tpm/tpm-chip.c
+++ b/drivers/char/tpm/tpm-chip.c
@@ -125,7 +125,7 @@ struct tpm_chip *tpmm_chip_alloc(struct device *dev,
 	else
 		chip->dev.devt = MKDEV(MAJOR(tpm_devt), chip->dev_num);
 
-	dev_set_name(&chip->dev, chip->devname);
+	dev_set_name(&chip->dev, "%s", chip->devname);
 
 	device_initialize(&chip->dev);
 
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Rasmus Villemoes @ 2015-01-20 10:37 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
	Peter Zijlstra, linux-fsdevel-u79uwXL29TbrhsbdSgBK9A
In-Reply-To: <1421747878-30744-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tue, Jan 20 2015, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> DESCRIPTION
>
>        The epoll_mod_wait() system call can be seen as an enhanced combination
>        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
>        call. It is superior in two cases:
>        
>        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
>        will save context switches between user mode and kernel mode;
>        
>        2) When you need higher precision than microsecond for wait timeout.

You probably want to say millisecond.

>            struct epoll_mod_cmd {
[...]
>            };


>            struct epoll_wait_spec {
[...]
>            } EPOLL_PACKED;

Either both or none of these should mention that EPOLL_PACKED is in fact
part of the actual definition. The changelog for 3/6 sorta mentions
that it's not really needed for epoll_mod_cmd. Why is it necessary for
either struct?

> RETURN VALUE
>
>        When successful, epoll_mod_wait() returns the number of file
>        descriptors ready for the requested I/O, or zero if no file descriptor
>        became ready during the requested timeout milliseconds.

And here, it doesn't make sense to mention a unit, since the new timeout
is given using struct timespec (this was the whole point, right?).

Rasmus

^ permalink raw reply

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Fam Zheng @ 2015-01-20 10:53 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
	Peter Zijlstra, linux-fsdevel-u79uwXL29TbrhsbdSgBK9A
In-Reply-To: <874mrl3fh9.fsf-qQsb+v5E8BnlAoU/VqSP6n9LOBIZ5rWg@public.gmane.org>

On Tue, 01/20 11:37, Rasmus Villemoes wrote:
> On Tue, Jan 20 2015, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > DESCRIPTION
> >
> >        The epoll_mod_wait() system call can be seen as an enhanced combination
> >        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> >        call. It is superior in two cases:
> >        
> >        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> >        will save context switches between user mode and kernel mode;
> >        
> >        2) When you need higher precision than microsecond for wait timeout.
> 
> You probably want to say millisecond.

Yes, you see that I just can't make this right. :)

> 
> >            struct epoll_mod_cmd {
> [...]
> >            };
> 
> 
> >            struct epoll_wait_spec {
> [...]
> >            } EPOLL_PACKED;
> 
> Either both or none of these should mention that EPOLL_PACKED is in fact
> part of the actual definition. The changelog for 3/6 sorta mentions
> that it's not really needed for epoll_mod_cmd. Why is it necessary for
> either struct?

Yeah. it's probably not really necessary.

> 
> > RETURN VALUE
> >
> >        When successful, epoll_mod_wait() returns the number of file
> >        descriptors ready for the requested I/O, or zero if no file descriptor
> >        became ready during the requested timeout milliseconds.
> 
> And here, it doesn't make sense to mention a unit, since the new timeout
> is given using struct timespec (this was the whole point, right?).

Right!

Thanks,
Fam

^ permalink raw reply

* Re: [PATCH v3 00/13] Add kdbus implementation
From: Johannes Stezenbach @ 2015-01-20 10:57 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: arnd, ebiederm, gnomes, teg, jkosina, luto, linux-api,
	linux-kernel, daniel, dh.herrmann, tixxdz
In-Reply-To: <20150120011359.GE865@kroah.com>

On Tue, Jan 20, 2015 at 09:13:59AM +0800, Greg Kroah-Hartman wrote:
> On Tue, Jan 20, 2015 at 12:38:12AM +0100, Johannes Stezenbach wrote:
> > Those automotive applications you
> > were talking about, what was the OS they were ported from
> > and what was the messaging API they used?
> 
> They were ported from QNX and I don't know the exact api, it is wrapped
> up in a library layer for them to use.  And typically, they run about
> 40 thousand messages in the first few seconds of startup time.  Or was
> it 400 thousand?  Something huge and crazy to be doing on tiny ARM
> chips, but that's the IVI industry for you :(

So I did some googling and found in QNX servers create a channel
to receive messages, and clients connect to this channel.
Multiple clients can connect to the channel.
But it is not a bus -- no multicast/broadcast, and no name
service or policy rules like D-Bus has.  To me it looks
to be similar in functionality to UNIX domain sockets.

My guess is that the people porting from QNX were just confused
and their use of D-Bus was in error.  Maybe they should've used
plain sockets, capnproto, ZeroMQ or whatever.


> > As I said before, I'm seeing about a dozen D-Bus messages per minute,
> > nothing that would justify adding kdbus to the kernel for
> > performance reasons.  Wrt security I'm also not aware of any
> > open issues with D-Bus.  Thus I doubt normal users of D-Bus
> > would see any benefit from kdbus.  I also think none of the
> > applications I can install from my distribution has any performance
> > issue with D-Bus.
> 
> That's because people have not done anything really needing performance
> on the desktop over D-Bus in the past due to how slow the current
> implementation is.  Now that this is being resolved, that can change,
> and there are demos out there of even streaming audio over kdbus with no
> problems.
> 
> But performance is not just the only reason we want this in the kernel,
> I listed a whole long range of them.  Sure, it's great to now be faster,
> cutting down the number of context switches and copies by a huge amount,
> but the other things are equally important for future development
> (namespaces, containers, security, early-boot, etc.)
> 
> > And this is the point where I ask myself if I missed something.
> 
> Don't focus purely on performance for your existing desktop system,
> that's not the only use case here.  There are lots of others, as I
> document, that can benefit and want this.
> 
> One "fun" thing I've been talking to someone about is the ability to
> even port binder to be on top of kdbus.  But that's just a research
> project, and requires some API changes on the userspace binder side, but
> it shows real promise, and would then mean that we could deprecate the
> old binder code and a few hundred million devices could then use kdbus
> instead.  But that's long-term goals, not really all that relevant here,
> but it shows that having a solid bus IPC mechanism is a powerful thing
> that we have been missing in the past from Linux.

Well, IMHO you got it backwards.  Before adding a complex new IPC
API to the kernel you should do the homework and gather some
evidence that there will be enough users to justify the addition.

But maybe I misunderstood the purpose of this thread and you're
just advertising it to find possible users instead of already
suggesting to merge it?  If someone has some convincing story
to share why kdbus would solve their IPC needs, I'm all ears.

(I'm sorry this implies your responses so far were not convincing:
not verifiable facts, no numbers, no testimonials etc.)

FWIW, my gut feeling was that the earlier attempts to add a new
IPC primitve like multicast UNIX domain sockets
http://thread.gmane.org/gmane.linux.kernel/1255575/focus=1257999
were a much saner approach.  But now I think the comments
from this old thread have not been addressed, instead the
new approach just made the thing more complex and
put in ipc/ instead of net/ to bypass the guards.


Thanks,
Johannes

^ permalink raw reply

* Re: [PATCH v3 00/13] Add kdbus implementation
From: Greg Kroah-Hartman @ 2015-01-20 11:26 UTC (permalink / raw)
  To: Johannes Stezenbach
  Cc: arnd-r2nGTMty4D4, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	daniel-cYrQPVfZoowdnm+yROfE0A, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <20150120105712.GA6260-FF7aIK3TAVNeoWH0uzbU5w@public.gmane.org>

On Tue, Jan 20, 2015 at 11:57:12AM +0100, Johannes Stezenbach wrote:
> On Tue, Jan 20, 2015 at 09:13:59AM +0800, Greg Kroah-Hartman wrote:
> > On Tue, Jan 20, 2015 at 12:38:12AM +0100, Johannes Stezenbach wrote:
> > > Those automotive applications you
> > > were talking about, what was the OS they were ported from
> > > and what was the messaging API they used?
> > 
> > They were ported from QNX and I don't know the exact api, it is wrapped
> > up in a library layer for them to use.  And typically, they run about
> > 40 thousand messages in the first few seconds of startup time.  Or was
> > it 400 thousand?  Something huge and crazy to be doing on tiny ARM
> > chips, but that's the IVI industry for you :(
> 
> So I did some googling and found in QNX servers create a channel
> to receive messages, and clients connect to this channel.
> Multiple clients can connect to the channel.

Hence, a bus :)

> But it is not a bus -- no multicast/broadcast, and no name
> service or policy rules like D-Bus has.  To me it looks
> to be similar in functionality to UNIX domain sockets.

It's not as complex as D-Bus, but it's still subscribing to things and
getting messages.

> My guess is that the people porting from QNX were just confused
> and their use of D-Bus was in error.  Maybe they should've used
> plain sockets, capnproto, ZeroMQ or whatever.

I tend to trust that they knew what they were doing, they wouldn't have
picked D-Bus for no good reason.

> Well, IMHO you got it backwards.  Before adding a complex new IPC
> API to the kernel you should do the homework and gather some
> evidence that there will be enough users to justify the addition.

systemd wants this today for early boot.  It will remove lots of code
and enable a lot of good things to happen.  The first email in this
thread describes this quite well, is that not sufficient?

> FWIW, my gut feeling was that the earlier attempts to add a new
> IPC primitve like multicast UNIX domain sockets
> http://thread.gmane.org/gmane.linux.kernel/1255575/focus=1257999
> were a much saner approach.  But now I think the comments
> from this old thread have not been addressed, instead the
> new approach just made the thing more complex and
> put in ipc/ instead of net/ to bypass the guards.

Not at all, the networking maintainers said that that proposal was not
acceptable to them at all and it should not be done in the networking
stack at all.  So this was solution was created instead, which provides
a lot more things than the old networking patches did, which shows that
the networking developers were right to reject it.

thanks,

greg k-h

^ permalink raw reply

* Re: Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
From: Masami Hiramatsu @ 2015-01-20 11:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, Linux API,
	Network Development, LKML, zhangwei(Jovi),
	yrl.pp-manager.tt@hitachi.com
In-Reply-To: <CAMEtUuwrpqRG4a=Hqnj3JBKuLbC4yV+trVAZhevKLbCsm_6U4Q@mail.gmail.com>

(2015/01/20 12:55), Alexei Starovoitov wrote:
> On Mon, Jan 19, 2015 at 6:58 PM, Masami Hiramatsu
> <masami.hiramatsu.pt@hitachi.com> wrote:
>>>
>>> it's done already... one can do the same skb->dev->name logic
>>> in kprobe attached program... so from bpf program point of view,
>>> tracepoints and kprobes feature-wise are exactly the same.
>>> Only input is different.
>>
>> No, I meant that the input should also be same, at least for the first step.
>> I guess it is easy to hook the ring buffer committing and fetch arguments
>> from the event entry.
> 
> No. That would be very slow. See my comment to Steven
> and more detailed numbers below.

Thank you for measuring the performance differences.
Indeed, the ring buffer looks slow.

> Allocating ring buffer takes too much time.
> 
>> And what I expected scenario was
>>
>> 1. setup kprobe traceevent with fd, buf, count by using perf-probe.
>> 2. load bpf module
>> 3. the module processes given event arguments.
> 
> from ring buffer? that's too slow.

Ok, BTW, would you think is it possible to use a reusable small scratchpad
memory for passing arguments? (just a thought)

> It's not usable for high frequency events which
> need this in-kernel aggregation.
> If events are rare, then just dumping everything
> into trace buffer is just fine. No in-kernel program is needed.

Hmm, let me ensure your point, the performance number is the reason why
we need to do it in the kernel, right? Not mainly for the flexibility but speed.

>> Hmm, it sounds making another systemtap on top of tracepoint and kprobes.
>> Why don't you just reuse the existing facilities (perftools and ftrace)
>> instead of co-exist?
> 
> hmm. I don't think we're on the same page yet...
> ring buffer and tracing interface is fully reused.
> programs are run as soon as event triggers.
> They can return non-zero and kernel will allocate ring
> buffer which user space will consume.
> Please take a look at tracex1

I see, this code itself is not a destructive change.

>>> Just look how ktap scripts look alike for kprobes and tracepoints.
>>
>> Ktap is a good example, it provides only a language parser and a runtime engine.
>> Actually, currently it lacks a feature to execute "perf-probe" helper from
>> script, but it is easy to add such feature.
> ...
>> For this usecase, I've made --output option for perf probe
>> https://lkml.org/lkml/2014/10/31/210
> 
> you're proposing to call perf binary from ktap binary?

Yes, that's right :)

> I think packaging headaches and error conditions
> will make such approach very hard to use.

No, I don't think so. perf can be a "buffer" from the kernel API
and command-line API. If you need to get clearer error, you also
can join the upstream development.

> it would be much cleaner to have ktap as part of perf
> generating bpf on the fly and feeding into kernel.
> 'perf probe' parsing and functions don't belong in kernel
> when userspace can generate them in more efficient way.

No, perf probe still be needed to users who don't choose "injecting
binary blob" tracing. Efficiency is NOT only one index.

- perf probe and kprobe-event gives us a complete understandable
 interface for what will be recorded at where.
 (we can see the event definitions via kprobe_events interface,
  without any tools)
- kprobe-event gives a completely same interface as other tracepoint
  events.
- it also doesn't require any build-binary parts :) nor special tools.
  We can play with ftrace on just a small busybox.

However, this does NOT interfere your patch upstreaming. I just said current
ftrace method is also meaningful for some reasons :)


By the way, I concern about that bpf compiler can become another systemtap,
especially if you build it on llvm. Would you plan to develop it on kernel
tree? or apart from the kernel-side development?
I think it is hard to sync the development if you do it out-of-tree.


Thank you,



-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply

* Re: [PATCH tip 4/9] samples: bpf: simple tracing example in eBPF assembler
From: Masami Hiramatsu @ 2015-01-20 11:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, linux-api,
	netdev, linux-kernel, yrl.pp-manager.tt@hitachi.com
In-Reply-To: <1421381770-4866-5-git-send-email-ast@plumgrid.com>

(2015/01/16 13:16), Alexei Starovoitov wrote:
> simple packet drop monitor:
> - in-kernel eBPF program attaches to kfree_skb() event and records number
>   of packet drops at given location
> - userspace iterates over the map every second and prints stats

Hmm, this eBPF assembly macros are very interesting. Maybe I can
replace current kprobe's argument "fetching methods" with this.

Thank you,

> 
> Usage:
> $ sudo dropmon
> location 0xffffffff81695995 count 1
> location 0xffffffff816d0da9 count 2
> 
> location 0xffffffff81695995 count 2
> location 0xffffffff816d0da9 count 2
> 
> location 0xffffffff81695995 count 3
> location 0xffffffff816d0da9 count 2
> 
> $ addr2line -ape ./bld_x64/vmlinux 0xffffffff81695995 0xffffffff816d0da9
> 0xffffffff81695995: ./bld_x64/../net/ipv4/icmp.c:1038
> 0xffffffff816d0da9: ./bld_x64/../net/unix/af_unix.c:1231
> 
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  samples/bpf/Makefile  |    2 +
>  samples/bpf/dropmon.c |  129 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 131 insertions(+)
>  create mode 100644 samples/bpf/dropmon.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index b5b3600dcdf5..789691374562 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -6,7 +6,9 @@ hostprogs-y := test_verifier test_maps
>  hostprogs-y += sock_example
>  hostprogs-y += sockex1
>  hostprogs-y += sockex2
> +hostprogs-y += dropmon
>  
> +dropmon-objs := dropmon.o libbpf.o
>  test_verifier-objs := test_verifier.o libbpf.o
>  test_maps-objs := test_maps.o libbpf.o
>  sock_example-objs := sock_example.o libbpf.o
> diff --git a/samples/bpf/dropmon.c b/samples/bpf/dropmon.c
> new file mode 100644
> index 000000000000..9a2cd3344d69
> --- /dev/null
> +++ b/samples/bpf/dropmon.c
> @@ -0,0 +1,129 @@
> +/* simple packet drop monitor:
> + * - in-kernel eBPF program attaches to kfree_skb() event and records number
> + *   of packet drops at given location
> + * - userspace iterates over the map every second and prints stats
> + */
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <linux/bpf.h>
> +#include <errno.h>
> +#include <linux/unistd.h>
> +#include <string.h>
> +#include <linux/filter.h>
> +#include <stdlib.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <stdbool.h>
> +#include "libbpf.h"
> +
> +#define TRACEPOINT "/sys/kernel/debug/tracing/events/skb/kfree_skb/"
> +
> +static int write_to_file(const char *file, const char *str, bool keep_open)
> +{
> +	int fd, err;
> +
> +	fd = open(file, O_WRONLY);
> +	err = write(fd, str, strlen(str));
> +	(void) err;
> +
> +	if (keep_open) {
> +		return fd;
> +	} else {
> +		close(fd);
> +		return -1;
> +	}
> +}
> +
> +static int dropmon(void)
> +{
> +	long long key, next_key, value = 0;
> +	int prog_fd, map_fd, i;
> +	char fmt[32];
> +
> +	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1024);
> +	if (map_fd < 0) {
> +		printf("failed to create map '%s'\n", strerror(errno));
> +		goto cleanup;
> +	}
> +
> +	/* the following eBPF program is equivalent to C:
> +	 * int filter(struct bpf_context *ctx)
> +	 * {
> +	 *   long loc = ctx->arg2;
> +	 *   long init_val = 1;
> +	 *   long *value;
> +	 *
> +	 *   value = bpf_map_lookup_elem(MAP_ID, &loc);
> +	 *   if (value) {
> +	 *      __sync_fetch_and_add(value, 1);
> +	 *   } else {
> +	 *      bpf_map_update_elem(MAP_ID, &loc, &init_val, BPF_ANY);
> +	 *   }
> +	 *   return 0;
> +	 * }
> +	 */
> +	struct bpf_insn prog[] = {
> +		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 *)(r1 + 8) */
> +		BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp - 8) = r2 */
> +		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
> +		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
> +		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
> +		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
> +		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
> +		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
> +		BPF_EXIT_INSN(),
> +		BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 1 */
> +		BPF_MOV64_IMM(BPF_REG_4, BPF_ANY),
> +		BPF_MOV64_REG(BPF_REG_3, BPF_REG_10),
> +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
> +		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
> +		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
> +		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
> +		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
> +		BPF_EXIT_INSN(),
> +	};
> +
> +	prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog,
> +				sizeof(prog), "GPL");
> +	if (prog_fd < 0) {
> +		printf("failed to load prog '%s'\n%s", strerror(errno), bpf_log_buf);
> +		return -1;
> +	}
> +
> +	sprintf(fmt, "bpf_%d", prog_fd);
> +
> +	write_to_file(TRACEPOINT "filter", fmt, true);
> +
> +	for (i = 0; i < 10; i++) {
> +		key = 0;
> +		while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
> +			bpf_lookup_elem(map_fd, &next_key, &value);
> +			printf("location 0x%llx count %lld\n", next_key, value);
> +			key = next_key;
> +		}
> +		if (key)
> +			printf("\n");
> +		sleep(1);
> +	}
> +
> +cleanup:
> +	/* maps, programs, tracepoint filters will auto cleanup on process exit */
> +
> +	return 0;
> +}
> +
> +int main(void)
> +{
> +	FILE *f;
> +
> +	/* start ping in the background to get some kfree_skb events */
> +	f = popen("ping -c5 localhost", "r");
> +	(void) f;
> +
> +	dropmon();
> +	return 0;
> +}
> 


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Michael Kerrisk (man-pages) @ 2015-01-20 12:48 UTC (permalink / raw)
  To: Fam Zheng, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra
In-Reply-To: <1421747878-30744-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Hello Fam Zheng,

I know this API has been through a number of iterations, and there were 
discussions about the design that led to it becoming more complex.
But, let us assume that someone has not seen those discussions,
or forgotten them, or is too lazy to go hunting list archives.

Then: this patch series should somewhere have an explanation of
why the API is what it is, ideally with links to previous relevant
discussions. I see that you do part of that in

    [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait

There are however no links to previous discussions in that mail (I guess
http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591 is most
relevant, nor is there any sort of change log in the commit message 
that explains the evolution of the API. Having those would ease the 
task of reviewers.

Coming back to THIS mail, this man page should also include an
explanation of why the API is what it is. That would include much
of the detail from the 5/6 patch, and probably more info besides.

Some specific points below.

On 01/20/2015 10:57 AM, Fam Zheng wrote:
> This adds a new system call, epoll_mod_wait. It's described as below:
> 
> NAME
>        epoll_mod_wait - modify and wait for I/O events on an epoll file
>                         descriptor
> 
> SYNOPSIS
> 
>        int epoll_mod_wait(int epfd, int flags,
>                           int ncmds, struct epoll_mod_cmd *cmds,
>                           struct epoll_wait_spec *spec);
> 
> DESCRIPTION
> 
>        The epoll_mod_wait() system call can be seen as an enhanced combination
>        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
>        call. It is superior in two cases:
>        
>        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
>        will save context switches between user mode and kernel mode;
>
>        2) When you need higher precision than microsecond for wait timeout.

s/microsecond/millisecond/

> 
>        The epoll_ctl(2) operations are embedded into this call by with ncmds
>        and cmds. The latter is an array of command structs:
> 
>            struct epoll_mod_cmd {
> 
>                   /* Reserved flags for future extension, must be 0 for now. */
>                   int flags;
> 
>                   /* The same as epoll_ctl() op parameter. */
>                   int op;
> 
>                   /* The same as epoll_ctl() fd parameter. */
>                   int fd;
> 
>                   /* The same as the "events" field in struct epoll_event. */
>                   uint32_t events;
> 
>                   /* The same as the "data" field in struct epoll_event. */
>                   uint64_t data;
> 
>                   /* Output field, will be set to the return code once this
>                    * command is executed by kernel */
>                   int error;
>            };
>        
>        There is no guartantee that all the commands are executed in order. Only

s/guartantee/guarantee/

I think the word "all" is not needed in this sentence.

Why is there no guarantee that the commands are run in order?
The order matters if there are operations on the same fd.

>        if all the commands are successfully executed (all the error fields are
>        set to 0), events are polled.

Does the operation execute all commands, or stop when it encounters the first 
error? In other words, when looping over the returned 'error' fields, what
is the termination condition for the user-space application?

(Yes, I know I can trivially inspect the patch 5/6 to answer this question, 
but the man page should explicitly state this so that I don't have to 
read the source, and also because it is only if you explicitly document 
the intended behavior that I can tell whether the actual implementation 
matches the intention.)

>        The last parameter "spec" is a pointer to struct epoll_wait_spec, which
>        contains the information about how to poll the events. If it's NULL, this
>        call will immediately return after running all the commands in cmds.
> 
>        The structure is defined as below:
> 
>            struct epoll_wait_spec {
> 
>                   /* The same as "maxevents" in epoll_pwait() */
>                   int maxevents;
> 
>                   /* The same as "events" in epoll_pwait() */
>                   struct epoll_event *events;
> 
>                   /* Which clock to use for timeout */
>                   int clockid;

Which clocks can be specified here?
CLOCK_MONOTONIC?
CLOCK_REALTIME?
CLOCK_PROCESS_CPUTIME_ID?
clock_getcpuclockid()?
others?

>                   /* Maximum time to wait if there is no event */
>                   struct timespec timeout;

Is this timeout relative or absolute?

>                   /* The same as "sigmask" in epoll_pwait() */
>                   sigset_t *sigmask;

I just want to confirm here that 'sigmask' can be NULL, meaning
that we degenerate to epoll_wait() functionality, right?

>                   /* The same as "sigsetsize" in epoll_pwait() */
>                   size_t sigsetsize;
>            } EPOLL_PACKED;

What is the "EPOLL_PACKED" here for?

> RETURN VALUE
> 
>        When any error occurs, epoll_mod_wait() returns -1 and errno is set
>        appropriately. All the "error" fields in cmds are unchanged before they
>        are executed, and if any cmds are executed, the "error" fields are set
>        to a return code accordingly. See also epoll_ctl for more details of the
>        return code.
> 
>        When successful, epoll_mod_wait() returns the number of file
>        descriptors ready for the requested I/O, or zero if no file descriptor
>        became ready during the requested timeout milliseconds.

s/milliseconds//

> 
>        If spec is NULL, it returns 0 if all the commands are successful, and -1
>        if an error occured.

s/occured/occurred/

> ERRORS
> 
>        These errors apply on either the return value of epoll_mod_wait or error
>        status for each command, respectively.
>
>        EBADF  epfd or fd is not a valid file descriptor.
> 
>        EFAULT The memory area pointed to by events is not accessible with write
>               permissions.
> 
>        EINTR  The call was interrupted by a signal handler before either any of
>               the requested events occurred or the timeout expired; see
>               signal(7).
> 
>        EINVAL epfd is not an epoll file descriptor, or maxevents is less than
>               or equal to zero, or fd is the same as epfd, or the requested
>               operation op is not supported by this interface.

Add: Or 'flags' is nonzero. Or a 'cmds.flags' field is nonzero.

>        EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
>               already registered with this epoll instance.
> 
>        ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
>               with this epoll instance.
> 
>        ENOMEM There was insufficient memory to handle the requested op control
>               operation.
> 
>        ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
>               encountered while trying to register (EPOLL_CTL_ADD) a new file
>               descriptor on an epoll instance.  See epoll(7) for further
>               details.
> 
>        EPERM  The target file fd does not support epoll.
> 
> CONFORMING TO
> 
>        epoll_mod_wait() is Linux-specific.
> 
> SEE ALSO
> 
>        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

Please add sigprocmask(2).

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Michael Kerrisk (man-pages) @ 2015-01-20 12:50 UTC (permalink / raw)
  To: Fam Zheng, linux-kernel
  Cc: mtk.manpages, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
	Peter Zijlstra
In-Reply-To: <1421747878-30744-6-git-send-email-famz@redhat.com>

Hello Fam Zheng,

On 01/20/2015 10:57 AM, Fam Zheng wrote:
> This syscall is a sequence of
> 
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
> 
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications. 

Which applications? Could we have some specific examples? This is a 
complex API, and it needs good justification.

> For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
> 
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
         ^^^^^^^^^^^^^^ should be epoll_mod_wait

I think you mean to say:

    The ability to batch multiple "epoll_ctl" operations into a single call
    means that even when no wait events are requested (i.e., spec == NULL),
    poll_mod_wait() provides a performance optimization over using multiple
    epoll_ctl() calls.

Right? If yes, please amend the commit message, and this text should
also make its way into the revised man page under a heading "NOTES".

> The only complexity is returning the result of each operation.  For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
> 
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait. Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.

The "cmds' are not executed in a specified order plus the need to
initialize the 'errors' fields to a positive value feels a bit ugly.
And indeed the whole "command list was only partially run" case
is not pretty. Am I correct to understand that if an error is found
during execution of one of the "epoll_ctl" commands in 'cmds' then
the system call will return -1 with errno set, indicating an error,
even though the epoll interest list may have changed because some
of the earlier 'cmds' executed successfully? This all seems a bit of
a headache for user space.

I have a couple of questions:

Q1. I can see that batching "epoll_ctl" commands might be useful,
since it results in fewer systems calls. But, does it really
need to be bound together with the "epoll_pwait" functionality?
(Perhaps this point was covered in previous discussions, but
neither the message accompanying this patch nor the 0/6 man page
provide a compelling rationale for the need to bind these two
operations together.)

Yes, I realize you might save a system call, but it makes for a
cumbersome API that has the above headache, and also forces the 
need for double pointer indirection in the 'spec' argument (i.e., 
spec is a pointer to an array of structures where each element
in turn includes an 'events' pointer that points to another array).

Why not a simpler API with two syscalls such as:

epoll_ctl_batch(int epfd, int flags,
                int ncmds, struct epoll_mod_cmd *cmds);

epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
             struct timespec *timeout, int clock_id, 
             const sigset_t *sigmask, size_t sigsetsize);

This gives us much of the benefit of reducing system calls, but 
with greater simplicity. And epoll_ctl_batch() could simply return
the number of 'cmds' that were successfully executed.)

Q2. In the man page in 0/6 you said that the 'cmds' were not 
guaranteed to be executed in order. Why not? If you did provide
such a guarantee, then, when using your current epoll_mod_wait(),
user space could do the following:

1. Initialize the cmd.errors fields to zero.
2. Call epoll_ctl_mod()
3. Iterate through cmd.errors looking for the first nonzero 
   field.

> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision. 

Yes, that change seemed inevitable. It slightly puzzled me at the time when
Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
though pselect() already had demonstrated the need for higher precision.
I should have called it out way back then :-{.

> The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  			      sigmask ? &ksigmask : NULL);
>  }
>  
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;

To make the 'ret' change a little more obvious, maybe it's better to write

			ret = kcmds[i].error = -EINVAL;

> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

Likewise:
		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))

Cosmetic point: s/if(/if (/

> +			return -EFAULT;
> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);

If I understand correctly, the implementation means that the
'size_t sigsetsize' field will probably need to be exposed to 
applications. In the existing epoll_pwait() call (as in  ppoll()
and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
However, unless we expect glibc to do some structure copying to/from
a structure that hides this field, then we're going end up exposing
'size_t sigsetsize' to applications. (This could be avoided, if we
split the API as I suggest above. glibc would do the same thing 
in epoll_pwait1() that it currently does in epoll_pwait().)

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-20 12:54 UTC (permalink / raw)
  To: Daniel Mack, Florian Weimer, David Herrmann, Greg Kroah-Hartman
  Cc: mtk.manpages, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
	Linux API, linux-kernel, Djalal Harouni
In-Reply-To: <54BE1116.5060204@zonque.org>

On 01/20/2015 09:25 AM, Daniel Mack wrote:
> Hi Michael,
> 
> On 01/20/2015 09:09 AM, Michael Kerrisk (man-pages) wrote:
>> On 11/30/2014 06:23 PM, Florian Weimer wrote:
>>> * David Herrmann:
>>>
>>>> On Sun, Nov 30, 2014 at 10:02 AM, Florian Weimer <fw@deneb.enyo.de> wrote:
>>>>> * Greg Kroah-Hartman:
>>>>>
>>>>>> +7.4 Receiving messages
>>>
>>>>> What happens if this is not possible because the file descriptor limit
>>>>> of the processes would be exceeded?  EMFILE, and the message will not
>>>>> be received?
>>>>
>>>> The message is returned without installing the FDs. This is signaled
>>>> by EMFILE, but a valid pool offset.
>>>
>>> Oh.  This is really surprising, so it needs documentation.  But it's
>>> probably better than the alternative (return EMFILE and leave the
>>> message stuck, so that you receive it immediately again—this behavior
>>> makes non-blocking accept rather difficult to use correctly).
>>
>> So, was this point in the end explicitly documented? I not
>> obvious that it is documented in the revised kdbus.txt that
>> Greg K-H sent out 4 days ago.
> 
> No, we've revisited this point and changed the kernel behavior again in
> v3. We're no longer returning -EMFILE in this case, but rather set
> KDBUS_RECV_RETURN_INCOMPLETE_FDS in a new field in the receive ioctl
> struct called 'return_flags'. We believe that's a nicer way of signaling
> specific errors. The message will carry -1 for all FDs that failed to
> get installed, so the user can actually see which one is missing.
> 
> That's also documented in kdbus.txt, but we missed putting it into the
> Changelog - sorry for that.

Thanks for the info, Daniel.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH v3 00/13] Add kdbus implementation
From: Johannes Stezenbach @ 2015-01-20 13:24 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: arnd-r2nGTMty4D4, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	daniel-cYrQPVfZoowdnm+yROfE0A, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <20150120112609.GA17198-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

On Tue, Jan 20, 2015 at 07:26:09PM +0800, Greg Kroah-Hartman wrote:
> On Tue, Jan 20, 2015 at 11:57:12AM +0100, Johannes Stezenbach wrote:
> > 
> > So I did some googling and found in QNX servers create a channel
> > to receive messages, and clients connect to this channel.
> > Multiple clients can connect to the channel.
> 
> Hence, a bus :)
> 
> > But it is not a bus -- no multicast/broadcast, and no name
> > service or policy rules like D-Bus has.  To me it looks
> > to be similar in functionality to UNIX domain sockets.
> 
> It's not as complex as D-Bus, but it's still subscribing to things and
> getting messages.

Apparently you don't read what I write, probably you're not interested
in this discussion anymore...
QNX uses the term "channel" but it does not refer to a bus
or subscription facility, it is more like a socket in listening state.

> > My guess is that the people porting from QNX were just confused
> > and their use of D-Bus was in error.  Maybe they should've used
> > plain sockets, capnproto, ZeroMQ or whatever.
> 
> I tend to trust that they knew what they were doing, they wouldn't have
> picked D-Bus for no good reason.

The automotive developers I had the pleasure to work with would
use anything which is available via a mouse click in the
commercial Embedded Linux SDK IDE of their choice :)
Let's face it: QNX has a single IPC solution while Linux has
a confusing multitude of possibilities.

> > Well, IMHO you got it backwards.  Before adding a complex new IPC
> > API to the kernel you should do the homework and gather some
> > evidence that there will be enough users to justify the addition.
> 
> systemd wants this today for early boot.  It will remove lots of code
> and enable a lot of good things to happen.  The first email in this
> thread describes this quite well, is that not sufficient?

The first mail in this thread doesn't even mention systemd,
instead it has a lot of "marketing" buzzwords.
Of course it is no secret that systemd is the driving force
behind kdbus, but no public record exists to explain why
kdbus was chosen and designed the way it is, what alternatives
were considered and rejected etc.  (or if there is, please send a link)

> > FWIW, my gut feeling was that the earlier attempts to add a new
> > IPC primitve like multicast UNIX domain sockets
> > http://thread.gmane.org/gmane.linux.kernel/1255575/focus=1257999
> > were a much saner approach.  But now I think the comments
> > from this old thread have not been addressed, instead the
> > new approach just made the thing more complex and
> > put in ipc/ instead of net/ to bypass the guards.
> 
> Not at all, the networking maintainers said that that proposal was not
> acceptable to them at all and it should not be done in the networking
> stack at all.  So this was solution was created instead, which provides
> a lot more things than the old networking patches did, which shows that
> the networking developers were right to reject it.

Please read the gmane thread to the end.  It seems there were
several indications that D-Bus can be improved in userspace
using existing kernel facilities.  Havoc Pennington's mail
I quoted in my first response also contains some hints
about it.  I have no idea if any of this has ever been
pursued.  While adding complexity to critical net/ code paths
is probematic and a good reason to reject it, this was not
the only reason, the major one being "not neccessary".


Thanks,
Johannes

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-20 13:53 UTC (permalink / raw)
  To: Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	daniel-cYrQPVfZooxQFI55V6+gNQ, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A, Daniel Mack, Johannes Stezenbach
In-Reply-To: <1421435777-25306-2-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> 
> kdbus is a system for low-latency, low-overhead, easy to use
> interprocess communication (IPC).
> 
> The interface to all functions in this driver is implemented via ioctls
> on files exposed through a filesystem called 'kdbusfs'. The default
> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
> documentation about the kernel level API design.

I have some details feedback on the contents of this file, and some 
bigger questions. I'll split them out into separate mails.

So here, the bigger, general questions to start with. I've arrived late 
to this, so sorry if they've already been discussed, but the answers to 
some of the questions should actually be in this file, I would have 
expected.

This is an enormous and complex API. Why is the API ioctl() based,
rather than system-call-based? Have we learned nothing from the hydra
that the futex() multiplexing syscall became? (And kdbus is an order
of magnitude more complex, by the look of things.) At the very least,
a *good* justification of why the API is ioctl()-based should be part
of this documentation file.

An observation: The documentation below is substantial, but this API is 
enormous, so the documentation still feels rather thin. What would 
really help would be some example code in the doc. 

And on the subject of code examples... Are there any (prototype) 
working user-space applications that exercise the current kdbus 
implementation? That is, can I install these kdbus patches, and
then find a simple example application somewhere that does
something to exercise kdbus?

And then: is there any substantial real-world application (e.g., a 
full D-Bus port) that is being developed in tandem with this kernel
side patch? (I don't mean a user-space library; I mean a seriously
large application.) This is an incredibly complex API whose
failings are only going to become evident through real-world use.
Solidifying an API in the kernel and then discovering the API
problems later when writing real-world applications would make for
a sad story. A story something like that of inotify, an API which 
is an order of magnitude less complex than kdbus. (I can't help but
feel that many of inotify problems that I discuss at 
https://lwn.net/Articles/605128/ might have been fixed or mitigated 
if a few real-world applications had been implemented before the
API  was set in stone.)

> +For a kdbus specific userspace library implementation please refer to:
> +  http://cgit.freedesktop.org/systemd/systemd/tree/src/systemd/sd-bus.h

Is this library intended just for systemd? More generally, is there an 
intention to provide a general purpose library API for kdbus? Or is the
intention that each application will roll a library suitable to its
needs? I think an answer to that question would be useful in this 
Documentation file.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-20 13:58 UTC (permalink / raw)
  To: Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	daniel-cYrQPVfZooxQFI55V6+gNQ, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A, Daniel Mack
In-Reply-To: <1421435777-25306-2-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> 
> kdbus is a system for low-latency, low-overhead, easy to use
> interprocess communication (IPC).
> 
> The interface to all functions in this driver is implemented via ioctls
> on files exposed through a filesystem called 'kdbusfs'. The default
> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
> documentation about the kernel level API design.

And now the details feedback.

Please note that for the various points I raise below, even in
cases where I don't suggest/request a fix, the fact that I've
needed to answer a question probably suggests a deficiency in 
the documentation that probably needs to be remedied.

Many of my comments below are wording and typo fixes. While
these may sem trivial, the existence of various wording problems
and typos is a significant distraction, especially while trying
to grok a document of this size.

I've marked one or two notable questions about the API with "###".

> Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> Signed-off-by: David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Djalal Harouni <tixxdz-Umm1ozX2/EEdnm+yROfE0A@public.gmane.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>
> ---
>  Documentation/kdbus.txt | 2107 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 2107 insertions(+)
>  create mode 100644 Documentation/kdbus.txt
> 
> diff --git a/Documentation/kdbus.txt b/Documentation/kdbus.txt
> new file mode 100644
> index 000000000000..2592a7e37079
> --- /dev/null
> +++ b/Documentation/kdbus.txt
> @@ -0,0 +1,2107 @@
> +D-Bus is a system for powerful, easy to use interprocess communication (IPC).
> +
> +The focus of this document is an overview of the low-level, native kernel D-Bus
> +transport called kdbus. Kdbus exposes its functionality via files in a
> +filesystem called 'kdbusfs'. All communication between processes takes place
> +via ioctls on files exposed through the mount point of a kdbusfs. The default
> +mount point of kdbusfs is /sys/fs/kdbus.
> +
> +Please note that kdbus was designed as transport layer for D-Bus, but is in no
> +way limited, nor controlled by the D-Bus protocol specification. The D-Bus
> +protocol is one possible application layer on top of kdbus.
> +
> +For the general D-Bus protocol specification, the payload format, the
> +marshaling, and the communication semantics, please refer to:
> +  http://dbus.freedesktop.org/doc/dbus-specification.html
> +
> +For a kdbus specific userspace library implementation please refer to:
> +  http://cgit.freedesktop.org/systemd/systemd/tree/src/systemd/sd-bus.h
> +
> +Articles about D-Bus and kdbus:
> +  http://lwn.net/Articles/580194/
> +
> +
> +1. Terminology
> +===============================================================================
> +
> +  Domain:
> +    A domain is created each time a kdbusfs is mounted. Each process that is
> +    capable to mount a new instance of a kdbusfs will have its own kdbus

s/is capable to mount/mounts/

> +    hierarchy. Each domain (ie, each mount point) offers its own "control"
> +    file to create new buses. Domains have no connection to each other and
> +    cannot see nor talk to each other. See section 5 for more details.

Smoother would be:

    s/cannot see nor talk/can neither see nor talk/

> +
> +  Bus:
> +    A bus is a named object inside a domain. Clients exchange messages
> +    over a bus. Multiple buses themselves have no connection to each other;
> +    messages can only be exchanged on the same bus. The default endpoint of
> +    a bus, where clients establish the connection to, is the "bus" file

Maybe:

where clients establish the connection to
==>
to which clients establish connections
?

> +    /sys/fs/kdbus/<bus name>/bus.
> +    Common operating system setups create one "system bus" per system, and one
> +    "user bus" for every logged-in user. Applications or services may create

At the kdbus level is there any difference between such system and user buses?
If not, it would perhaps be good to insert a parenthetical aside to say 
that.

> +    their own private buses.  See section 5 for more details.
> +
> +  Endpoint:
> +    An endpoint provides a file to talk to a bus. Opening an endpoint
> +    creates a new connection to the bus to which the endpoint belongs. All
> +    endpoints have unique names and are accessible as files underneath the
> +    directory of a bus, e.g., /sys/fs/kdbus/<bus>/<endpoint>
> +    Every bus has a default endpoint called "bus". A bus can optionally offer
> +    additional endpoints with custom names to provide restricted access to the
> +    bus. Custom endpoints carry additional policy which can be used to create
> +    sandboxes with locked-down, limited, filtered access to a bus.  See
> +    section 5 for more details.
> +
> +  Connection:
> +    A connection to a bus is created by opening an endpoint file of a bus and
> +    becoming an active client with the HELLO exchange. Every ordinary client
> +    connection has a unique identifier on the bus and can address messages to
> +    every other connection on the same bus by using the peer's connection id
> +    as the destination.  See section 6 for more details.
> +
> +  Pool:
> +    Each connection allocates a piece of shmem-backed memory that is used
> +    to receive messages and answers to ioctl commands from the kernel. It is
> +    never used to send anything to the kernel. In order to access that memory,
> +    userspace must mmap() it into its address space.

s/userspace/a userspace application/

> +    See section 12 for more details.
> +
> +  Well-known Name:
> +    A connection can, in addition to its implicit unique connection id, request
> +    the ownership of a textual well-known name. Well-known names are noted in
> +    reverse-domain notation, such as com.example.service1. Connections offering
> +    a service on a bus are usually reached by its well-known name. The analogy

Noun/pronoun number disagreement at "Connections... its".
==>
    A connection that offers a service on a bus is usually reached by its 
    well-known name.

> +    of connection id and well-known name is an IP address and a DNS name

Doing s/id/ID/ throughout the doc would help readability. (The doc 
already uses "ID" in some places, so consistency is a further argument 
for this change.)

> +    associated with that address.
> +
> +  Message:
> +    Connections can exchange messages with other connections by addressing
> +    the peers with their connection id or well-known name. A message consists
> +    of a message header with kernel-specific information on how to route the

What does "kernel-specific" mean here? Something needs explaining (or removing).

> +    message, and the message payload, which is a logical byte stream of
> +    arbitrary size. Messages can carry additional file descriptors to be passed

So, this is FD passing like UNIX domain sockets? If yes, it would be helpful
here to mention that analogy.

> +    from one connection to another. Every connection can specify which set of
> +    metadata the kernel should attach to the message when it is delivered
> +    to the receiving connection. Metadata contains information like: system
> +    timestamps, uid, gid, tid, proc-starttime, well-known-names, process comm,

s/well-known-names/well-known names/

See the note on "ID" above. I think UID, GID, TID throughout would help 
readability.

> +    process exe, process argv, cgroup, capabilities, seclabel, audit session,
> +    loginuid and the connection's human-readable name.
> +    See section 7 and 13 for more details.
> +
> +  Item:
> +    The API of kdbus implements a notion of items, submitted through and

s/a notion/the notion/

> +    returned by most ioctls, and stored inside data structures in the
> +    connection's pool. See section 4 for more details.
> +
> +  Broadcast and Match:
> +    Broadcast messages are potentially sent to all connections of a bus. By
> +    default, the connections will not actually receive any of the sent
> +    broadcast messages; only after installing a match for specific message
> +    properties, a broadcast message passes this filter.

"Filter" suddenly gets mentioned here without previously being defined. I suspect 
the last piece should read more like:

  a connection will receive a broadcast message only after it installs a filter
  that matches the specific message properties of the broadcast message.

> +    See section 10 for more details.
> +
> +  Policy:
> +    A policy is a set of rules that define which connections can see, talk to,
> +    or register a well-know name on the bus. A policy is attached to buses and

s/know/know/

> +    custom endpoints, and modified by policy holder connections or owners of
> +    custom endpoints. See section 11 for more details.
> +    See section 11 for more details.

Repeated last sentence.

> +  Privileged bus users:
> +    A user connecting to the bus is considered privileged if it is either the
> +    creator of the bus, or if it has the CAP_IPC_OWNER capability flag set.
> +
> +
> +2. Control Files Layout
> +===============================================================================
> +
> +The kdbus interface is exposed through files in its kdbusfs mount point
> +(defaults to /sys/fs/kdbus):
> +
> +  /sys/fs/kdbus                 (mount point of kdbusfs)
> +  |-- control                   (domain control-file)
> +  |-- 0-system                  (bus of user uid=0)
> +  |   |-- bus                   (default endpoint of bus '0-system')
> +  |   `-- ep.apache             (custom endpoint of bus '0-system')
> +  |-- 1000-user                 (bus of user uid=1000)
> +  |   `-- bus                   (default endpoint of bus '1000-user')
> +  `-- 2702-user                 (bus of user uid=2702)
> +      |-- bus                   (default endpoint of bus '2702-user')
> +      `-- ep.app                (custom endpoint of bus '2702-user')
> +
> +
> +3. Data Structures and flags
> +===============================================================================
> +
> +3.1 Data structures and interconnections
> +----------------------------------------
> +
> +  +--------------------------------------------------------------------------+
> +  | Domain (Mount Point)                                                     |
> +  | /sys/fs/kdbus/control                                                    |
> +  | +----------------------------------------------------------------------+ |
> +  | | Bus (System Bus)                                                     | |
> +  | | /sys/fs/kdbus/0-system/                                              | |
> +  | | +-------------------------------+ +--------------------------------+ | |
> +  | | | Endpoint                      | | Endpoint                       | | |
> +  | | | /sys/fs/kdbus/0-system/bus    | | /sys/fs/kdbus/0-system/ep.app  | | |
> +  | | +-------------------------------+ +--------------------------------+ | |
> +  | | +--------------+ +--------------+ +--------------+ +---------------+ | |
> +  | | | Connection   | | Connection   | | Connection   | | Connection    | | |
> +  | | | :1.22        | | :1.25        | | :1.55        | | :1.81         | | |
> +  | | +--------------+ +--------------+ +--------------+ +---------------+ | |
> +  | +----------------------------------------------------------------------+ |
> +  |                                                                          |
> +  | +----------------------------------------------------------------------+ |
> +  | | Bus (User Bus for UID 2702)                                          | |
> +  | | /sys/fs/kdbus/2702-user/                                             | |
> +  | | +-------------------------------+ +--------------------------------+ | |
> +  | | | Endpoint                      | | Endpoint                       | | |
> +  | | | /sys/fs/kdbus/2702-user/bus   | | /sys/fs/kdbus/2702-user/ep.app | | |
> +  | | +-------------------------------+ +--------------------------------+ | |
> +  | | +--------------+ +--------------+ +--------------+ +---------------+ | |
> +  | | | Connection   | | Connection   | | Connection   | | Connection    | | |
> +  | | | :1.22        | | :1.25        | | :1.55        | | :1.81         | | |
> +  | | +--------------+ +--------------+ +--------------------------------+ | |
> +  | +----------------------------------------------------------------------+ |
> +  +--------------------------------------------------------------------------+
> +
> +The above description uses the D-Bus notation of unique connection names that
> +adds a ":1." prefix to the connection's unique ID. kdbus itself doesn't
> +use that notation, neither internally nor externally. However, libraries and
> +other userspace code that aims for compatibility to D-Bus might.

s/compatibility to/compatibility with/

> +
> +3.2 Flags
> +---------
> +
> +All ioctls used in the communication with the driver contain three 64-bit
> +fields: 'flags', 'kernel_flags' and 'return_flags'. All of them are specific
> +to the ioctl used.
> +
> +In 'flags', the behavior of the command can be tweaked. All bits that are not
> +recognized by the kernel in this field are rejected, and the ioctl fails with
> +-EINVAL.
> +
> +In 'kernel_flags', the kernel driver writes back the mask of supported bits
> +upon each call, and sets the KDBUS_FLAGS_KERNEL bit. This is a way to probe
> +possible kernel features and make userspace code forward and backward
> +compatible.

So, is "kernel_flags' a bounding superset of what the caller may specify 
in 'flags'? If yes, please make that clearer.

> +
> +In 'return_flags', the kernel can return results of the command, in addition
> +to the actual return value. This is mostly to inform userspace about non-fatal
> +conditions that occurred during the execution of the command.
> +
> +
> +4. Items
> +===============================================================================
> +
> +To flexibly augment transport structures, data blobs of type struct kdbus_item
> +can be attached to the structs passed into the ioctls. Some ioctls make items
> +of certain types mandatory, others are optional. Unsupported items will cause
> +the ioctl to fail -EINVAL.
> +
> +The total size of an item is variable and is in some cases defined by the item
> +type. In other cases, they can be of arbitrary length (for instance, a string).
> +
> +Items are also used for information stored in a connection's pool, such as
> +received messages, name lists or requested connection or bus owner information.
> +
> +Whenever items are used as part of the kdbus kernel API, they are embedded in
> +structs that have an overall size of their own, so there can be multiple items

"have an overall size of their own" as hard to grok. I think you mean

    ... are embedded insides structs that themselves include a size field
    containing the overall size of the structure. This allows multiple 
    items per ioctl.

> +per ioctl.
> +
> +The kernel expects all items to be aligned to 8-byte boundaries. Unaligned
> +items or such that are unsupported by the ioctl are rejected.

s/such/items/?
(Or otherwise replace "such" with whatever it actually means.)

> +A simple iterator in userspace would iterate over the items until the items
> +have reached the embedding structure's overall size. An example implementation
> +of such an iterator can be found in tools/testing/selftests/kdbus/kdbus-util.h.
> +
> +
> +5. Creation of new domains, buses and endpoints
> +===============================================================================
> +
> +
> +5.1 Domains
> +-----------
> +
> +A domain is a container of buses. Domains themselves do not provide any IPC
> +functionality. Their sole purpose is to manage buses allocated in their
> +domain. Each time kdbusfs is mounted, a new kdbus domain is created, with its
> +own 'control' file. The lifetime of the domain ends once the user has unmounted
> +the kdbusfs. If you mount kdbusfs multiple times, each will have its own kdbus
> +domain internally. 

What does that last sentence mean? Somehow it needs to be reworded to better
convey whatever sense it is trying to convey.

> Operations performed on one domain do not affect any
> +other domain.
> +
> +The full kdbusfs hierarchy, any sub-directory, or file can be bind-mounted to
> +an external mount point and will remain fully functional. The kdbus domain and
> +any linked resources stay available until the original mount and all subsequent
> +bind-mounts have been unmounted.
> +
> +During creation, domains pin the user-namespace of the creator and use
> +it as controlling user-namespace for this domain. Any user accounting is done

s/as/as the/

> +relative to that user-namespace.
> +
> +Newly created kdbus domains do not have any bus pre-created. The only resource
> +available is a 'control' file, which is used to manage kdbus domains.
> +Currently, 'control' files are exclusively used to create buses via

s/exclusively used/used exclusively/

> +KDBUS_CMD_BUS_MAKE, but further ioctls might be added in the future.
> +
> +
> +5.2 Buses
> +---------
> +
> +A bus is a shared resource between connections to transmit messages. Each bus

==> A bus is a resource that is shared between connections in order to 
    transmit messages

> +is independent and operations on the bus will not have any effect on other
> +buses. A bus is a management entity, that controls the addresses of its

s/,//

> +connections, their policies and message transactions performed via this bus.
> +
> +Each bus is bound to the domain it was created on. It has a custom name that is
> +unique across all buses of a domain. In kdbusfs, a bus is presented as a
> +directory. No operations can be performed on the bus itself, instead you need

s/,/;/

> +to perform those on an endpoint associated with the bus. Endpoints are

s/those/operations/

> +accessible as files underneath the bus directory. A default endpoint called
> +"bus" is provided on each bus.
> +
> +Bus names may be chosen freely except for one restriction: the name
> +must be prefixed with the numeric UID of the creator and a dash. This

s/UID/effective  UID/ 
(I assume it's the effective UID...)

> +is required to avoid namespace clashes between different users. When
> +creating a bus the name must be passed in properly formatted, or the

the name must be passed in properly formatted
==>
that name that is passed in must be properly formatted

> +kernel will refuse creation of the bus. Example: "1047-foobar" is an
> +OK name for a bus registered by a user with UID 1047. However,

s/OK/acceptable/

> +"1024-foobar" is not, and neither is "foobar".
> +The UID must be provided in the user-namespace of the parent domain.
> +
> +To create a new bus, you need to open the control file of a domain and run the

s/run/employ/ (One doesn't "run" a system call.)

> +KDBUS_CMD_BUS_MAKE ioctl. The control file descriptor that was used to issue
> +KDBUS_CMD_BUS_MAKE must not have been used for any other control-ioctl before

Maybe better:

have been used for any other control-ioctl before
==>
previously have been used for any other control-ioctl

> +and needs to be kept open for the entire life-time of the created bus. Closing

s/needs/must/ (just reads smoother)

> +it will immediately cleanup the entire bus and all its associated resources and
> +endpoints. Every control file descriptor can only be used to create a single
> +new bus; from that point on, it is not used for any further communication until
> +the final close().
> +
> +Each bus will generate a random, 128-bit UUID upon creation. It will be

/It/This UUID/

> +returned to creators of connections through kdbus_cmd_hello.id128 and can
> +be used by userspace to uniquely identify buses, even across different machines
> +or containers. The UUID will have its variant bits set to 'DCE', and denote
> +version 4 (random).

I find that last sentence rather difficult to grasp. I think more detail
needs to be added.

> +When creating buses, a variable list of items that must be passed in
> +the items array is expected otherwise bus creation will fail.

What does "a variable list of items that must be passed in
the items array" mean? Something needs fixing, I think.

> +
> +5.3 Endpoints
> +-------------
> +
> +Endpoints are entry points to a bus. By default, each bus has a default
> +endpoint called 'bus'. The bus owner has the ability to create custom
> +endpoints with specific names, permissions, and policy databases (see below).
> +An endpoint is presented as file underneath the directory of the parent bus.
> +
> +To create a custom endpoint, open the default endpoint ('bus') and use the
> +KDBUS_CMD_ENDPOINT_MAKE ioctl with "struct kdbus_cmd_make". Custom endpoints
> +always have a policy database that, by default, forbids any operation. You have
> +to explicitly install policy entries to allow any operation on this endpoint.
> +Once KDBUS_CMD_ENDPOINT_MAKE succeeded, this file descriptor will manage the
> +newly created endpoint resource. It cannot be used to manage further resources.
> +
> +Endpoint names may be chosen freely except for one restriction: the name
> +must be prefixed with the numeric UID of the creator and a dash. This

s/UID/effective UID/

> +is required to avoid namespace clashes between different users. When
> +creating an endpoint the name must be passed in properly formatted, or the

creating an endpoint the name must be passed in properly formatted
==>
creating an endpoint, the name that is passed in must be properly formatted

> +kernel will refuse creation of the endpoint. Example: "1047-foobar" is an
> +OK name for an endpoint registered by a user with UID 1047. However,

s/OK/acceptable/

> +"1024-foobar" is not, and neither is "foobar".

Because this text reads almost exactly as the bus text in 5.2, I did 
a double take here. I suggest making the text more distinct in each case.

So, for example:

    Example: "1047-my-endpoint" is an OK name for an endpoint registered 
    by a user with UID 1047. However, "1024-my-endpoint" is not, and 
    neither is "my-endpoint".

(And you could do similar in the bus text in section 5.2.)

> +The UID must be provided in the user-namespace of the parent domain.
> +
> +To create connections to a bus, you use KDBUS_CMD_HELLO. See section 6 for
> +details. Note that once KDBUS_CMD_HELLO succeeded, this file descriptor manages

Note that once KDBUS_CMD_HELLO succeeded,
==>
Note that after a successful KDBUS_CMD_HELLO,

> +the newly created connection resource. It cannot be used to manage further
> +resources.
> +
> +
> +5.4 Creating buses and endpoints
> +--------------------------------
> +
> +KDBUS_CMD_BUS_MAKE, and KDBUS_CMD_ENDPOINT_MAKE take a

s/,//

> +struct kdbus_cmd_make argument.
> +
> +struct kdbus_cmd_make {
> +  __u64 size;
> +    The overall size of the struct, including its items.
> +
> +  __u64 flags;
> +    The flags for creation.
> +
> +    KDBUS_MAKE_ACCESS_GROUP
> +      Make the bus or endpoint file group-accessible
> +
> +    KDBUS_MAKE_ACCESS_WORLD
> +      Make the bus or endpoint file world-accessible
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 return_flags;
> +    Flags returned by the kernel. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +
> +  struct kdbus_item items[0];
> +    A list of items that has specific meanings for KDBUS_CMD_BUS_MAKE

s/has/have/

> +    and KDBUS_CMD_ENDPOINT_MAKE (see above).
> +
> +    Following items are expected for KDBUS_CMD_BUS_MAKE:
> +    KDBUS_ITEM_MAKE_NAME
> +      Contains a string to identify the bus name.

So, up to here, I've seen no definition of 'kdbus_item', which leaves me 
asking questions like: what subfield is KDBUS_ITEM_MAKE_NAME stored in?
which subfield holds the pointer to the string?

Somewhere earlier,  kdbus_item needs to be exaplained in more detail, 
I think.

> +
> +    KDBUS_ITEM_BLOOM_PARAMETER
> +      Bus-wide bloom parameters passed in a dbus_bloom_parameter struct
> +
> +    KDBUS_ITEM_ATTACH_FLAGS_RECV
> +      An optional item that contains a set of required attach flags
> +      that connections must allow. This item is used as a negotiation
> +      measure during connection creation. If connections do not satisfy
> +      the bus requirements, they are not allowed on the bus.
> +      If not set, the bus does not require any metadata to be attached,

s/,/;/

> +      in this case connections are free to set their own attach flags.
> +
> +    KDBUS_ITEM_ATTACH_FLAGS_SEND
> +      An optional item that contains a set of attach flags that are
> +      returned to connections when they query the bus creator metadata.
> +      If not set, no metadata is returned.
> +
> +    Unrecognized items are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +
> +6. Connections
> +===============================================================================
> +
> +
> +6.1 Connection IDs and well-known connection names
> +--------------------------------------------------
> +
> +Connections are identified by their connection id, internally implemented as a
> +uint64_t counter. The IDs of every newly created bus start at 1, and every new
> +connection will increment the counter by 1. The ids are not reused.

Again, please change "ids" to IDs" throughout.

> +
> +In higher level tools, the user visible representation of a connection is
> +defined by the D-Bus protocol specification as ":1.<id>".
> +
> +Messages with a specific uint64_t destination id are directly delivered to
> +the connection with the corresponding id. Messages with the special destination
> +id KDBUS_DST_ID_BROADCAST are broadcast messages and are potentially delivered
> +to all known connections on the bus; clients interested in broadcast messages
> +need to subscribe to the specific messages they are interested, though before
> +any broadcast message reaches them.

The piece following the semicolon would be better as this separate sentence, 
I think:

    However, in order to receive any broadcast messages, clients must
    to subscribe to the specific messages in which they are interested.

> +
> +Messages synthesized and sent directly by the kernel will carry the special
> +source id KDBUS_SRC_ID_KERNEL (0).
> +
> +In addition to the unique uint64_t connection id, established connections can
> +request the ownership of well-known names, under which they can be found and
> +addressed by other bus clients. A well-known name is associated with one and
> +only one connection at a time. See section 8 on name acquisition and the
> +name registry, and the validity of names.
> +
> +Messages can specify the special destination id 0 and carry a well-known name
> +in the message data. Such a message is delivered to the destination connection
> +which owns that well-known name.
> +
> +  +-------------------------------------------------------------------------+
> +  | +---------------+     +---------------------------+                     |
> +  | | Connection    |     | Message                   | -----------------+  |
> +  | | :1.22         | --> | src: 22                   |                  |  |
> +  | |               |     | dst: 25                   |                  |  |
> +  | |               |     |                           |                  |  |
> +  | |               |     |                           |                  |  |
> +  | |               |     +---------------------------+                  |  |
> +  | |               |                                                    |  |
> +  | |               | <--------------------------------------+           |  |
> +  | +---------------+                                        |           |  |
> +  |                                                          |           |  |
> +  | +---------------+     +---------------------------+      |           |  |
> +  | | Connection    |     | Message                   | -----+           |  |
> +  | | :1.25         | --> | src: 25                   |                  |  |
> +  | |               |     | dst: 0xffffffffffffffff   | -------------+   |  |
> +  | |               |     |  (KDBUS_DST_ID_BROADCAST) |              |   |  |
> +  | |               |     |                           | ---------+   |   |  |
> +  | |               |     +---------------------------+          |   |   |  |
> +  | |               |                                            |   |   |  |
> +  | |               | <--------------------------------------------------+  |
> +  | +---------------+                                            |   |      |
> +  |                                                              |   |      |
> +  | +---------------+     +---------------------------+          |   |      |
> +  | | Connection    |     | Message                   | --+      |   |      |
> +  | | :1.55         | --> | src: 55                   |   |      |   |      |
> +  | |               |     | dst: 0 / org.foo.bar      |   |      |   |      |
> +  | |               |     |                           |   |      |   |      |
> +  | |               |     |                           |   |      |   |      |
> +  | |               |     +---------------------------+   |      |   |      |
> +  | |               |                                     |      |   |      |
> +  | |               | <------------------------------------------+   |      |
> +  | +---------------+                                     |          |      |
> +  |                                                       |          |      |
> +  | +---------------+                                     |          |      |
> +  | | Connection    |                                     |          |      |
> +  | | :1.81         |                                     |          |      |
> +  | | org.foo.bar   |                                     |          |      |
> +  | |               |                                     |          |      |
> +  | |               |                                     |          |      |
> +  | |               | <-----------------------------------+          |      |
> +  | |               |                                                |      |
> +  | |               | <----------------------------------------------+      |
> +  | +---------------+                                                       |
> +  +-------------------------------------------------------------------------+
> +
> +
> +6.2 Creating connections
> +------------------------
> +
> +A connection to a bus is created by opening an endpoint file of a bus and
> +becoming an active client with the KDBUS_CMD_HELLO ioctl. Every connected client
> +connection has a unique identifier on the bus and can address messages to every
> +other connection on the same bus by using the peer's connection id as the
> +destination.
> +
> +The KDBUS_CMD_HELLO ioctl takes the following struct as argument.
> +
> +struct kdbus_cmd_hello {
> +  __u64 size;
> +    The overall size of the struct, including all attached items.
> +
> +  __u64 flags;
> +    Flags to apply to this connection:
> +
> +    KDBUS_HELLO_ACCEPT_FD
> +      When this flag is set, the connection can be sent file descriptors
> +      as message payload. If it's not set, any attempt of doing so will

s/any attempt of doing so/an attempt to send file descriptors/

> +      result in -ECOMM on the sender's side.
> +
> +    KDBUS_HELLO_ACTIVATOR
> +      Make this connection an activator (see below). With this bit set,
> +      an item of type KDBUS_ITEM_NAME has to be attached which describes

s/attached which describes/attached. This item describes/

> +      the well-known name this connection should be an activator for.
> +
> +    KDBUS_HELLO_POLICY_HOLDER
> +      Make this connection a policy holder (see below). With this bit set,
> +      an item of type KDBUS_ITEM_NAME has to be attached which describes

s/attached which describes/attached. This item describes/

> +      the well-known name this connection should hold a policy for.
> +
> +    KDBUS_HELLO_MONITOR
> +      Make this connection an eaves-dropping connection. See section 6.8 for

s/eaves-dropping/eavesdropping/

> +      more information.
> +
> +To also receive broadcast messages,
   ^
Indentation error.

> +      the connection has to upload appropriate matches as well.
> +      This flag is only valid for privileged bus connections.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 return_flags;
> +    Flags returned by the kernel. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +
> +  __u64 attach_flags_send;
> +      Set the bits for metadata this connection permits to be sent to the
> +      receiving peer. Only metadata items that are both allowed to be sent by
> +      the sender and that are requested by the receiver will effectively be
> +      attached to the message eventually. Note, however, that the bus may

What does "eventually" mean here?

> +      optionally enforce some of those bits to be set. If the match fails,

s/enforce/require/ ?

> +      -ECONNREFUSED will be returned. In either case, this field will be set
> +      to the mask of metadata items that are enforced by the bus. The
> +      KDBUS_FLAGS_KERNEL bit will as well be set.
> +
> +  __u64 attach_flags_recv;
> +      Request the attachment of metadata for each message received by this
> +      connection. The metadata actually attached may actually augment the list

Seems like two "actually" in the previous line is one too many.

> +      of requested items. See section 13 for more details.
> +
> +  __u64 bus_flags;
> +      Upon successful completion of the ioctl, this member will contain the
> +      flags of the bus it connected to.
> +
> +  __u64 id;
> +      Upon successful completion of the ioctl, this member will contain the
> +      id of the new connection.
> +
> +  __u64 pool_size;
> +      The size of the communication pool, in bytes. The pool can be accessed
> +      by calling mmap() on the file descriptor that was used to issue the
> +      KDBUS_CMD_HELLO ioctl.
> +
> +  __u64 offset;
> +      The kernel will return the offset in the pool where returned details
> +      will be stored.
> +
> +  __u8 id128[16];
> +      Upon successful completion of the ioctl, this member will contain the
> +      128 bit wide UUID of the connected bus.

s/128 bit wide/128-bit/

> +
> +  struct kdbus_item items[0];
> +      Variable list of items to add optional additional information. The

s/to add optional additional/containing optional additional/

> +      following items are currently expected/valid:
> +
> +      KDBUS_ITEM_CONN_DESCRIPTION
> +        Contains a string to describes this connection's name, so it can be

s/to/that/

> +        identified later.
> +
> +      KDBUS_ITEM_NAME
> +      KDBUS_ITEM_POLICY_ACCESS
> +        For activators and policy holders only, combinations of these two
> +        items describe policy access entries (see section about policy).
> +
> +      KDBUS_ITEM_CREDS
> +      KDBUS_ITEM_PIDS
> +      KDBUS_ITEM_SECLABEL
> +        Privileged bus users may submit these types in order to create
> +        connections with faked credentials. This information will be returned
> +        when peer information is queried by KDBUS_CMD_CONN_INFO. See section
> +        13 for more information.
> +
> +      Items of other types are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +At the offset returned in the 'offset' field of struct kdbus_cmd_hello, the
> +kernel will store items of the following types:
> +
> +  KDBUS_ITEM_BLOOM_PARAMETER
> +      Bloom filter parameter as defined by the bus creator (see below).
> +
> +The offset in the pool has to be freed with the KDBUS_CMD_FREE ioctl.

As far as I can tell, KDBUS_CMD_FREE detailed anywhere in this file. It
needs a detailed description soemwhere.

> +
> +6.3 Activator and policy holder connection
> +------------------------------------------
> +
> +An activator connection is a placeholder for a well-known name. Messages sent
> +to such a connection can be used by userspace to start an implementer
> +connection, which will then get all the messages from the activator copied
> +over. An activator connection cannot be used to send any message.
> +
> +A policy holder connection only installs a policy for one or more names.
> +These policy entries are kept active as long as the connection is alive, and
> +are removed once it terminates. Such a policy connection type can be used to
> +deploy restrictions for names that are not yet active on the bus. A policy
> +holder connection cannot be used to send any message.
> +
> +The creation of activator, policy holder or monitor connections is an operation

What is a "monitor connection"? That term springs up unannounced. Is it
an "eavesdropping connection" as described above? Either define the term
"monitor connection" or use consistent terminology. (Actually, further down
in the document, it is clarified that "monitor connection" == "eavesdropper".
But that is not clear at THIS point in the document. It needs to be clearer.)

> +restricted to privileged users on the bus (see section "Terminology").
> +
> +
> +6.4 Retrieving information on a connection
> +------------------------------------------
> +
> +The KDBUS_CMD_CONN_INFO ioctl can be used to retrieve credentials and
> +properties of the initial creator of a connection. This ioctl uses the
> +following struct:
> +
> +struct kdbus_cmd_info {
> +  __u64 size;
> +    The overall size of the struct, including the name with its 0-byte string
> +    terminator.

"0-byte string terminator" reads strangely. I assume you mean "terminating 
null byte" / "null-terminated string" Best to use those more standard terms.

So, maybe:

    including the name with its terminating null byte

or:

    including the null-terminated 'name' string
    

There are multiple other instances of 0-byte in the doc, and I think they 
should also be fixed in a similar fashion.

> +
> +  __u64 flags;
> +    Specify which metadata items should be attached to the answer.

s/Specify/Specifies/

> +    See section 13 for more details.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 return_flags;
> +    Flags returned by the kernel. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +
> +  __u64 id;
> +    The connection's numerical ID to retrieve information for. If set to

Hard to parse sentence.

==> The numerical ID of the connection for which information is to 
    be retrieved.

> +    non-zero value, the 'name' field is ignored.

s/non-zero value/a non-zero value/

> +
> +  __u64 offset;
> +    When the ioctl returns, this value will yield the offset of the connection

s/value will yield/field will contain/

> +    information inside the caller's pool.
> +
> +  __u64 info_size;
> +    The kernel will return the size of the returned information, so applications
> +    can optionally mmap specific parts of the pool.
> +
> +  struct kdbus_item items[0];
> +    The optional item list, containing the well-known name to look up as
> +    a KDBUS_ITEM_OWNED_NAME. Only required if the 'id' field is set to 0.
> +    Items of other types are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +After the ioctl returns, the following struct will be stored in the caller's
> +pool at 'offset'.
> +
> +struct kdbus_info {
> +  __u64 size;
> +    The overall size of the struct, including all its items.
> +
> +  __u64 id;
> +    The connection's unique ID.
> +
> +  __u64 flags;
> +    The connection's flags as specified when it was created.
> +
> +  struct kdbus_item items[0];
> +    Depending on the 'flags' field in struct kdbus_cmd_info, items of
> +    types KDBUS_ITEM_OWNED_NAME and KDBUS_ITEM_CONN_DESCRIPTION are followed

s/are followed/follow/

> +    here.
> +};
> +
> +Once the caller is finished with parsing the return buffer, it needs to call
> +KDBUS_CMD_FREE for the offset.
> +
> +
> +6.5 Getting information about a connection's bus creator
> +--------------------------------------------------------
> +
> +The KDBUS_CMD_BUS_CREATOR_INFO ioctl takes the same struct as
> +KDBUS_CMD_CONN_INFO but is used to retrieve information about the creator of
> +the bus the connection is attached to. The metadata returned by this call is
> +collected during the creation of the bus and is never altered afterwards, so
> +it provides pristine information on the task that created the bus, at the
> +moment when it did so.
> +
> +In response to this call, a slice in the connection's pool is allocated and
> +filled with an object of type struct kdbus_info, pointed to by the ioctl's
> +'offset' field.
> +
> +struct kdbus_info {
> +  __u64 size;
> +    The overall size of the struct, including all its items.
> +
> +  __u64 id;
> +    The bus ID
> +
> +  __u64 flags;
> +    The bus flags as specified when it was created.

s/it/the bus/

> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  struct kdbus_item items[0];
> +    Metadata information is stored in items here. The item list contains
> +    a KDBUS_ITEM_MAKE_NAME item that indicates the bus name of the
> +    calling connection.
> +};
> +
> +Once the caller is finished with parsing the return buffer, it needs to call
> +KDBUS_CMD_FREE for the offset.
> +
> +
> +6.6 Updating connection details
> +-------------------------------
> +
> +Some of a connection's details can be updated with the KDBUS_CMD_CONN_UPDATE
> +ioctl, using the file descriptor that was used to create the connection.
> +The update command uses the following struct.
> +
> +struct kdbus_cmd_update {
> +  __u64 size;
> +    The overall size of the struct, including all its items.
> +
> +  __u64 flags;
> +    Currently no flags are supported. Reserved for future use.

> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 return_flags;
> +    Flags returned by the kernel. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +
> +  struct kdbus_item items[0];
> +    Items to describe the connection details to be updated. The following item
> +    types are supported:
> +
> +    KDBUS_ITEM_ATTACH_FLAGS_SEND
> +      Supply a new set of items that this connection permits to be sent along
> +      with messages.
> +
> +    KDBUS_ITEM_ATTACH_FLAGS_RECV
> +      Supply a new set of items to be attached to each message.
> +
> +    KDBUS_ITEM_NAME
> +    KDBUS_ITEM_POLICY_ACCESS
> +      Policy holder connections may supply a new set of policy information
> +      with these items. For other connection types, -EOPNOTSUPP is returned.
> +
> +    Items of other types are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +
> +6.7 Termination
> +---------------
> +
> +A connection can be terminated by simply closing its file descriptor. All
> +pending incoming messages will be discarded, and the memory in the pool will
> +be freed.
> +
> +An alternative way of closing down a connection is calling the KDBUS_CMD_BYEBYE
> +ioctl on it, which will only succeed if the message queue of the connection is
> +empty at the time of closing, otherwise, -EBUSY is returned.

The preceding is a little hard to parse. I suggest:

    An alternative way of closing down a connection is via the KDBUS_CMD_BYEBYE
    ioctl. This ioctly will succeed only if the message queue of the connection is
    empty at the time of closing; otherwise, -EBUSY is returned.

> +
> +When this ioctl returns successfully, the connection has been terminated and
> +won't accept any new messages from remote peers. This way, a connection can
> +be terminated race-free, without losing any messages.
> +
> +
> +6.8 Monitor connections ('eavesdropper')
> +----------------------------------------
> +
> +Eavesdropping connections are created by setting the KDBUS_HELLO_MONITOR flag
> +in struct kdbus_hello.flags. Such connections have all properties of any other,
> +regular connection, except for the following details:
> +
> +  * They will get every message sent over the bus, both unicasts and broadcasts
> +
> +  * Installing matches for broadcast messages is neither necessary nor allowed
> +
> +  * They cannot send messages or be directly addressed as receiver
> +
> +  * Their creation and destruction will not cause KDBUS_ITEM_ID_{ADD,REMOVE}
> +    notifications to be generated, so other connections cannot detect the
> +    presence of an eavesdropper.
> +
> +
> +7. Messages
> +===============================================================================
> +
> +Messages consist of a fixed-size header followed directly by a list of
> +variable-sized data 'items'. The overall message size is specified in the
> +header of the message. The chain of data items can contain well-defined
> +message metadata fields, raw data, references to data, or file descriptors.
> +
> +
> +7.1 Sending messages
> +--------------------
> +
> +Messages are passed to the kernel with the KDBUS_CMD_SEND ioctl. Depending
> +on the destination address of the message, the kernel delivers the message to
> +the specific destination connection or to all connections on the same bus.
> +Sending messages across buses is not possible. Messages are always queued in
> +the memory pool of the destination connection (see below).
> +
> +The KDBUS_CMD_SEND ioctl uses struct kdbus_cmd_send to describe the message
> +transfer.
> +
> +struct kdbus_cmd_send {
> +  __u64 size;
> +    The overall size of the struct, including the attached items.
> +
> +  __u64 flags;
> +    Flags for message delivery:
> +
> +    KDBUS_SEND_SYNC_REPLY
> +      By default, all calls to kdbus are considered asynchronous,
> +      non-blocking. However, as there are many use cases that need to wait
> +      for a remote peer to answer a method call, there's a way to send a
> +      message and wait for a reply in a synchronous fashion. This is what
> +      the KDBUS_MSG_SYNC_REPLY controls. The KDBUS_CMD_SEND ioctl will block
> +      until the reply has arrived, the timeout limit is reached, in case the
> +      remote connection was shut down, or if interrupted by a signal before
> +      any reply; see signal(7).
> +
> +      The offset of the reply message in the sender's pool is stored in in
> +      'offset_reply' when the ioctl has returned without error. Hence, there
> +      is no need for another KDBUS_CMD_RECV ioctl or anything else to receive
> +      the reply.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call of
> +    KDBUS_CMD_SEND.
> +
> +  __u64 kernel_msg_flags;
> +    Valid bits for message flags, returned by the kernel upon each call of
> +    KDBUS_CMD_SEND.
> +
> +  __u64 return_flags;
> +    Kernel-provided flags, returning non-fatal errors that occurred during
> +    send. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +
> +  __u64 msg_address;
> +    Userspace has to provide a pointer to a message (struct kdbus_msg) to send.
> +
> +  struct kdbus_msg_info reply;
> +    Only used for synchronous replies. See description of struct kdbus_cmd_recv
> +    for more details.
> +
> +  struct kdbus_item items[0];
> +    The following items are currently recognized:
> +
> +    KDBUS_ITEM_CANCEL_FD
> +      When this optional item is passed in, and the call is executed as SYNC
> +      call, the passed in file descriptor can be used as alternative
> +      cancellation point. The kernel will call poll() on this file descriptor,
> +      and if it reports any incoming bytes, the blocking send operation will
> +      be canceled, and the call will return -ECANCELED. Any type of file
> +      descriptor that implements poll() can be used as payload to this item.
> +      For asynchronous message sending, this item is accepted but ignored.
> +
> +    All other items are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +The message referenced by 'msg_address' above has the following layout.
> +
> +struct kdbus_msg {
> +  __u64 size;
> +    The overall size of the struct, including the attached items.
> +
> +  __u64 flags;
> +    KDBUS_MSG_EXPECT_REPLY
> +      Expect a reply from the remote peer to this message. With this bit set,

s/from the remote peer to this message/to this message from the remote peer/

> +      the timeout_ns field must be set to a non-zero number of nanoseconds in
> +      which the receiving peer is expected to reply. If such a reply is not
> +      received in time, the sender will be notified with a timeout message
> +      (see below). The value must be an absolute value, in nanoseconds and
> +      based on CLOCK_MONOTONIC.

Why is the option of a relative timeout not available?

> +      For a message to be accepted as reply, it must be a direct message to
> +      the original sender (not a broadcast), and its kdbus_msg.reply_cookie
> +      must match the previous message's kdbus_msg.cookie.
> +
> +      Expected replies also temporarily open the policy of the sending
> +      connection, so the other peer is allowed to respond within the given
> +      time window.
> +
> +    KDBUS_MSG_NO_AUTO_START
> +      By default, when a message is sent to an activator connection, the
> +      activator notified and will start an implementer. This flag inhibits
> +      that behavior. With this bit set, and the remote being an activator,
> +      -EADDRNOTAVAIL is returned from the ioctl.
> +
> +  __s64 priority;
> +    The priority of this message. Receiving messages (see below) may
> +    optionally be constrained to messages of a minimal priority. This
> +    allows for use cases where timing critical data is interleaved with
> +    control data on the same connection. If unused, the priority should be
> +    set to zero.
> +
> +  __u64 dst_id;
> +    The numeric ID of the destination connection, or KDBUS_DST_ID_BROADCAST
> +    (~0ULL) to address every peer on the bus, or KDBUS_DST_ID_NAME (0) to look
> +    it up dynamically from the bus' name registry. In the latter case, an item
> +    of type KDBUS_ITEM_DST_NAME is mandatory.
> +
> +  __u64 src_id;
> +    Upon return of the ioctl, this member will contain the sending
> +    connection's numerical ID. Should be 0 at send time.
> +
> +  __u64 payload_type;
> +    Type of the payload in the actual data records. Currently, only
> +    KDBUS_PAYLOAD_DBUS is accepted as input value of this field. When
> +    receiving messages that are generated by the kernel (notifications),
> +    this field will yield KDBUS_PAYLOAD_KERNEL.

s/yield/cointain/ ?

> +
> +  __u64 cookie;
> +    Cookie of this message, for later recognition. Also, when replying
> +    to a message (see above), the cookie_reply field must match this value.
> +
> +  __u64 timeout_ns;
> +    If the message sent requires a reply from the remote peer (see above),
> +    this field contains the timeout in absolute nanoseconds based on
> +    CLOCK_MONOTONIC.
> +
> +  __u64 cookie_reply;
> +    If the message sent is a reply to another message, this field must
> +    match the cookie of the formerly received message.
> +
> +  struct kdbus_item items[0];
> +    A dynamically sized list of items to contain additional information.
> +    The following items are expected/valid:
> +
> +    KDBUS_ITEM_PAYLOAD_VEC
> +    KDBUS_ITEM_PAYLOAD_MEMFD
> +    KDBUS_ITEM_FDS
> +      Actual data records containing the payload. See section "Passing of
> +      Payload Data".
> +
> +    KDBUS_ITEM_BLOOM_FILTER
> +      Bloom filter for matches (see below).
> +
> +    KDBUS_ITEM_DST_NAME
> +      Well-known name to send this message to. Required if dst_id is set
> +      to KDBUS_DST_ID_NAME. If a connection holding the given name can't
> +      be found, -ESRCH is returned.
> +      For messages to a unique name (ID), this item is optional. If present,
> +      the kernel will make sure the name owner matches the given unique name.
> +      This allows userspace tie the message sending to the condition that a
> +      name is currently owned by a certain unique name.
> +};
> +
> +The message will be augmented by the requested metadata items when queued into
> +the receiver's pool. See also section 13.2 ("Metadata and namespaces").
> +
> +
> +7.2 Message layout
> +------------------
> +
> +The layout of a message is shown below.
> +
> +  +-------------------------------------------------------------------------+
> +  | Message                                                                 |
> +  | +---------------------------------------------------------------------+ |
> +  | | Header                                                              | |
> +  | | size:          overall message size, including the data records     | |
> +  | | destination:   connection id of the receiver                        | |
> +  | | source:        connection id of the sender (set by kernel)          | |
> +  | | payload_type:  "DBusDBus" textual identifier stored as uint64_t     | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | Data Record                                                         | |
> +  | | size:  overall record size (without padding)                        | |
> +  | | type:  type of data                                                 | |
> +  | | data:  reference to data (address or file descriptor)               | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | padding bytes to the next 8 byte alignment                          | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | Data Record                                                         | |
> +  | | size:  overall record size (without padding)                        | |
> +  | | ...                                                                 | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | padding bytes to the next 8 byte alignment                          | |
> +  | +---------------------------------------------------------------------+ |
> +  | +---------------------------------------------------------------------+ |
> +  | | Data Record                                                         | |
> +  | | size:  overall record size                                          | |
> +  | | ...                                                                 | |
> +  | +---------------------------------------------------------------------+ |
> +  |   ... further data records ...                                          |
> +  +-------------------------------------------------------------------------+
> +
> +
> +7.3 Passing of Payload Data
> +---------------------------
> +
> +When connecting to the bus, receivers request a memory pool of a given size,
> +large enough to carry all backlog of data enqueued for the connection. The
> +pool is internally backed by a shared memory file which can be mmap()ed by
> +the receiver.
> +
> +KDBUS_MSG_PAYLOAD_VEC:
> +  Messages are directly copied by the sending process into the receiver's pool,

s/,/./ and then start a new sentence.

> +  that way two peers can exchange data by effectively doing a single-copy from
> +  one process to another, the kernel will not buffer the data anywhere else.
s/,/;/

> +
> +KDBUS_MSG_PAYLOAD_MEMFD:
> +  Messages can reference memfd files which contain the data.
> +  memfd files are tmpfs-backed files that allow sealing of the content of the
> +  file, which prevents all writable access to the file content.
> +  Only memfds that have (F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE|F_SEAL_SEAL) set
> +  are accepted as payload data, which enforces reliable passing of data.
> +  The receiver can assume that neither the sender nor anyone else can alter the
> +  content after the message is sent.
> +  Apart from the sender filling-in the content into memfd files, the data will
> +  be passed as zero-copy from one process to another, read-only, shared between
> +  the peers.
> +
> +The sender must not make any assumptions on the type how data is received by the

Wording error at "the type how". Some fix is needed.

> +remote peer. The kernel is free to re-pack multiple VEC and MEMFD payloads. For
> +instance, the kernel may decide to merge multiple VECs into a single VEC, inline
> +MEMFD payloads into memory or merge all passed VECs into a single MEMFD.
> +However, the kernel preserves the order of passed data. This means, the order of

s/,//

> +all VEC and MEMFD items is not changed in respect to each other.

s/in/with/

> +
> +In other words: All passed VEC and MEMFD data payloads are treated as a single
> +stream of data that may be received by the remote peer in a different set of
> +hunks than it was sent as.
> +
> +
> +7.4 Receiving messages
> +----------------------
> +
> +Messages are received by the client with the KDBUS_CMD_RECV ioctl. The endpoint
> +file of the bus supports poll() to wake up the receiving process when new

s%poll()%poll/select()/epoll%  ?

> +messages are queued up to be received.
> +
> +With the KDBUS_CMD_RECV ioctl, a struct kdbus_cmd_recv is used.
> +
> +struct kdbus_cmd_recv {
> +  __u64 size;
> +    The overall size of the struct, including the attached items.
> +
> +  __u64 flags;
> +    Flags to control the receive command.
> +
> +    KDBUS_RECV_PEEK
> +      Just return the location of the next message. Do not install file
> +      descriptors or anything else. This is usually used to determine the
> +      sender of the next queued message.
> +
> +    KDBUS_RECV_DROP
> +      Drop the next message without doing anything else with it, and free the
> +      pool slice. This a short-cut for KDBUS_RECV_PEEK and KDBUS_CMD_FREE.
> +
> +    KDBUS_RECV_USE_PRIORITY
> +      Use the priority field (see below).
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 return_flags;
> +    Kernel-provided flags, returning non-fatal errors that occurred during
> +    send. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +
> +  __s64 priority;
> +    With KDBUS_RECV_USE_PRIORITY set in flags, receive the next message in
> +    the queue with at least the given priority. If no such message is waiting
> +    in the queue, -ENOMSG is returned.

###
How do I simply select the highest priority message, without knowing what 
its priority is?

> +
> +  __u64 dropped_msgs;
> +    If the CMD_RECV ioctl fails with EOVERFLOW, this field is filled by
> +    the kernel with the number of messages that couldn't be transmitted to
> +    this connection. In that case, the @offset member must not be accessed.
> +
> +  struct kdbus_msg_info msg;
> +   Embedded struct to be filled when the command succeeded (see below).
> +
> +  struct kdbus_item items[0];
> +    Items to specify further details for the receive command. Currently unused.
> +};
> +
> +Both 'struct kdbus_cmd_recv' and 'struct kdbus_cmd_send' embed 'struct
> +kdbus_msg_info'. For the SEND ioctl, it is used to catch synchronous replies,
> +if one was requested, and is unused otherwise.
> +
> +struct kdbus_msg_info {
> +  __u64 offset;
> +    Upon return of the ioctl, this field contains the offset in the receiver's
> +    memory pool. The memory must be freed with KDBUS_CMD_FREE.
> +
> +  __u64 msg_size;
> +    Upon successful return of the ioctl, this field contains the size of the
> +    allocated slice at offset @offset. It is the combination of the size of
> +    the stored kdbus_msg object plus all appended VECs. You can use it in
> +    combination with @offset to map a single message, instead of mapping the
> +    whole pool.
> +
> +  __u64 return_flags;
> +    Kernel-provided return flags. Currently, the following flags are defined.
> +
> +      KDBUS_RECV_RETURN_INCOMPLETE_FDS
> +        The message contained file descriptors which couldn't be installed
> +        into the receiver's task. Most probably that happened because the
> +        maximum number of file descriptors for that task were exceeded.
> +        The message is still delivered, so this is not a fatal condition.
> +        File descriptors inside the KDBUS_ITEM_FDS item that could not be
> +        installed will be set to -1.
> +};
> +
> +Unless KDBUS_RECV_DROP was passed, and given that the ioctl succeeded, the

s/and given that the ioctl succeeded/after a successful KDBUS_CMD_RECV ioctl/

> +offset field contains the location of the new message inside the receiver's
> +pool. The message is stored as struct kdbus_msg at this offset, and can be
> +interpreted with the semantics described above.
> +
> +Also, if the connection allowed for file descriptor to be passed
> +(KDBUS_HELLO_ACCEPT_FD), and if the message contained any, they will be
> +installed into the receiving process after the KDBUS_CMD_RECV ioctl returns.

###
"after"??? When exactly?

> +The receiving task is obliged to close all of them appropriately. If
> +KDBUS_RECV_PEEK is set, no file descriptors are installed. This allows for
> +peeking at a message and dropping it via KDBUS_RECV_DROP, without installing
> +the passed file descriptors into the receiving process.
> +
> +The caller is obliged to call KDBUS_CMD_FREE with the returned offset when
> +the memory is no longer needed.
> +
> +
> +8. Name registry
> +===============================================================================
> +
> +Each bus instantiates a name registry to resolve well-known names into unique
> +connection IDs for message delivery. The registry will be queried when a
> +message is sent with kdbus_msg.dst_id set to KDBUS_DST_ID_NAME, or when a
> +registry dump is requested.
> +
> +All of the below is subject to policy rules for SEE and OWN permissions.
> +
> +
> +8.1 Name validity
> +-----------------
> +
> +A name has to comply to the following rules to be considered valid:

s/comply to/comply with/

> +
> + - The name has two or more elements separated by a period ('.') character
> + - All elements must contain at least one character
> + - Each element must only contain the ASCII characters "[A-Z][a-z][0-9]_"
> +   and must not begin with a digit
> + - The name must contain at least one '.' (period) character
> +   (and thus at least two elements)
> + - The name must not begin with a '.' (period) character
> + - The name must not exceed KDBUS_NAME_MAX_LEN (255)
> +
> +
> +8.2 Acquiring a name
> +--------------------
> +
> +To acquire a name, a client uses the KDBUS_CMD_NAME_ACQUIRE ioctl with the
> +following data structure.
> +
> +struct kdbus_cmd_name {
> +  __u64 size;
> +    The overall size of this struct, including the name with its 0-byte string
> +    terminator.
> +
> +  __u64 flags;
> +    Flags to control details in the name acquisition.
> +
> +    KDBUS_NAME_REPLACE_EXISTING
> +      Acquiring a name that is already present usually fails, unless this flag
> +      is set in the call, and KDBUS_NAME_ALLOW_REPLACEMENT or (see below) was
> +      set when the current owner of the name acquired it, or if the current
> +      owner is an activator connection (see below).
> +
> +    KDBUS_NAME_ALLOW_REPLACEMENT
> +      Allow other connections to take over this name. When this happens, the
> +      former owner of the connection will be notified of the name loss.
> +
> +    KDBUS_NAME_QUEUE (acquire)
> +      A name that is already acquired by a connection, and which wasn't
> +      requested with the KDBUS_NAME_ALLOW_REPLACEMENT flag set can not be
> +      acquired again. However, a connection can put itself in a queue of
> +      connections waiting for the name to be released. Once that happens, the
> +      first connection in that queue becomes the new owner and is notified
> +      accordingly.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 return_flags;
> +    Flags returned by the kernel. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +
> +  struct kdbus_item items[0];
> +    Items to submit the name. Currently, one item of type KDBUS_ITEM_NAME is
> +    expected and allowed, and the contained string must be a valid bus name.
> +    Items of other types are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +
> +8.3 Releasing a name
> +--------------------
> +
> +A connection may release a name explicitly with the KDBUS_CMD_NAME_RELEASE
> +ioctl. If the connection was an implementer of an activatable name, its
> +pending messages are moved back to the activator. If there are any connections
> +queued up as waiters for the name, the oldest one of them will become the new
> +owner. The same happens implicitly for all names once a connection terminates.
> +
> +The KDBUS_CMD_NAME_RELEASE ioctl uses the same data structure as the
> +acquisition call, but with slightly different field usage.
> +
> +struct kdbus_cmd_name {
> +  __u64 size;
> +    The overall size of this struct, including the name with its 0-byte string
> +    terminator.
> +
> +  __u64 flags;
> +    Flags to the command. Currently unused.

And, so presumably must be 0?  Best to note that.

> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 return_flags;
> +    Flags returned by the kernel. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +  struct kdbus_item items[0];
> +    Items to submit the name. Currently, one item of type KDBUS_ITEM_NAME is
> +    expected and allowed, and the contained string must be a valid bus name.
> +};
> +
> +
> +8.4 Dumping the name registry
> +-----------------------------
> +
> +A connection may request a complete or filtered dump of currently active bus
> +names with the KDBUS_CMD_NAME_LIST ioctl, which takes a struct
> +kdbus_cmd_name_list as argument.
> +
> +struct kdbus_cmd_name_list {
> +  __u64 flags;
> +    Any combination of flags to specify which names should be dumped.
> +
> +    KDBUS_NAME_LIST_UNIQUE
> +      List the unique (numeric) IDs of the connection, whether it owns a name
> +      or not.
> +
> +    KDBUS_NAME_LIST_NAMES
> +      List well-known names stored in the database which are actively owned by
> +      a real connection (not an activator).
> +
> +    KDBUS_NAME_LIST_ACTIVATORS
> +      List names that are owned by an activator.
> +
> +    KDBUS_NAME_LIST_QUEUED
> +      List connections that are not yet owning a name but are waiting for it
> +      to become available.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 return_flags;
> +    Flags returned by the kernel. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +
> +  __u64 offset;
> +    When the ioctl returns successfully, the offset to the name registry dump
> +    inside the connection's pool will be stored in this field.
> +};
> +
> +The returned list of names is stored in a struct kdbus_name_list that in turn
> +contains a dynamic number of struct kdbus_cmd_name that carry the actual
> +information. The fields inside that struct kdbus_cmd_name is described next.
> +
> +struct kdbus_name_info {
> +  __u64 size;
> +    The overall size of this struct, including the name with its 0-byte string
> +    terminator.
> +
> +  __u64 owner_id;
> +    The owning connection's unique ID.
> +
> +  __u64 conn_flags;
> +    The flags of the owning connection.
> +
> +  struct kdbus_item items[0];
> +    Items containing the actual name. Currently, one item of type
> +    KDBUS_ITEM_OWNED_NAME will be attached, including the name's flags. In that
> +    item, the flags field of the name may carry the following bits:
> +
> +    KDBUS_NAME_ALLOW_REPLACEMENT
> +      Other connections are allowed to take over this name from the
> +      connection that owns it.
> +
> +    KDBUS_NAME_IN_QUEUE (list)
> +      When retrieving a list of currently acquired name in the registry, this
> +      flag indicates whether the connection actually owns the name or is
> +      currently waiting for it to become available.
> +
> +    KDBUS_NAME_ACTIVATOR (list)
> +      An activator connection owns a name as a placeholder for an implementer,
> +      which is started on demand as soon as the first message arrives. There's
> +      some more information on this topic below. In contrast to
> +      KDBUS_NAME_REPLACE_EXISTING, when a name is taken over from an activator
> +      connection, all the messages that have been queued in the activator
> +      connection will be moved over to the new owner. The activator connection
> +      will still be tracked for the name and will take control again if the
> +      implementer connection terminates.
> +      This flag can not be used when acquiring a name, but is implicitly set
> +      through KDBUS_CMD_HELLO with KDBUS_HELLO_ACTIVATOR set in
> +      kdbus_cmd_hello.conn_flags.
> +};
> +
> +The returned buffer must be freed with the KDBUS_CMD_FREE ioctl when the user
> +is finished with it.
> +
> +
> +9. Notifications
> +===============================================================================
> +
> +The kernel will notify its users of the following events.
> +
> +  * When connection A is terminated while connection B is waiting for a reply
> +    from it, connection B is notified with a message with an item of type
> +    KDBUS_ITEM_REPLY_DEAD.
> +
> +  * When connection A does not receive a reply from connection B within the
> +    specified timeout window, connection A will receive a message with an item
> +    of type KDBUS_ITEM_REPLY_TIMEOUT.
> +
> +  * When an ordinary connection (not a monitor) is created on or removed from
> +    a bus, messages with an item of type KDBUS_ITEM_ID_ADD or
> +    KDBUS_ITEM_ID_REMOVE, respectively, are sent to all bus members that match
> +    these messages through their match database. Eavesdroppers (monitor
> +    connections) do not cause such notifications to be sent. They are invisible
> +    on the bus.
> +
> +  * When a connection gains or loses ownership of a name, messages with an item
> +    of type KDBUS_ITEM_NAME_ADD, KDBUS_ITEM_NAME_REMOVE or
> +    KDBUS_ITEM_NAME_CHANGE are sent to all bus members that match these
> +    messages through their match database.
> +
> +A kernel notification is a regular kdbus message with the following details.
> +
> +  * kdbus_msg.src_id == KDBUS_SRC_ID_KERNEL
> +  * kdbus_msg.dst_id == KDBUS_DST_ID_BROADCAST
> +  * kdbus_msg.payload_type == KDBUS_PAYLOAD_KERNEL
> +  * Has exactly one of the aforementioned items attached
> +
> +Kernel notifications have an item of type KDBUS_ITEM_TIMESTAMP attached.
> +
> +
> +10. Message Matching, Bloom filters
> +===============================================================================
> +
> +10.1 Matches for broadcast messages from other connections
> +----------------------------------------------------------
> +
> +A message addressed at the connection ID KDBUS_DST_ID_BROADCAST (~0ULL) is a
> +broadcast message, delivered to all connected peers which installed a rule to
> +match certain properties of the message. Without any rules installed in the
> +connection, no broadcast message or kernel-side notifications will be delivered
> +to the connection. Broadcast messages are subject to policy rules and TALK
> +access checks.
> +
> +See section 11 for details on policies, and section 11.5 for more
> +details on implicit policies.
> +
> +Matches for messages from other connections (not kernel notifications) are
> +implemented as bloom filters. The sender adds certain properties of the message
> +as elements to a bloom filter bit field, and sends that along with the
> +broadcast message.
> +
> +The connection adds the message properties it is interested as elements to a
> +bloom mask bit field, and uploads the mask to the match rules of the
> +connection.
> +
> +The kernel will match the broadcast message's bloom filter against the
> +connections bloom mask (simply by &-ing it), and decide whether the message
> +should be delivered to the connection.
> +
> +The kernel has no notion of any specific properties of the message, all it
> +sees are the bit fields of the bloom filter and mask to match against. The
> +use of bloom filters allows simple and efficient matching, without exposing
> +any message properties or internals to the kernel side. Clients need to deal
> +with the fact that they might receive broadcasts which they did not subscribe
> +to, as the bloom filter might allow false-positives to pass the filter.
> +
> +To allow the future extension of the set of elements in the bloom filter, the
> +filter specifies a "generation" number. A later generation must always contain
> +all elements of the set of the previous generation, but can add new elements
> +to the set. The match rules mask can carry an array with all previous
> +generations of masks individually stored. When the filter and mask are matched
> +by the kernel, the mask with the closest matching "generation" is selected
> +as the index into the mask array.
> +
> +
> +10.2 Matches for kernel notifications
> +------------------------------------
> +
> +To receive kernel generated notifications (see section 9), a connection must
> +install special match rules that are different from the bloom filter matches
> +described in the section above. They can be filtered by a sender connection's
> +ID, by one of the name the sender connection owns at the time of sending the
> +message, or by type of the notification (id/name add/remove/change).

s/type/the type/

> +
> +10.3 Adding a match
> +-------------------
> +
> +To add a match, the KDBUS_CMD_MATCH_ADD ioctl is used, which takes a struct
> +of the struct described below.
> +
> +Note that each of the items attached to this command will internally create
> +one match 'rule', and the collection of them, which is submitted as one block
> +via the ioctl is called a 'match'. To allow a message to pass, all rules of a

s/ioctl/ioctl,/

> +match have to be satisfied. Hence, adding more items to the command will only
> +narrow the possibility of a match to effectively let the message pass, and will
> +cause the connection's user space process to wake up less likely.

Make that last line

decrease the chance that the connection's userspace process wil be woken up

> +
> +Multiple matches can be installed per connection. As long as one of it has a

s/it/them/ ?
(If that change is not correct, then the sentence is quite confused.)

> +set of rules which allows the message to pass, this one will be decisive.
> +
> +struct kdbus_cmd_match {
> +  __u64 size;
> +    The overall size of the struct, including its items.
> +
> +  __u64 cookie;
> +    A cookie which identifies the match, so it can be referred to at removal
> +    time.
> +
> +  __u64 flags;
> +    Flags to control the behavior of the ioctl.
> +
> +    KDBUS_MATCH_REPLACE:
> +      Remove all entries with the given cookie before installing the new one.
> +      This allows for race-free replacement of matches.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 return_flags;
> +    Flags returned by the kernel. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +
> +  struct kdbus_item items[0];
> +    Items to define the actual rules of the matches. The following item types
> +    are expected. Each item will cause one new match rule to be created.
> +
> +    KDBUS_ITEM_BLOOM_MASK
> +      An item that carries the bloom filter mask to match against in its
> +      data field. The payload size must match the bloom filter size that
> +      was specified when the bus was created.
> +      See section 10.4 for more information.
> +
> +    KDBUS_ITEM_NAME
> +      Specify a name that a sending connection must own at a time of sending

s/a time/the time/

> +      a broadcast message in order to match this rule.
> +
> +    KDBUS_ITEM_ID
> +      Specify a sender connection's ID that will match this rule.
> +
> +    KDBUS_ITEM_NAME_ADD
> +    KDBUS_ITEM_NAME_REMOVE
> +    KDBUS_ITEM_NAME_CHANGE
> +      These items request delivery of broadcast messages that describe a name
> +      acquisition, loss, or change. The details are stored in the item's
> +      kdbus_notify_name_change member. All information specified must be
> +      matched in order to make the message pass. Use KDBUS_MATCH_ID_ANY to
> +      match against any unique connection ID.
> +
> +    KDBUS_ITEM_ID_ADD
> +    KDBUS_ITEM_ID_REMOVE
> +      These items request delivery of broadcast messages that are generated
> +      when a connection is created or terminated. struct kdbus_notify_id_change
> +      is used to store the actual match information. This item can be used to
> +      monitor one particular connection ID, or, when the id field is set to
> +      KDBUS_MATCH_ID_ANY, all of them.
> +
> +    Items of other types are rejected, and the ioctl will fail with -EINVAL.
> +};
> +
> +
> +10.4 Bloom filters
> +------------------
> +
> +Bloom filters allow checking whether a given word is present in a dictionary.
> +This allows connections to set up a mask for information it is interested in,
> +and will be delivered signal messages that have a matching filter.
> +
> +For general information on bloom filters, see
> +
> +  https://en.wikipedia.org/wiki/Bloom_filter
> +
> +The size of the bloom filter is defined per bus when it is created, in
> +kdbus_bloom_parameter.size. All bloom filters attached to signals on the bus
> +must match this size, and all bloom filter matches uploaded by connections must
> +also match the size, or a multiple thereof (see below).
> +
> +The calculation of the mask has to be done on the userspace side. The kernel
> +just checks the bitmasks to decide whether or not to let the message pass. All
> +bits in the mask must match the filter in and bit-wise AND logic, but the

Parse error at "in and bit-wise AND logic". I am not sure what you are meaning 
there, buth something needs fixing.

> +mask may have more bits set than the filter. Consequently, false positive
> +matches are expected to happen, and userspace must deal with that fact.
> +
> +Masks are entities that are always passed to the kernel as part of a match
> +(with an item of type KDBUS_ITEM_BLOOM_MASK), and filters can be attached to
> +signals, with an item of type KDBUS_ITEM_BLOOM_FILTER.
> +
> +For a filter to match, all its bits have to be set in the match mask as well.
> +For example, consider a bus has a bloom size of 8 bytes, and the following

s/has/that has/

> +mask/filter combinations:
> +
> +    filter  0x0101010101010101
> +    mask    0x0101010101010101
> +            -> matches
> +
> +    filter  0x0303030303030303
> +    mask    0x0101010101010101
> +            -> doesn't match
> +
> +    filter  0x0101010101010101
> +    mask    0x0303030303030303
> +            -> matches
> +
> +Hence, in order to catch all messages, a mask filled with 0xff bytes can be
> +installed as a wildcard match rule.
> +
> +Uploaded matches may contain multiple masks, each of which in the size of the

Parse error at "each of which in the size of"
s/in/is/?

> +bloom size defined by the bus. Each block of a mask is called a 'generation',
> +starting at index 0.
> +
> +At match time, when a signal is about to be delivered, a bloom mask generation
> +is passed, which denotes which of the bloom masks the filter should be matched
> +against. This allows userspace to provide backward compatible masks at upload
> +time, while older clients can still match against older versions of filters.
> +
> +
> +10.5 Removing a match
> +--------------------
> +
> +Matches can be removed through the KDBUS_CMD_MATCH_REMOVE ioctl, which again
> +takes struct kdbus_cmd_match as argument, but its fields are used slightly
> +differently.
> +
> +struct kdbus_cmd_match {
> +  __u64 size;
> +    The overall size of the struct. As it has no items in this use case, the
> +    value should yield 16.

s/yield/contain/ ?
> +
> +  __u64 cookie;
> +    The cookie of the match, as it was passed when the match was added.
> +    All matches that have this cookie will be removed.
> +
> +  __u64 flags;
> +    Unused for this use case,
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 return_flags;
> +    Flags returned by the kernel. Currently unused.

And, so presumably always returned as 0?  Best to note that.

> +
> +  struct kdbus_item items[0];
> +    Unused und not allowed for this use case.
> +};
> +
> +
> +11. Policy
> +===============================================================================
> +
> +A policy databases restrict the possibilities of connections to own, see and
> +talk to well-known names. It can be associated with a bus (through a policy

s/It/A policy/

> +holder connection) or a custom endpoint.
> +
> +See section 8.1 for more details on the validity of well-known names.
> +
> +Default endpoints of buses always have a policy database. The default
> +policy is to deny all operations except for operations that are covered by
> +implicit policies. Custom endpoints always have a policy, and by default,
> +a policy database is empty. Therefore, unless policy rules are added, all

s/a policy database/the policy database/

> +operations will also be denied by default.
> +
> +See section 11.5 for more details on implicit policies.
> +
> +A set of policy rules is described by a name and multiple access rules, defined
> +by the following struct.
> +
> +struct kdbus_policy_access {
> +  __u64 type;	/* USER, GROUP, WORLD */
> +    One of the following.
> +
> +    KDBUS_POLICY_ACCESS_USER
> +      Grant access to a user with the uid stored in the 'id' field.
> +
> +    KDBUS_POLICY_ACCESS_GROUP
> +      Grant access to a user with the gid stored in the 'id' field.
> +
> +    KDBUS_POLICY_ACCESS_WORLD
> +      Grant access to everyone. The 'id' field is ignored.
> +
> +  __u64 access;	/* OWN, TALK, SEE */
> +    The access to grant.
> +
> +    KDBUS_POLICY_SEE
> +      Allow the name to be seen.
> +
> +    KDBUS_POLICY_TALK
> +      Allow the name to be talked to.
> +
> +    KDBUS_POLICY_OWN
> +      Allow the name to be owned.
> +
> +  __u64 id;
> +    For KDBUS_POLICY_ACCESS_USER, stores the uid.
> +    For KDBUS_POLICY_ACCESS_GROUP, stores the gid.
> +};
> +
> +Policies are set through KDBUS_CMD_HELLO (when creating a policy holder
> +connection), KDBUS_CMD_CONN_UPDATE (when updating a policy holder connection),
> +KDBUS_CMD_ENDPOINT_MAKE (creating a custom endpoint) or
> +KDBUS_CMD_ENDPOINT_UPDATE (updating a custom endpoint). In all cases, the name
> +and policy access information is stored in items of type KDBUS_ITEM_NAME and
> +KDBUS_ITEM_POLICY_ACCESS. For this transport, the following rules apply.
> +
> +  * An item of type KDBUS_ITEM_NAME must be followed by at least one
> +    KDBUS_ITEM_POLICY_ACCESS item
> +  * An item of type KDBUS_ITEM_NAME can be followed by an arbitrary number of
> +    KDBUS_ITEM_POLICY_ACCESS items
> +  * An arbitrary number of groups of names and access levels can be passed
> +
> +uids and gids are internally always stored in the kernel's view of global ids,
> +and are translated back and forth on the ioctl level accordingly.
> +
> +
> +11.2 Wildcard names
> +-------------------
> +
> +Policy holder connections may upload names that contain the wild card suffix

s/wild card/wildcard/

> +(".*"). That way, a policy can be uploaded that is effective for every
> +well-known name that extends the provided name by exactly one more level.
> +
> +For example, if an item of a set up uploaded policy rules contains the name

I find that last line very difficult to parse. Someehting is broken. 
What are you trying to say?

> +"foo.bar.*", both "foo.bar.baz" and "foo.bar.bazbaz" are valid, but

s/both/then both/ to help parsing.

> +"foo.bar.baz.baz" is not.
> +
> +This allows connections to take control over multiple names that the policy
> +holder doesn't need to know about when uploading the policy.
> +
> +Such wildcard entries are not allowed for custom endpoints.
> +
> +
> +11.3 Policy example
> +-------------------
> +
> +For example, a set of policy rules may look like this:
> +
> +  KDBUS_ITEM_NAME: str='org.foo.bar'
> +  KDBUS_ITEM_POLICY_ACCESS: type=USER, access=OWN, id=1000
> +  KDBUS_ITEM_POLICY_ACCESS: type=USER, access=TALK, id=1001
> +  KDBUS_ITEM_POLICY_ACCESS: type=WORLD, access=SEE
> +  KDBUS_ITEM_NAME: str='org.blah.baz'
> +  KDBUS_ITEM_POLICY_ACCESS: type=USER, access=OWN, id=0
> +  KDBUS_ITEM_POLICY_ACCESS: type=WORLD, access=TALK
> +
> +That means that 'org.foo.bar' may only be owned by uid 1000, but every user on
> +the bus is allowed to see the name. However, only uid 1001 may actually send
> +a message to the connection and receive a reply from it.
> +
> +The second rule allows 'org.blah.baz' to be owned by uid 0 only, but every user
> +may talk to it.
> +
> +
> +11.4 TALK access and multiple well-known names per connection
> +-------------------------------------------------------------
> +
> +Note that TALK access is checked against all names of a connection.
> +For example, if a connection owns both 'org.foo.bar' and 'org.blah.baz', and
> +the policy database allows 'org.blah.baz' to be talked to by WORLD, then this
> +permission is also granted to 'org.foo.bar'. That might sound illogical, but
> +after all, we allow messages to be directed to either the ID or a well-known
> +name, and policy is applied to the connection, not the name. In other words,
> +the effective TALK policy for a connection is the most permissive of all names
> +the connection owns.
> +
> +For broadcast messages, the receiver needs TALK permissions to the sender to
> +receive the broadcast.
> +
> +If a policy database exists for a bus (because a policy holder created one on
> +demand) or for a custom endpoint (which always has one), each one is consulted

By "created one on demand" do you mean "explicitly created one"? If so
please use the latter wording, since its clearer.

> +during name registry listing, name owning or message delivery. If either one

What is "name owning". I think this could be worded better.

> +fails, the operation is failed with -EPERM.
> +
> +For best practices, connections that own names with a restricted TALK
> +access should not install matches. This avoids cases where the sent
> +message may pass the bloom filter due to false-positives and may also
> +satisfy the policy rules.
> +
> +
> +11.5 Implicit policies
> +----------------------
> +
> +Depending on the type of the endpoint, a set of implicit rules that
> +override installed policies might be enforced.
> +
> +On default endpoints, the following set is enforced and checked before
> +any user-supplied policy is checked.
> +
> +  * Privileged connections always override any installed policy. Those
> +    connections could easily install their own policies, so there is no
> +    reason to enforce installed policies.
> +  * Connections can always talk to connections of the same user. This
> +    includes broadcast messages.
> +
> +Custom endpoints have stricter policies. The following rules apply:
> +
> +  * Policy rules are always enforced, even if the connection is a privileged
> +    connection.
> +  * Policy rules are always enforced for TALK access, even if both ends are
> +    running under the same user. This includes broadcast messages.
> +  * To restrict the set of names that can be seen, endpoint policies can
> +    install "SEE" policies.
> +
> +
> +12. Pool
> +===============================================================================
> +
> +A pool for data received from the kernel is installed for every connection of
> +the bus, and is sized according to the information stored in the
> +KDBUS_ITEM_BLOOM_PARAMETER item that is returned by KDBUS_CMD_HELLO.
> +
> +The pool is written to by the kernel when one of the following ioctls is issued:
> +
> +  * KDBUS_CMD_HELLO, to receive details about the bus the connection was made to
> +  * KDBUS_CMD_RECV, to receive a message
> +  * KDBUS_CMD_NAME_LIST, to dump the name registry
> +  * KDBUS_CMD_CONN_INFO, to retrieve information on a connection
> +
> +The offsets returned by either one of the aforementioned ioctls describe offsets
> +inside the pool. In order to make the slice available for subsequent calls,
> +KDBUS_CMD_FREE has to be called on the offset.
> +
> +To access the memory, the caller is expected to mmap() it to its task, like

s/to its task//

> +this:
> +
> +  /*
> +   * POOL_SIZE has to be a multiple of PAGE_SIZE, and it must match the
> +   * value that was previously returned through the KDBUS_ITEM_BLOOM_PARAMETER
> +   * item field when the KDBUS_CMD_HELLO ioctl returned.
> +   */
> +
> +  buf = mmap(NULL, POOL_SIZE, PROT_READ, MAP_SHARED, conn_fd, 0);
> +
> +Alternatively, instead of mapping the entire pool buffer, only parts of it can
> +be mapped. The length of the response is returned by the kernel along with the
> +offset for each of the ioctls listed above.
> +
> +
> +13. Metadata
> +===============================================================================
> +
> +kdbus records data about the system in certain situations. Such metadata can
> +refer to the currently active process (creds, PIDs, current user groups, process
> +names and its executable path, cgroup membership, capabilities, security label
> +and audit information), connection information (description string, currently
> +owned names) and the timestamp.
> +
> +Metadata is collected in the following circumstances:
> +
> +  * When a bus is created (KDBUS_CMD_MAKE), information about the calling task
> +    is collected. This data is returned by the kernel via the
> +    KDBUS_CMD_BUS_CREATOR_INFO call-

s/-/./

> +
> +  * When a connection is created (KDBUS_CMD_HELLO), information about the
> +    calling task is collected. Alternatively, a privileged connection may
> +    provide 'faked' information about credentials, PIDs and a security labels

s/a security/security/

> +    which will be taken instead. This data is returned by the kernel as
> +    information on a connection (KDBUS_CMD_CONN_INFO).
> +
> +  * When a message is sent (KDBUS_CMD_SEND), information about the sending task
> +    and the sending connection are collected. This metadata will be attached
> +    to the message when it arrives in the receiver's pool. If the connection
> +    sending the message installed faked credentials (see above), the message
> +    will not be augmented by any information about the currently sending task.
> +
> +Which metadata items are actually delivered depends on the following sets and
> +masks:
> +
> +    (a) the system-wide kmod creds mask (module parameter 'attach_flags_mask')
> +    (b) the per-connection send creds mask, set by the connecting client
> +    (c) the per-connection receive creds mask, set by the connecting client
> +    (d) the per-bus minimal creds mask, set by the bus creator
> +    (e) the per-bus owner creds mask, set by the bus creator
> +    (f) the mask specified when querying creds of a bus peer
> +    (g) the mask specified when querying creds of a bus owner
> +
> +With the following rules:
> +
> +    [1] The creds attached to messages are determined as (a & b & c).
> +    [2] When connecting to a bus (KDBUS_CMD_HELLO), and (~b & d) != 0, the call
> +        will fail, the connection is refused.
> +    [3] When querying creds of a bus peer, the creds returned are  (a & b & f)
> +    [4] When querying creds of a bus owner, the creds returned are (a & e & g)
> +    [5] When creating a new bus, and (d & ~a) != 0, then bus creation will fail
> +
> +Hence, userspace might not always get all requested metadata items that it
> +requested. Code must be written so that it can cope with this fact.
> +
> +
> +13.1 Known item types
> +---------------------
> +
> +The following attach flags are currently supported.
> +
> +  KDBUS_ATTACH_TIMESTAMP
> +    Attaches an item of type KDBUS_ITEM_TIMESTAMP which contains both the
> +    monotonic and the realtime timestamp, taken when the message was
> +    processed on the kernel side.
> +
> +  KDBUS_ATTACH_CREDS
> +    Attaches an item of type KDBUS_ITEM_CREDS, containing credentials as
> +    described in struct kdbus_creds: the user and group IDs in the usual four
> +    flavors: real, effective, saved and file-system related.
> +
> +  KDBUS_ATTACH_PIDS
> +    Attaches an item of type KDBUS_ITEM_PIDS, containing information on the
> +    process. In particular, the PID (process ID), TID (thread ID), and PPID
> +    (PID of the parent process).
> +
> +  KDBUS_ATTACH_AUXGROUPS
> +    Attaches an item of type KDBUS_ITEM_AUXGROUPS, containing a dynamic
> +    number of auxiliary groups the sending task was a member of.
> +
> +  KDBUS_ATTACH_NAMES
> +    Attaches items of type KDBUS_ITEM_OWNED_NAME, one for each name the sending
> +    connection currently owns. The name and flags are stored in kdbus_item.name
> +    for each of them.
> +
> +  KDBUS_ATTACH_TID_COMM [*]
> +    Attaches an items of type KDBUS_ITEM_TID_COMM, transporting the sending
> +    task's 'comm', for the tid.  The string is stored in kdbus_item.str.
> +
> +  KDBUS_ATTACH_PID_COMM [*]
> +    Attaches an items of type KDBUS_ITEM_PID_COMM, transporting the sending
> +    task's 'comm', for the pid.  The string is stored in kdbus_item.str.
> +
> +  KDBUS_ATTACH_EXE [*]
> +    Attaches an item of type KDBUS_ITEM_EXE, containing the path to the
> +    executable of the sending task, stored in kdbus_item.str.
> +
> +  KDBUS_ATTACH_CMDLINE [*]
> +    Attaches an item of type KDBUS_ITEM_CMDLINE, containing the command line
> +    arguments of the sending task, as an array of strings, stored in
> +    kdbus_item.str.
> +
> +  KDBUS_ATTACH_CGROUP
> +    Attaches an item of type KDBUS_ITEM_CGROUP with the task's cgroup path.
> +
> +  KDBUS_ATTACH_CAPS
> +    Attaches an item of type KDBUS_ITEM_CAPS, carrying sets of capabilities
> +    that should be accessed via kdbus_item.caps.caps. Also, userspace should
> +    be written in a way that it takes kdbus_item.caps.last_cap into account,
> +    and derive the number of sets and rows from the item size and the reported
> +    number of valid capability bits.
> +
> +  KDBUS_ATTACH_SECLABEL
> +    Attaches an item of type KDBUS_ITEM_SECLABEL, which contains the SELinux
> +    security label of the sending task. SELinux and other MACs might want to
> +    do additional per-service security checks. For example, a service manager
> +    might want to check the security label of a service file against the
> +    security label of the client process checking the SELinux database before
> +    allowing access.  The label can be accessed via kdbus_item->str.
> +
> +  KDBUS_ATTACH_AUDIT
> +    Attaches an item of type KDBUS_ITEM_AUDIT, which contains the audit
> +    sessionid and loginuid of the sending task. Access via kdbus_item->audit.
> +
> +  KDBUS_ATTACH_CONN_DESCRIPTION
> +    Attaches an item of type KDBUS_ITEM_CONN_DESCRIPTION that contains a
> +    descriptive string of the sending peer. That string can be supplied
> +    during HELLO by attaching an item of type KDBUS_ITEM_CONN_DESCRIPTION.
> +
> +
> +[*] Note that the content stored in these items can easily be tampered by
> +    the sending tasks. Therefore, they should NOT be used for any sort of
> +    security relevant assumptions. The only reason why they are transmitted is
> +    to let receivers know about details that were set when metadata was
> +    collected, even though the task they were collected from is not active any
> +    longer when the items are received.
> +
> +
> +13.2 Metadata and namespaces
> +----------------------------
> +
> +Metadata such as PIDs, UIDs or GIDs are automatically translated to the
> +namespaces of the task that receives them.
> +
> +
> +14. Error codes
> +===============================================================================
> +
> +Below is a list of error codes that might be returned by the individual
> +ioctl commands. The list focuses on the return values from kdbus code itself,
> +and might not cover those of all kernel internal functions.
> +
> +For all ioctls:
> +
> +  -ENOMEM	The kernel memory is exhausted
> +  -ENOTTY	Illegal ioctl command issued for the file descriptor

Why ENOTTY here, rather than EINVAL? The latter is, I beleive, the usual 
ioctl() error for invalid commands, I believe (If you keep ENOTTY, add an
explanation in this document.)

> +  -ENOSYS	The requested functionality is not available

Maybe add here an explanation or examples of why the functionality is 
not available?

> +  -EINVAL	Unsupported item attached to command
> +
> +For all ioctls that carry a struct as payload:
> +
> +  -EFAULT	The supplied data pointer was not 64-bit aligned, or was
> +		inaccessible from the kernel side.
> +  -EINVAL	The size inside the supplied struct was smaller than expected
> +  -EMSGSIZE	The size inside the supplied struct was bigger than expected

Why two different errors for smaller and larger than expected? (If you keep things this
way, pelase explain the reason in this document.)

> +  -ENAMETOOLONG	A supplied name is larger than the allowed maximum size
> +
> +For KDBUS_CMD_BUS_MAKE:
> +
> +  -EINVAL	The flags supplied in the kdbus_cmd_make struct are invalid or
> +		the supplied name does not start with the current uid and a '-'
> +  -EEXIST	A bus of that name already exists
> +  -ESHUTDOWN	The domain for the bus is already shut down
> +  -EMFILE	The maximum number of buses for the current user is exhausted
> +
> +For KDBUS_CMD_ENDPOINT_MAKE:
> +
> +  -EPERM	The calling user is not privileged (see Terminology)
> +  -EINVAL	The flags supplied in the kdbus_cmd_make struct are invalid
> +  -EEXIST	An endpoint of that name already exists
> +
> +For KDBUS_CMD_HELLO:
> +
> +  -EFAULT	The supplied pool size was 0 or not a multiple of the page size
> +  -EINVAL	The flags supplied in the kdbus_cmd_make struct are invalid, or
> +		an illegal combination of KDBUS_HELLO_MONITOR,
> +		KDBUS_HELLO_ACTIVATOR and KDBUS_HELLO_POLICY_HOLDER was passed
> +		in the flags, or an invalid set of items was supplied
> +  -ECONNREFUSED	The attach_flags_send field did not satisfy the requirements of
> +		the bus
> +  -EPERM	An KDBUS_ITEM_CREDS items was supplied, but the current user is
> +		not privileged
> +  -ESHUTDOWN	The bus has already been shut down
> +  -EMFILE	The maximum number of connection on the bus has been reached

s/connection/connections/

> +  -EOPNOTSUPP	The endpoint does not support the connection flags
> +		supplied in the kdbus_cmd_hello struct
> +
> +For KDBUS_CMD_BYEBYE:
> +
> +  -EALREADY	The connection has already been shut down
> +  -EBUSY	There are still messages queued up in the connection's pool
> +
> +For KDBUS_CMD_SEND:
> +
> +  -EOPNOTSUPP	The connection is not an ordinary connection, or the passed
> +		file descriptors are either kdbus handles or unix domain
> +		sockets. Both are currently unsupported
> +  -EINVAL	The submitted payload type is KDBUS_PAYLOAD_KERNEL,
> +		KDBUS_MSG_EXPECT_REPLY was set without timeout or cookie
> +		values, KDBUS_MSG_SYNC_REPLY was set without
> +		KDBUS_MSG_EXPECT_REPLY, an invalid item was supplied,
> +		src_id was != 0 and different from the current connection's ID,

s/src_id was != 0 and different/src_id was nonzero and was different/

> +		a supplied memfd had a size of 0, a string was not properly
> +		null-terminated
> +  -ENOTUNIQ	The supplied destination is KDBUS_DST_ID_BROADCAST, a file
> +		descriptor was passed, KDBUS_MSG_EXPECT_REPLY was set,
> +		or a timeout was given for a broadcast message
> +  -E2BIG	Too many items
> +  -EMSGSIZE	The size of the message header and items or the payload vector
> +		is too big.
> +  -EEXIST	Multiple KDBUS_ITEM_FDS, KDBUS_ITEM_BLOOM_FILTER or
> +		KDBUS_ITEM_DST_NAME items were supplied
> +  -EBADF	The supplied KDBUS_ITEM_FDS or KDBUS_MSG_PAYLOAD_MEMFD items
> +		contained an illegal file descriptor
> +  -EMEDIUMTYPE	The supplied memfd is not a sealed kdbus memfd
> +  -EMFILE	Too many file descriptors inside a KDBUS_ITEM_FDS
> +  -EBADMSG	An item had illegal size, both a dst_id and a
> +		KDBUS_ITEM_DST_NAME was given, or both a name and a bloom
> +		filter was given
> +  -ETXTBSY	The supplied kdbus memfd file cannot be sealed or the seal
> +		was removed, because it is shared with other processes or
> +		still mmap()ed
> +  -ECOMM	A peer does not accept the file descriptors addressed to it
> +  -EFAULT	The supplied bloom filter size was not 64-bit aligned
> +  -EDOM		The supplied bloom filter size did not match the bloom filter
> +		size of the bus
> +  -EDESTADDRREQ	dst_id was set to KDBUS_DST_ID_NAME, but no KDBUS_ITEM_DST_NAME
> +		was attached
> +  -ESRCH	The name to look up was not found in the name registry
> +  -EADDRNOTAVAIL KDBUS_MSG_NO_AUTO_START was given but the destination
> +		 connection is an activator.
> +  -ENXIO	The passed numeric destination connection ID couldn't be found,
> +		or is not connected
> +  -ECONNRESET	The destination connection is no longer active
> +  -ETIMEDOUT	Timeout while synchronously waiting for a reply
> +  -EINTR	System call interrupted while synchronously waiting for a reply
> +  -EPIPE	When sending a message, a synchronous reply from the receiving
> +		connection was expected but the connection died before
> +		answering
> +  -ENOBUFS	Too many pending messages on the receiver side
> +  -EREMCHG	Both a well-known name and a unique name (ID) was given, but
> +		the name is not currently owned by that connection.
> +  -EXFULL	The memory pool of the receiver is full
> +  -EREMOTEIO	While synchronously waiting for a reply, the remote peer
> +		failed with an I/O error.
> +
> +For KDBUS_CMD_RECV:
> +
> +  -EINVAL	Invalid flags or offset
> +  -EAGAIN	No message found in the queue
> +  -ENOMSG	No message of the requested priority found
> +  -EOVERFLOW	Broadcast messages have been lost
> +
> +For KDBUS_CMD_FREE:
> +
> +  -ENXIO	No pool slice found at given offset
> +  -EINVAL	Invalid flags provided, the offset is valid, but the user is
> +		not allowed to free the slice. This happens, for example, if
> +		the offset was retrieved with KDBUS_RECV_PEEK.

It would be easier to read if this was written to clarify that there are 
two distinct error cases:

  -EINVAL	Invalid flags provided.
  -EINVAL       The offset is valid, but the user is
		not allowed to free the slice. This happens, for example, if
		the offset was retrieved with KDBUS_RECV_PEEK.

> +For KDBUS_CMD_NAME_ACQUIRE:
> +
> +  -EINVAL	Illegal command flags, illegal name provided, or an activator
> +		tried to acquire a second name
> +  -EPERM	Policy prohibited name ownership
> +  -EALREADY	Connection already owns that name
> +  -EEXIST	The name already exists and can not be taken over
> +  -E2BIG	The maximum number of well-known names per connection
> +		is exhausted
> +  -ECONNRESET	The connection was reset during the call
> +
> +For KDBUS_CMD_NAME_RELEASE:
> +
> +  -EINVAL	Invalid command flags, or invalid name provided
> +  -ESRCH	Name is not found found in the registry
> +  -EADDRINUSE	Name is owned by a different connection and can't be released
> +
> +For KDBUS_CMD_NAME_LIST:
> +
> +  -EINVAL	Invalid flags
> +  -ENOBUFS	No available memory in the connection's pool.
> +
> +For KDBUS_CMD_CONN_INFO:
> +
> +  -EINVAL	Invalid flags, or neither an ID nor a name was provided,
> +		or the name is invalid.
> +  -ESRCH	Connection lookup by name failed
> +  -ENXIO	No connection with the provided connection ID found
> +
> +For KDBUS_CMD_CONN_UPDATE:
> +
> +  -EINVAL	Illegal flags or items
> +  -EOPNOTSUPP	Operation not supported by connection.
> +  -E2BIG	Too many policy items attached
> +  -EINVAL	Wildcards submitted in policy entries, or illegal sequence
> +		of policy items
> +
> +For KDBUS_CMD_ENDPOINT_UPDATE:
> +
> +  -E2BIG	Too many policy items attached
> +  -EINVAL	Invalid flags, or wildcards submitted in policy entries,
> +		or illegal sequence of policy items
> +
> +For KDBUS_CMD_MATCH_ADD:
> +
> +  -EINVAL	Illegal flags or items
> +  -EDOM		Illegal bloom filter size
> +  -EMFILE	Too many matches for this connection
> +
> +For KDBUS_CMD_MATCH_REMOVE:
> +
> +  -EINVAL	Illegal flags
> +  -ENOENT	A match entry with the given cookie could not be found.
> +
> +
> +15. Internal object relations
> +===============================================================================
> +
> +This is a simplified outline of the internal kdbus object relations, for
> +those interested in the inner life of the driver implementation.
> +
> +From the a mount point's (domain's) perspective:
> +
> +struct kdbus_domain
> +  |» struct kdbus_domain_user *user (many, owned)
> +  '» struct kdbus_node node (embedded)
> +      |» struct kdbus_node children (many, referenced)
> +      |» struct kdbus_node *parent (pinned)
> +      '» struct kdbus_bus (many, pinned)
> +          |» struct kdbus_node node (embedded)
> +          '» struct kdbus_ep (many, pinned)
> +              |» struct kdbus_node node (embedded)
> +              |» struct kdbus_bus *bus (pinned)
> +              |» struct kdbus_conn conn_list (many, pinned)
> +              |   |» struct kdbus_ep *ep (pinned)
> +              |   |» struct kdbus_name_entry *activator_of (owned)
> +              |   |» struct kdbus_match_db *match_db (owned)
> +              |   |» struct kdbus_meta *meta (owned)
> +              |   |» struct kdbus_match_db *match_db (owned)
> +              |   |    '» struct kdbus_match_entry (many, owned)
> +              |   |
> +              |   |» struct kdbus_pool *pool (owned)
> +              |   |    '» struct kdbus_pool_slice *slices (many, owned)
> +              |   |       '» struct kdbus_pool *pool (pinned)
> +              |   |
> +              |   |» struct kdbus_domain_user *user (pinned)
> +              |   `» struct kdbus_queue_entry entries (many, embedded)
> +              |        |» struct kdbus_pool_slice *slice (pinned)
> +              |        |» struct kdbus_conn_reply *reply (owned)
> +              |        '» struct kdbus_domain_user *user (pinned)
> +              |
> +              '» struct kdbus_domain_user *user (pinned)
> +                  '» struct kdbus_policy_db policy_db (embedded)
> +                       |» struct kdbus_policy_db_entry (many, owned)
> +                       |   |» struct kdbus_conn (pinned)
> +                       |   '» struct kdbus_ep (pinned)
> +                       |
> +                       '» struct kdbus_policy_db_cache_entry (many, owned)
> +                           '» struct kdbus_conn (pinned)
> +
> +
> +For the life-time of a file descriptor derived from calling open() on a file
> +inside the mount point:
> +
> +struct kdbus_handle
> +  |» struct kdbus_meta *meta (owned)
> +  |» struct kdbus_ep *ep (pinned)
> +  |» struct kdbus_conn *conn (owned)
> +  '» struct kdbus_ep *ep (owned)

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH v3 00/13] Add kdbus implementation
From: Michael Kerrisk (man-pages) @ 2015-01-20 14:05 UTC (permalink / raw)
  To: Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	daniel-cYrQPVfZooxQFI55V6+gNQ, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <1421435777-25306-1-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> kdbus is a kernel-level IPC implementation that aims for resemblance to
> the the protocol layer with the existing userspace D-Bus daemon while
> enabling some features that couldn't be implemented before in userspace.
> 
> The documentation in the first patch in this series explains the
> protocol and the API details.
> 
> Full details on what has changed from the v2 submission are at the
> bottom of this email.
> 
> Reasons why this should be done in the kernel, instead of userspace as
> it is currently done today include the following:
> 
> - performance: fewer process context switches, fewer copies, fewer
>   syscalls, larger memory chunks via memfd.  This is really important
>   for a whole class of userspace programs that are ported from other
>   operating systems that are run on tiny ARM systems that rely on
>   hundreds of thousands of messages passed at boot time, and at
>   "critical" times in their user interaction loops.
> - security: the peers which communicate do not have to trust each other,
>   as the only trustworthy compoenent in the game is the kernel which
>   adds metadata and ensures that all data passed as payload is either
>   copied or sealed, so that the receiver can parse the data without
>   having to protect against changing memory while parsing buffers.  Also,
>   all the data transfer is controlled by the kernel, so that LSMs can
>   track and control what is going on, without involving userspace.
>   Because of the LSM issue, security people are much happier with this
>   model than the current scheme of having to hook into dbus to mediate
>   things.
> - more metadata can be attached to messages than in userspace
> - semantics for apps with heavy data payloads (media apps, for instance)
>   with optinal priority message dequeuing, and global message ordering.
>   Some "crazy" people are playing with using kdbus for audio data in the
>   system.  I'm not saying that this is the best model for this, but
>   until now, there wasn't any other way to do this without having to
>   create custom "busses", one for each application library.
> - being in the kernle closes a lot of races which can't be fixed with
>   the current userspace solutions.  For example, with kdbus, there is a
>   way a client can disconnect from a bus, but do so only if no further
>   messages present in its queue, which is crucial for implementing
>   race-free "exit-on-idle" services
> - eavesdropping on the kernel level, so privileged users can hook into
>   the message stream without hacking support for that into their
>   userspace processes
> - a number of smaller benefits: for example kdbus learned a way to peek
>   full messages without dequeing them, which is really useful for
>   logging metadata when handling bus-activation requests.
> 
> Of course, some of the bits above could be implemented in userspace
> alone, for example with more sophisticated memory management APIs, but
> this is usually done by losing out on the other details.  For example,
> for many of the memory management APIs, it's hard to not require the
> communicating peers to fully trust each other.  And we _really_ don't
> want peers to have to trust each other.
> 
> Another benefit of having this in the kernel, rather than as a userspace
> daemon, is that you can now easily use the bus from the initrd, or up to
> the very end when the system shuts down.  On current userspace D-Bus,
> this is not really possible, as this requires passing the bus instance
> around between initrd and the "real" system.  Such a transition of all
> fds also requires keeping full state of what has already been read from
> the connection fds.  kdbus makes this much simpler, as we can change the
> ownership of the bus, just by passing one fd over from one part to the
> other.

I tend to think that much of the above should also be part of the 
documentation file (patch 01/13).

Cheers,

Michael


 
> Regarding binder: binder and kdbus follow very different design
> concepts.  Binder implies the use of thread-pools to dispatch incoming
> method calls.  This is a very efficient scheme, and completely natural
> in programming languages like Java.  On most Linux programs, however,
> there's a much stronger focus on central poll() loops that dispatch all
> sources a program cares about.  kdbus is much more usable in such
> environments, as it doesn't enforce a threading model, and it is happy
> with serialized dispatching.  In fact, this major difference had an
> effect on much of the design decisions: binder does not guarantee global
> message ordering due to the parallel dispatching in the thread-pools,
> but  kdbus does.  Moreover, there's also a difference in the way message
> handling.  In kdbus, every message is basically taken and dispatched as
> one blob, while in binder, continious connections to other peers are
> created, which are then used to send messages on.  Hence, the models are
> quite different, and they serve different needs.  I believe that the
> D-Bus/kdbus model is more compatible and friendly with how Linux
> programs are usually implemented.
> 
> This can also be found in a git tree, the kdbus branch of char-misc.git at:
>         https://git.kernel.org/cgit/linux/kernel/git/gregkh/char-misc.git/
> 
> Changes since v2:
> 
>   * Add FS_USERNS_MOUNT to the file system flags, so users can mount
>     their own kdbusfs instances without being root in the parent
>     user-ns. Spotted by Andy Lutomirski.
> 
>   * Rewrite major parts of the metadata implementation to allow for
>     per-recipient namespace translations. For this, namespaces are
>     now not pinned by domains anymore. Instead, metadata is recorded
>     in kernel scope, and exported into the currently active namespaces
>     at the time of message installing.
> 
>   * Split PID and TID from KDBUS_ITEM_CREDS into KDBUS_ITEM_PIDS.
>     The starttime is there to detect re-used PIDs, so move it to that
>     new item type as well. Consequently, introduce struct kdbus_pids
>     to accommodate the information. Requested by Andy Lutomirski.
> 
>   * Add {e,s,fs}{u,g}id to KDBUS_ITEM_CREDS, so users have a way to
>     get more fine-grained credential information.
> 
>   * Removed KDBUS_CMD_CANCEL. The interface was not usable from
>     threaded userspace implementation due to inherent races. Instead,
>     add an item type CANCEL_FD which can be used to pass a file
>     descriptor to the CMD_SEND ioctl. When the SEND is done
>     synchronously, it will get cancelled as soon as the passed
>     FD signals POLLIN.
> 
>   * Dropped startttime from KDBUS_ITEM_PIDS
> 
>   * Restrict names of custom endpoints to names with a "<uid>-" prefix,
>     just like we do for buses.
> 
>   * Provide module-parameter "kdbus.attach_flags_mask" to specify the
>     a mask of metadata items that is applied on all exported items.
> 
>   * Monitors are now entirely invisible (IOW, there won't be any
>     notification when they are created) and they don't need to install
>     filters for broadcast messages anymore.
> 
>   * All information exposed via a connection's pool now also reports
>     the length in addition to the offset. That way, userspace
>     applications can mmap() only parts of the pool on demand.
> 
>   * Due to the metadata rework, KDBUS_ITEM_PAYLOAD_OFF items now
>     describe the offset relative to the pool, where they used to be
>     relative to the message header.
> 
>   * Added return_flags bitmask to all kdbus_cmd_* structs, so the
>     kernel can report details of the command processing. This is
>     mostly reserved for future extensions.
> 
>   * Some fixes in kdbus.txt and tests, spotted by Harald Hoyer, Andy
>     Lutomirski, Michele Curti, Sergei Zviagintsev, Sheng Yong, Torstein
>     Husebø and Hristo Venev.
> 
>   * Fixed compiler warnings in test-message by Michele Curti
> 
>   * Unexpected items are now rejected with -EINVAL
> 
>   * Split signal and broadcast handling. Unicast signals are now
>     supported, and messages have a new KDBUS_MSG_SIGNAL flag.
> 
>   * KDBUS_CMD_MSG_SEND was renamed to KDBUS_CMD_SEND, and now takes
>     a struct kdbus_cmd_send instead of a kdbus_msg.
> 
>   * KDBUS_CMD_MSG_RECV was renamed to KDBUS_CMD_RECV.
> 
>   * Test case memory leak plugged, and various other cleanups and
>     fixes, by Rui Miguel Silva.
> 
>   * Build fix for s390
> 
>   * Test case fix for 32bit archs
> 
>   * The test framework now supports mount, pid and user namespaces.
> 
>   * The test framework learned a --tap command line parameter to
>     format its output in the "Test Anything Protocol". This format
>     is chosen by default when "make kselftest" is invoked.
> 
>   * Fixed buses and custom endpoints name validation, reported by
>     Andy Lutomirski.
> 
>   * copy_from_user() return code issue fixed, reported by
>     Dan Carpenter.
> 
>   * Avoid signed int overflow on archs without atomic_sub
> 
>   * Avoid variable size stack items. Fixes a sparse warning in queue.c.
> 
>   * New test case for kernel notification quota
> 
>   * Switched back to enums for the list of ioctls. This has advantages
>     for userspace code as gdb, for instance, is able to resolve the
>     numbers into names. Added features can easily be detected with
>     autotools, and new iotcls can get #defines as well. Having #defines
>     for the initial set of ioctls is uncecessary.
> 
> Daniel Mack (13):
>   kdbus: add documentation
>   kdbus: add header file
>   kdbus: add driver skeleton, ioctl entry points and utility functions
>   kdbus: add connection pool implementation
>   kdbus: add connection, queue handling and message validation code
>   kdbus: add node and filesystem implementation
>   kdbus: add code to gather metadata
>   kdbus: add code for notifications and matches
>   kdbus: add code for buses, domains and endpoints
>   kdbus: add name registry implementation
>   kdbus: add policy database implementation
>   kdbus: add Makefile, Kconfig and MAINTAINERS entry
>   kdbus: add selftests
> 
>  Documentation/ioctl/ioctl-number.txt              |    1 +
>  Documentation/kdbus.txt                           | 2107 +++++++++++++++++++++
>  MAINTAINERS                                       |   12 +
>  include/uapi/linux/Kbuild                         |    1 +
>  include/uapi/linux/kdbus.h                        | 1049 ++++++++++
>  include/uapi/linux/magic.h                        |    2 +
>  init/Kconfig                                      |   12 +
>  ipc/Makefile                                      |    2 +-
>  ipc/kdbus/Makefile                                |   22 +
>  ipc/kdbus/bus.c                                   |  553 ++++++
>  ipc/kdbus/bus.h                                   |  103 +
>  ipc/kdbus/connection.c                            | 2004 ++++++++++++++++++++
>  ipc/kdbus/connection.h                            |  262 +++
>  ipc/kdbus/domain.c                                |  350 ++++
>  ipc/kdbus/domain.h                                |   84 +
>  ipc/kdbus/endpoint.c                              |  232 +++
>  ipc/kdbus/endpoint.h                              |   68 +
>  ipc/kdbus/fs.c                                    |  519 +++++
>  ipc/kdbus/fs.h                                    |   25 +
>  ipc/kdbus/handle.c                                | 1134 +++++++++++
>  ipc/kdbus/handle.h                                |   20 +
>  ipc/kdbus/item.c                                  |  309 +++
>  ipc/kdbus/item.h                                  |   57 +
>  ipc/kdbus/limits.h                                |   95 +
>  ipc/kdbus/main.c                                  |   72 +
>  ipc/kdbus/match.c                                 |  535 ++++++
>  ipc/kdbus/match.h                                 |   32 +
>  ipc/kdbus/message.c                               |  598 ++++++
>  ipc/kdbus/message.h                               |  133 ++
>  ipc/kdbus/metadata.c                              | 1066 +++++++++++
>  ipc/kdbus/metadata.h                              |   52 +
>  ipc/kdbus/names.c                                 |  891 +++++++++
>  ipc/kdbus/names.h                                 |   82 +
>  ipc/kdbus/node.c                                  |  910 +++++++++
>  ipc/kdbus/node.h                                  |   87 +
>  ipc/kdbus/notify.c                                |  244 +++
>  ipc/kdbus/notify.h                                |   30 +
>  ipc/kdbus/policy.c                                |  481 +++++
>  ipc/kdbus/policy.h                                |   51 +
>  ipc/kdbus/pool.c                                  |  784 ++++++++
>  ipc/kdbus/pool.h                                  |   47 +
>  ipc/kdbus/queue.c                                 |  505 +++++
>  ipc/kdbus/queue.h                                 |  108 ++
>  ipc/kdbus/reply.c                                 |  262 +++
>  ipc/kdbus/reply.h                                 |   68 +
>  ipc/kdbus/util.c                                  |  317 ++++
>  ipc/kdbus/util.h                                  |  133 ++
>  tools/testing/selftests/Makefile                  |    1 +
>  tools/testing/selftests/kdbus/.gitignore          |   11 +
>  tools/testing/selftests/kdbus/Makefile            |   46 +
>  tools/testing/selftests/kdbus/kdbus-enum.c        |   95 +
>  tools/testing/selftests/kdbus/kdbus-enum.h        |   14 +
>  tools/testing/selftests/kdbus/kdbus-test.c        |  920 +++++++++
>  tools/testing/selftests/kdbus/kdbus-test.h        |   85 +
>  tools/testing/selftests/kdbus/kdbus-util.c        | 1646 ++++++++++++++++
>  tools/testing/selftests/kdbus/kdbus-util.h        |  216 +++
>  tools/testing/selftests/kdbus/test-activator.c    |  319 ++++
>  tools/testing/selftests/kdbus/test-attach-flags.c |  751 ++++++++
>  tools/testing/selftests/kdbus/test-benchmark.c    |  427 +++++
>  tools/testing/selftests/kdbus/test-bus.c          |  174 ++
>  tools/testing/selftests/kdbus/test-chat.c         |  123 ++
>  tools/testing/selftests/kdbus/test-connection.c   |  611 ++++++
>  tools/testing/selftests/kdbus/test-daemon.c       |   66 +
>  tools/testing/selftests/kdbus/test-endpoint.c     |  344 ++++
>  tools/testing/selftests/kdbus/test-fd.c           |  710 +++++++
>  tools/testing/selftests/kdbus/test-free.c         |   36 +
>  tools/testing/selftests/kdbus/test-match.c        |  442 +++++
>  tools/testing/selftests/kdbus/test-message.c      |  658 +++++++
>  tools/testing/selftests/kdbus/test-metadata-ns.c  |  507 +++++
>  tools/testing/selftests/kdbus/test-monitor.c      |  158 ++
>  tools/testing/selftests/kdbus/test-names.c        |  184 ++
>  tools/testing/selftests/kdbus/test-policy-ns.c    |  633 +++++++
>  tools/testing/selftests/kdbus/test-policy-priv.c  | 1270 +++++++++++++
>  tools/testing/selftests/kdbus/test-policy.c       |   81 +
>  tools/testing/selftests/kdbus/test-race.c         |  313 +++
>  tools/testing/selftests/kdbus/test-sync.c         |  368 ++++
>  tools/testing/selftests/kdbus/test-timeout.c      |   99 +
>  77 files changed, 27818 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/kdbus.txt
>  create mode 100644 include/uapi/linux/kdbus.h
>  create mode 100644 ipc/kdbus/Makefile
>  create mode 100644 ipc/kdbus/bus.c
>  create mode 100644 ipc/kdbus/bus.h
>  create mode 100644 ipc/kdbus/connection.c
>  create mode 100644 ipc/kdbus/connection.h
>  create mode 100644 ipc/kdbus/domain.c
>  create mode 100644 ipc/kdbus/domain.h
>  create mode 100644 ipc/kdbus/endpoint.c
>  create mode 100644 ipc/kdbus/endpoint.h
>  create mode 100644 ipc/kdbus/fs.c
>  create mode 100644 ipc/kdbus/fs.h
>  create mode 100644 ipc/kdbus/handle.c
>  create mode 100644 ipc/kdbus/handle.h
>  create mode 100644 ipc/kdbus/item.c
>  create mode 100644 ipc/kdbus/item.h
>  create mode 100644 ipc/kdbus/limits.h
>  create mode 100644 ipc/kdbus/main.c
>  create mode 100644 ipc/kdbus/match.c
>  create mode 100644 ipc/kdbus/match.h
>  create mode 100644 ipc/kdbus/message.c
>  create mode 100644 ipc/kdbus/message.h
>  create mode 100644 ipc/kdbus/metadata.c
>  create mode 100644 ipc/kdbus/metadata.h
>  create mode 100644 ipc/kdbus/names.c
>  create mode 100644 ipc/kdbus/names.h
>  create mode 100644 ipc/kdbus/node.c
>  create mode 100644 ipc/kdbus/node.h
>  create mode 100644 ipc/kdbus/notify.c
>  create mode 100644 ipc/kdbus/notify.h
>  create mode 100644 ipc/kdbus/policy.c
>  create mode 100644 ipc/kdbus/policy.h
>  create mode 100644 ipc/kdbus/pool.c
>  create mode 100644 ipc/kdbus/pool.h
>  create mode 100644 ipc/kdbus/queue.c
>  create mode 100644 ipc/kdbus/queue.h
>  create mode 100644 ipc/kdbus/reply.c
>  create mode 100644 ipc/kdbus/reply.h
>  create mode 100644 ipc/kdbus/util.c
>  create mode 100644 ipc/kdbus/util.h
>  create mode 100644 tools/testing/selftests/kdbus/.gitignore
>  create mode 100644 tools/testing/selftests/kdbus/Makefile
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-enum.c
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-enum.h
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-test.c
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-test.h
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-util.c
>  create mode 100644 tools/testing/selftests/kdbus/kdbus-util.h
>  create mode 100644 tools/testing/selftests/kdbus/test-activator.c
>  create mode 100644 tools/testing/selftests/kdbus/test-attach-flags.c
>  create mode 100644 tools/testing/selftests/kdbus/test-benchmark.c
>  create mode 100644 tools/testing/selftests/kdbus/test-bus.c
>  create mode 100644 tools/testing/selftests/kdbus/test-chat.c
>  create mode 100644 tools/testing/selftests/kdbus/test-connection.c
>  create mode 100644 tools/testing/selftests/kdbus/test-daemon.c
>  create mode 100644 tools/testing/selftests/kdbus/test-endpoint.c
>  create mode 100644 tools/testing/selftests/kdbus/test-fd.c
>  create mode 100644 tools/testing/selftests/kdbus/test-free.c
>  create mode 100644 tools/testing/selftests/kdbus/test-match.c
>  create mode 100644 tools/testing/selftests/kdbus/test-message.c
>  create mode 100644 tools/testing/selftests/kdbus/test-metadata-ns.c
>  create mode 100644 tools/testing/selftests/kdbus/test-monitor.c
>  create mode 100644 tools/testing/selftests/kdbus/test-names.c
>  create mode 100644 tools/testing/selftests/kdbus/test-policy-ns.c
>  create mode 100644 tools/testing/selftests/kdbus/test-policy-priv.c
>  create mode 100644 tools/testing/selftests/kdbus/test-policy.c
>  create mode 100644 tools/testing/selftests/kdbus/test-race.c
>  create mode 100644 tools/testing/selftests/kdbus/test-sync.c
>  create mode 100644 tools/testing/selftests/kdbus/test-timeout.c
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH v3 00/13] Add kdbus implementation
From: Michael Kerrisk (man-pages) @ 2015-01-20 14:12 UTC (permalink / raw)
  To: Johannes Stezenbach, Greg Kroah-Hartman
  Cc: mtk.manpages, arnd, ebiederm, gnomes, teg, jkosina, luto,
	linux-api, linux-kernel, daniel, dh.herrmann, tixxdz
In-Reply-To: <20150120132437.GB7545@sig21.net>

On 01/20/2015 02:24 PM, Johannes Stezenbach wrote:
> On Tue, Jan 20, 2015 at 07:26:09PM +0800, Greg Kroah-Hartman wrote:
>> On Tue, Jan 20, 2015 at 11:57:12AM +0100, Johannes Stezenbach wrote:

>>> My guess is that the people porting from QNX were just confused
>>> and their use of D-Bus was in error.  Maybe they should've used
>>> plain sockets, capnproto, ZeroMQ or whatever.
>>
>> I tend to trust that they knew what they were doing, they wouldn't have
>> picked D-Bus for no good reason.
> 
> The automotive developers I had the pleasure to work with would
> use anything which is available via a mouse click in the
> commercial Embedded Linux SDK IDE of their choice :)
> Let's face it: QNX has a single IPC solution while Linux has
> a confusing multitude of possibilities.

Greg, from my spell in IVI, I too have to say your faith in the
wisdom of IVI developers' choices is touching. I think D-Bus was 
in the main picked because it had some nice features, but then 
people realized it had no bandwidth, and the solution has been 
"make D-Bus faster", rather than "maybe we should explore 
other (mixed model) solutions". This isn't to say that I'm
against adding kdbus, but I don't think there's much strength to
the argument you make above.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox