Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>

This adds a new system call, epoll_mod_wait. It's described as below:

NAME
       epoll_mod_wait - modify and wait for I/O events on an epoll file
                        descriptor

SYNOPSIS

       int epoll_mod_wait(int epfd, int flags,
                          int ncmds, struct epoll_mod_cmd *cmds,
                          struct epoll_wait_spec *spec);

DESCRIPTION

       The epoll_mod_wait() system call can be seen as an enhanced combination
       of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
       call. It is superior in two cases:
       
       1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
       will save context switches between user mode and kernel mode;
       
       2) When you need higher precision than microsecond for wait timeout.

       The epoll_ctl(2) operations are embedded into this call by with ncmds
       and cmds. The latter is an array of command structs:

           struct epoll_mod_cmd {

                  /* Reserved flags for future extension, must be 0 for now. */
                  int flags;

                  /* The same as epoll_ctl() op parameter. */
                  int op;

                  /* The same as epoll_ctl() fd parameter. */
                  int fd;

                  /* The same as the "events" field in struct epoll_event. */
                  uint32_t events;

                  /* The same as the "data" field in struct epoll_event. */
                  uint64_t data;

                  /* Output field, will be set to the return code once this
                   * command is executed by kernel */
                  int error;
           };
       
       There is no guartantee that all the commands are executed in order. Only
       if all the commands are successfully executed (all the error fields are
       set to 0), events are polled.

       The last parameter "spec" is a pointer to struct epoll_wait_spec, which
       contains the information about how to poll the events. If it's NULL, this
       call will immediately return after running all the commands in cmds.

       The structure is defined as below:

           struct epoll_wait_spec {

                  /* The same as "maxevents" in epoll_pwait() */
                  int maxevents;

                  /* The same as "events" in epoll_pwait() */
                  struct epoll_event *events;

                  /* Which clock to use for timeout */
                  int clockid;

                  /* Maximum time to wait if there is no event */
                  struct timespec timeout;

                  /* The same as "sigmask" in epoll_pwait() */
                  sigset_t *sigmask;

                  /* The same as "sigsetsize" in epoll_pwait() */
                  size_t sigsetsize;
           } EPOLL_PACKED;

RETURN VALUE

       When any error occurs, epoll_mod_wait() returns -1 and errno is set
       appropriately. All the "error" fields in cmds are unchanged before they
       are executed, and if any cmds are executed, the "error" fields are set
       to a return code accordingly. See also epoll_ctl for more details of the
       return code.

       When successful, epoll_mod_wait() returns the number of file
       descriptors ready for the requested I/O, or zero if no file descriptor
       became ready during the requested timeout milliseconds.

       If spec is NULL, it returns 0 if all the commands are successful, and -1
       if an error occured.

ERRORS

       These errors apply on either the return value of epoll_mod_wait or error
       status for each command, respectively.

       EBADF  epfd or fd is not a valid file descriptor.

       EFAULT The memory area pointed to by events is not accessible with write
              permissions.

       EINTR  The call was interrupted by a signal handler before either any of
              the requested events occurred or the timeout expired; see
              signal(7).

       EINVAL epfd is not an epoll file descriptor, or maxevents is less than
              or equal to zero, or fd is the same as epfd, or the requested
              operation op is not supported by this interface.

       EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
              already registered with this epoll instance.

       ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
              with this epoll instance.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
              encountered while trying to register (EPOLL_CTL_ADD) a new file
              descriptor on an epoll instance.  See epoll(7) for further
              details.

       EPERM  The target file fd does not support epoll.

CONFORMING TO

       epoll_mod_wait() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

Fam Zheng (6):
  epoll: Extract epoll_wait_do and epoll_pwait_do
  epoll: Specify clockid explicitly
  epoll: Add definition for epoll_mod_wait structures
  epoll: Extract ep_ctl_do
  epoll: Add implementation for epoll_mod_wait
  x86: Hook up epoll_mod_wait syscall

 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/eventpoll.c                   | 219 +++++++++++++++++++++++++--------------
 include/linux/syscalls.h         |   5 +
 include/uapi/linux/eventpoll.h   |  20 ++++
 5 files changed, 167 insertions(+), 79 deletions(-)

-- 
1.9.3

^ permalink raw reply

* [PATCH RFC 1/6] epoll: Extract epoll_wait_do and epoll_pwait_do
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1421747878-30744-1-git-send-email-famz@redhat.com>

In preparation of epoll_mod_wait, this patch allows reusing the code from
epoll_pwait implementation. The new functions uses ktime_t for more accuracy.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 130 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 59 insertions(+), 71 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..4cf359d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1554,17 +1554,6 @@ static int ep_send_events(struct eventpoll *ep,
 	return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
 }
 
-static inline struct timespec ep_set_mstimeout(long ms)
-{
-	struct timespec now, ts = {
-		.tv_sec = ms / MSEC_PER_SEC,
-		.tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
-	};
-
-	ktime_get_ts(&now);
-	return timespec_add_safe(now, ts);
-}
-
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller supplied
  *           event buffer.
@@ -1573,17 +1562,15 @@ static inline struct timespec ep_set_mstimeout(long ms)
  * @events: Pointer to the userspace buffer where the ready events should be
  *          stored.
  * @maxevents: Size (in terms of number of events) of the caller event buffer.
- * @timeout: Maximum timeout for the ready events fetch operation, in
- *           milliseconds. If the @timeout is zero, the function will not block,
- *           while if the @timeout is less than zero, the function will block
- *           until at least one event has been retrieved (or an error
- *           occurred).
+ * @timeout: Maximum timeout for the ready events fetch operation.  If 0, the
+ *           function will not block. If negative, the function will block until
+ *           at least one event has been retrieved (or an error occurred).
  *
  * Returns: Returns the number of ready events which have been fetched, or an
  *          error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, long timeout)
+		   int maxevents, const ktime_t timeout)
 {
 	int res = 0, eavail, timed_out = 0;
 	unsigned long flags;
@@ -1591,13 +1578,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	wait_queue_t wait;
 	ktime_t expires, *to = NULL;
 
-	if (timeout > 0) {
-		struct timespec end_time = ep_set_mstimeout(timeout);
-
-		slack = select_estimate_accuracy(&end_time);
-		to = &expires;
-		*to = timespec_to_ktime(end_time);
-	} else if (timeout == 0) {
+	if (!ktime_to_ns(timeout)) {
 		/*
 		 * Avoid the unnecessary trip to the wait queue loop, if the
 		 * caller specified a non blocking operation.
@@ -1605,6 +1586,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		timed_out = 1;
 		spin_lock_irqsave(&ep->lock, flags);
 		goto check_events;
+	} else if (ktime_to_ns(timeout) > 0) {
+		struct timespec now, end_time;
+
+		ktime_get_ts(&now);
+		end_time = timespec_add_safe(now, ktime_to_timespec(timeout));
+
+		slack = select_estimate_accuracy(&end_time);
+		to = &expires;
+		*to = timespec_to_ktime(end_time);
 	}
 
 fetch_events:
@@ -1954,12 +1944,8 @@ error_return:
 	return error;
 }
 
-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout)
+static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
+				int maxevents, const ktime_t timeout)
 {
 	int error;
 	struct fd f;
@@ -2002,29 +1988,32 @@ error_fput:
 
 /*
  * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_pwait(2).
+ * part of the user space epoll_wait(2).
  */
-SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
-		size_t, sigsetsize)
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout)
+{
+	ktime_t kt = ms_to_ktime(timeout);
+	return epoll_wait_do(epfd, events, maxevents, kt);
+}
+
+static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
+				 int maxevents, ktime_t timeout,
+				 sigset_t *sigmask, size_t sigsetsize)
 {
 	int error;
-	sigset_t ksigmask, sigsaved;
+	sigset_t sigsaved;
 
 	/*
 	 * If the caller wants a certain signal mask to be set during the wait,
 	 * we apply it here.
 	 */
 	if (sigmask) {
-		if (sigsetsize != sizeof(sigset_t))
-			return -EINVAL;
-		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
-			return -EFAULT;
 		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
+		set_current_blocked(sigmask);
 	}
 
-	error = sys_epoll_wait(epfd, events, maxevents, timeout);
+	error = epoll_wait_do(epfd, events, maxevents, timeout);
 
 	/*
 	 * If we changed the signal mask, we need to restore the original one.
@@ -2044,49 +2033,48 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 	return error;
 }
 
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_pwait(2).
+ */
+SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
+		size_t, sigsetsize)
+{
+	ktime_t kt = ms_to_ktime(timeout);
+	sigset_t ksigmask;
+
+	if (sigmask) {
+		if (sigsetsize != sizeof(sigset_t))
+			return -EINVAL;
+		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
+			return -EFAULT;
+	}
+	return epoll_pwait_do(epfd, events, maxevents, kt,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
-			struct epoll_event __user *, events,
-			int, maxevents, int, timeout,
-			const compat_sigset_t __user *, sigmask,
-			compat_size_t, sigsetsize)
+		       struct epoll_event __user *, events,
+		       int, maxevents, int, timeout,
+		       const compat_sigset_t __user *, sigmask,
+		       compat_size_t, sigsetsize)
 {
-	long err;
 	compat_sigset_t csigmask;
-	sigset_t ksigmask, sigsaved;
+	sigset_t ksigmask;
+	ktime_t kt = ms_to_ktime(timeout);
 
-	/*
-	 * If the caller wants a certain signal mask to be set during the wait,
-	 * we apply it here.
-	 */
 	if (sigmask) {
 		if (sigsetsize != sizeof(compat_sigset_t))
 			return -EINVAL;
 		if (copy_from_user(&csigmask, sigmask, sizeof(csigmask)))
 			return -EFAULT;
 		sigset_from_compat(&ksigmask, &csigmask);
-		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
-	}
-
-	err = sys_epoll_wait(epfd, events, maxevents, timeout);
-
-	/*
-	 * If we changed the signal mask, we need to restore the original one.
-	 * In case we've got a signal while waiting, we do not restore the
-	 * signal mask yet, and we allow do_signal() to deliver the signal on
-	 * the way back to userspace, before the signal mask is restored.
-	 */
-	if (sigmask) {
-		if (err == -EINTR) {
-			memcpy(&current->saved_sigmask, &sigsaved,
-			       sizeof(sigsaved));
-			set_restore_sigmask();
-		} else
-			set_current_blocked(&sigsaved);
 	}
 
-	return err;
+	return epoll_pwait_do(epfd, events, maxevents, kt,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
 }
 #endif
 
-- 
1.9.3


^ permalink raw reply related

* [PATCH RFC 2/6] epoll: Specify clockid explicitly
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1421747878-30744-1-git-send-email-famz@redhat.com>

Later we will add clockid in the interface, so let's start using explicit
clockid internally. Now we specify CLOCK_MONOTONIC, which is the same as before.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4cf359d..6da143f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1570,7 +1570,7 @@ static int ep_send_events(struct eventpoll *ep,
  *          error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, const ktime_t timeout)
+		   int maxevents, int clockid, const ktime_t timeout)
 {
 	int res = 0, eavail, timed_out = 0;
 	unsigned long flags;
@@ -1624,7 +1624,8 @@ fetch_events:
 			}
 
 			spin_unlock_irqrestore(&ep->lock, flags);
-			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+			if (!schedule_hrtimeout_range_clock(to, slack,
+						HRTIMER_MODE_ABS, clockid))
 				timed_out = 1;
 
 			spin_lock_irqsave(&ep->lock, flags);
@@ -1945,7 +1946,8 @@ error_return:
 }
 
 static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
-				int maxevents, const ktime_t timeout)
+				int maxevents, int clockid, 
+				const ktime_t timeout)
 {
 	int error;
 	struct fd f;
@@ -1979,7 +1981,7 @@ static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
 	ep = f.file->private_data;
 
 	/* Time to fish for events ... */
-	error = ep_poll(ep, events, maxevents, timeout);
+	error = ep_poll(ep, events, maxevents, clockid, timeout);
 
 error_fput:
 	fdput(f);
@@ -1994,12 +1996,13 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 		int, maxevents, int, timeout)
 {
 	ktime_t kt = ms_to_ktime(timeout);
-	return epoll_wait_do(epfd, events, maxevents, kt);
+	return epoll_wait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt);
 }
 
 static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
-				 int maxevents, ktime_t timeout,
-				 sigset_t *sigmask, size_t sigsetsize)
+				 int maxevents,
+				 int clockid, ktime_t timeout,
+				 sigset_t *sigmask)
 {
 	int error;
 	sigset_t sigsaved;
@@ -2013,7 +2016,7 @@ static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
 		set_current_blocked(sigmask);
 	}
 
-	error = epoll_wait_do(epfd, events, maxevents, timeout);
+	error = epoll_wait_do(epfd, events, maxevents, clockid, timeout);
 
 	/*
 	 * If we changed the signal mask, we need to restore the original one.
@@ -2050,8 +2053,8 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
 			return -EFAULT;
 	}
-	return epoll_pwait_do(epfd, events, maxevents, kt,
-			      sigmask ? &ksigmask : NULL, sigsetsize);
+	return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+			      sigmask ? &ksigmask : NULL);
 }
 
 #ifdef CONFIG_COMPAT
@@ -2073,8 +2076,8 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		sigset_from_compat(&ksigmask, &csigmask);
 	}
 
-	return epoll_pwait_do(epfd, events, maxevents, kt,
-			      sigmask ? &ksigmask : NULL, sigsetsize);
+	return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+			      sigmask ? &ksigmask : NULL);
 }
 #endif
 
-- 
1.9.3

^ permalink raw reply related

* [PATCH RFC 3/6] epoll: Add definition for epoll_mod_wait structures
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1421747878-30744-1-git-send-email-famz@redhat.com>

Two structs involved in the coming syscall is defined. Flags in epoll_mod_cmd
are reserved, which makes better word alignment and may allow future extension.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 include/uapi/linux/eventpoll.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..e32a804 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
 #include <linux/fcntl.h>
 #include <linux/types.h>
 
+#include <linux/signal.h>
+
 /* Flags for epoll_create1.  */
 #define EPOLL_CLOEXEC O_CLOEXEC
 
@@ -61,6 +63,24 @@ struct epoll_event {
 	__u64 data;
 } EPOLL_PACKED;
 
+struct epoll_mod_cmd {
+	int flags;
+	int op;
+	int fd;
+	__u32 events;
+	__u64 data;
+	int error;
+} EPOLL_PACKED;
+
+struct epoll_wait_spec {
+	int maxevents;
+	struct epoll_event *events;
+	int clockid;
+	struct timespec timeout;
+	sigset_t *sigmask;
+	size_t sigsetsize;
+} EPOLL_PACKED;
+
 #ifdef CONFIG_PM_SLEEP
 static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
 {
-- 
1.9.3

^ permalink raw reply related

* [PATCH RFC 4/6] epoll: Extract ep_ctl_do
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1421747878-30744-1-git-send-email-famz@redhat.com>

This is a common part from epoll_ctl implementation which will be shared with
the coming epoll_mod_wait.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6da143f..e7a116d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1808,22 +1808,15 @@ SYSCALL_DEFINE1(epoll_create, int, size)
  * the eventpoll file that enables the insertion/removal/change of
  * file descriptors inside the interest set.
  */
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
-		struct epoll_event __user *, event)
+int ep_ctl_do(int epfd, int op, int fd, struct epoll_event epds)
 {
 	int error;
 	int full_check = 0;
 	struct fd f, tf;
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct epoll_event epds;
 	struct eventpoll *tep = NULL;
 
-	error = -EFAULT;
-	if (ep_op_has_event(op) &&
-	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
-		goto error_return;
-
 	error = -EBADF;
 	f = fdget(epfd);
 	if (!f.file)
@@ -1945,6 +1938,23 @@ error_return:
 	return error;
 }
 
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+		struct epoll_event __user *, event)
+{
+	struct epoll_event epds;
+
+	if (ep_op_has_event(op) &&
+	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
+		return -EFAULT;
+
+	return ep_ctl_do(epfd, op, fd, epds);
+}
+
 static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
 				int maxevents, int clockid, 
 				const ktime_t timeout)
-- 
1.9.3

^ permalink raw reply related

* [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1421747878-30744-1-git-send-email-famz@redhat.com>

This syscall is a sequence of

1) a number of epoll_ctl calls
2) a epoll_pwait, with timeout enhancement.

The epoll_ctl operations are embeded so that application doesn't have to use
separate syscalls to insert/delete/update the fds before poll. It is more
efficient if the set of fds varies from one poll to another, which is the
common pattern for certain applications. For example, depending on the input
buffer status, a data reading program may decide to temporarily not polling an
fd.

Because the enablement of batching in this interface, even that regular
epoll_ctl call sequence, which manipulates several fds, can be optimized to one
single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).

The only complexity is returning the result of each operation.  For each
epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
the return code *iff* the command is executed (0 for success and -errno of the
equivalent epoll_ctl call), and will be left unchanged if the command is not
executed because some earlier error, for example due to failure of
copy_from_user to copy the array.

Applications can utilize this fact to do error handling: they could initialize
all the epoll_mod_wait.error to a positive value, which is by definition not a
possible output value from epoll_mod_wait. Then when the syscall returned, they
know whether or not the command is executed by comparing each error with the
init value, if they're different, they have the result of the command.
More roughly, they can put any non-zero and not distinguish "not run" from
failure.

Also, timeout parameter is enhanced: timespec is used, compared to the old ms
scalar. This provides higher precision. The parameter field in struct
epoll_wait_spec, "clockid", also makes it possible for users to use a different
clock than the default when it makes more sense.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h |  5 ++++
 2 files changed, 65 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index e7a116d..2cc22c9 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 			      sigmask ? &ksigmask : NULL);
 }

+SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
+		int, ncmds, struct epoll_mod_cmd __user *, cmds,
+		struct epoll_wait_spec __user *, spec)
+{
+	struct epoll_mod_cmd *kcmds = NULL;
+	int i, ret = 0;
+	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
+
+	if (flags)
+		return -EINVAL;
+	if (ncmds) {
+		if (!cmds)
+			return -EINVAL;
+		kcmds = kmalloc(cmd_size, GFP_KERNEL);
+		if (!kcmds)
+			return -ENOMEM;
+		if (copy_from_user(kcmds, cmds, cmd_size)) {
+			ret = -EFAULT;
+			goto out;
+		}
+	}
+	for (i = 0; i < ncmds; i++) {
+		struct epoll_event ev = (struct epoll_event) {
+			.events = kcmds[i].events,
+			.data = kcmds[i].data,
+		};
+		if (kcmds[i].flags) {
+			kcmds[i].error = ret = -EINVAL;
+			goto out;
+		}
+		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
+		if (ret)
+			goto out;
+	}
+	if (spec) {
+		sigset_t ksigmask;
+		struct epoll_wait_spec kspec;
+		ktime_t timeout;
+
+		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
+			return -EFAULT;
+		if (kspec.sigmask) {
+			if (kspec.sigsetsize != sizeof(sigset_t))
+				return -EINVAL;
+			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
+				return -EFAULT;
+		}
+		timeout = timespec_to_ktime(kspec.timeout);
+		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
+				     kspec.clockid, timeout,
+				     kspec.sigmask ? &ksigmask : NULL);
+	}
+
+out:
+	if (ncmds && copy_to_user(cmds, kcmds, cmd_size))
+		return -EFAULT;
+	kfree(kcmds);
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		       struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..7156c80 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,8 @@
 #define _LINUX_SYSCALLS_H

 struct epoll_event;
+struct epoll_mod_cmd;
+struct epoll_wait_spec;
 struct iattr;
 struct inode;
 struct iocb;
@@ -630,6 +632,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
 				int maxevents, int timeout,
 				const sigset_t __user *sigmask,
 				size_t sigsetsize);
+asmlinkage long sys_epoll_mod_wait(int epfd, int flags,
+				   int ncmds, struct epoll_mod_cmd __user * cmds,
+				   struct epoll_wait_spec __user * spec);
 asmlinkage long sys_gethostname(char __user *name, int len);
 asmlinkage long sys_sethostname(char __user *name, int len);
 asmlinkage long sys_setdomainname(char __user *name, int len);
-- 
1.9.3

^ permalink raw reply related

* [PATCH RFC 6/6] x86: Hook up epoll_mod_wait syscall
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1421747878-30744-1-git-send-email-famz@redhat.com>

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..52aead3 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	epoll_mod_wait		sys_epoll_mod_wait
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..c3c203a 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	epoll_mod_wait		sys_epoll_mod_wait
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.3

^ permalink raw reply related

* [PATCH v5 0/5] power: Add max77693 charger driver
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel, linux-pm, devicetree,
	linux-api
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski

Dear Sebastian,

You already acked some earlier version of this patch. In that time
there were more dependencies on MFD tree. Now the MFD patch (2/5)
is acked by Lee Jones so could you pick up everything?

Changes since v4
================
1. Add patch 5/5: Update MAINTAINERS (with Sebastian's ack [2]).

Changes since v3
================
1. Patch 3/4: Don't create platform data structure for charger settings
   because driver won't be used on non-DT SoCs (its for Exynos based
   Trats2).
2. Patch 3/4: Drop Sebastian's ack [1] because of change above.

Changes since v2
================
1. Add ack from Sebastian Reichel (charger driver).
2. Drop patch "mfd: max77693: Map charger device to its own of_node"
   (applied by Lee Jones).

Changes since v1
================
1. Add patch 2/5: mfd: max77693: Map charger device to its own of_node
   (forgot to send it in v1)
2. Remove patches for DTS and configs. I'll send them later.
3. Add ack from Lee Jones (patch 3/5).

Description
===========
The patchset adds max77693 charger driver present on Trats2 board
(and Galaxy S III). The driver configures battery charger and exposes
power supply interface.

Driver is necessary to provide full charging stack on Trats2 device
(extcon, charger-manager etc.).

[1] https://lkml.org/lkml/2014/10/27/774
[2] https://lkml.org/lkml/2015/1/16/464

Best regards,
Krzysztof

Krzysztof Kozlowski (5):
  devicetree: power/mfd: max77693: Document new bindings for charger
  mfd: max77693: Add defines for MAX77693 charger driver
  power: max77693: Add charger driver for Maxim 77693
  Documentation: power: max77693-charger: Document exported sysfs entry
  MAINTAINERS: Add entry for Maxim chargers on Samsung boards

 Documentation/ABI/testing/sysfs-class-power        |  42 ++
 Documentation/devicetree/bindings/mfd/max77693.txt |  45 ++
 MAINTAINERS                                        |   7 +
 drivers/power/Kconfig                              |   6 +
 drivers/power/Makefile                             |   1 +
 drivers/power/max77693_charger.c                   | 758 +++++++++++++++++++++
 include/linux/mfd/max77693-private.h               | 108 +++
 7 files changed, 967 insertions(+)
 create mode 100644 drivers/power/max77693_charger.c

-- 
1.9.1

^ permalink raw reply

* [PATCH v5 1/5] devicetree: power/mfd: max77693: Document new bindings for charger
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel, linux-pm, devicetree,
	linux-api
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski@samsung.com>

Document new device tree bindings for Maxim 77693 charger driver.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
---
 Documentation/devicetree/bindings/mfd/max77693.txt | 45 ++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/Documentation/devicetree/bindings/mfd/max77693.txt b/Documentation/devicetree/bindings/mfd/max77693.txt
index 01e9f30fe678..38e64405e98d 100644
--- a/Documentation/devicetree/bindings/mfd/max77693.txt
+++ b/Documentation/devicetree/bindings/mfd/max77693.txt
@@ -41,6 +41,41 @@ Optional properties:
 	 To get more informations, please refer to documentaion.
 	[*] refer Documentation/devicetree/bindings/pwm/pwm.txt
 
+- charger : Node configuring the charger driver.
+  If present, required properties:
+  - compatible : Must be "maxim,max77693-charger".
+
+  Optional properties (if not set, defaults will be used):
+  - maxim,constant-microvolt : Battery constant voltage in uV. The charger
+    will operate in fast charge constant current mode till battery voltage
+    reaches this level. Then the charger will switch to fast charge constant
+    voltage mode. Also vsys (system voltage) will be set to this value when
+    DC power is supplied but charger is not enabled.
+    Valid values: 3650000 - 4400000, step by 25000 (rounded down)
+    Default: 4200000
+
+  - maxim,min-system-microvolt : Minimal system voltage in uV.
+    Valid values: 3000000 - 3700000, step by 100000 (rounded down)
+    Default: 3600000
+
+  - maxim,thermal-regulation-celsius : Temperature in Celsius for entering
+    high temperature charging mode. If die temperature exceeds this value
+    the charging current will be reduced by 105 mA/Celsius.
+    Valid values: 70, 85, 100, 115
+    Default: 100
+
+  - maxim,battery-overcurrent-microamp : Overcurrent protection threshold
+    in uA (current from battery to system).
+    Valid values: 2000000 - 3500000, step by 250000 (rounded down)
+    Default: 3500000
+
+  - maxim,charge-input-threshold-microvolt : Threshold voltage in uV for
+    triggering input voltage regulation loop. If input voltage decreases
+    below this value, the input current will be reduced to reach the
+    threshold voltage.
+    Valid values: 4300000, 4700000, 4800000, 4900000
+    Default: 4300000
+
 Example:
 	max77693@66 {
 		compatible = "maxim,max77693";
@@ -73,4 +108,14 @@ Example:
 			pwms = <&pwm 0 40000 0>;
 			pwm-names = "haptic";
 		};
+
+		charger {
+			compatible = "maxim,max77693-charger";
+
+			maxim,constant-microvolt = <4200000>;
+			maxim,min-system-microvolt = <3600000>;
+			maxim,thermal-regulation-celsius = <75>;
+			maxim,battery-overcurrent-microamp = <3000000>;
+			maxim,charge-input-threshold-microvolt = <4300000>;
+		};
 	};
-- 
1.9.1

^ permalink raw reply related

* [PATCH v5 2/5] mfd: max77693: Add defines for MAX77693 charger driver
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>

Prepare for adding support for Maxim 77693 charger by adding necessary
new defines.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
Acked-by: Lee Jones <lee.jones-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
---
 include/linux/mfd/max77693-private.h | 108 +++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/include/linux/mfd/max77693-private.h b/include/linux/mfd/max77693-private.h
index 08dae01258b9..955dd990beaf 100644
--- a/include/linux/mfd/max77693-private.h
+++ b/include/linux/mfd/max77693-private.h
@@ -143,10 +143,118 @@ enum max77693_pmic_reg {
 #define FLASH_INT_FLED1_SHORT	BIT(3)
 #define FLASH_INT_OVER_CURRENT	BIT(4)
 
+/* Fast charge timer in in hours */
+#define DEFAULT_FAST_CHARGE_TIMER		4
+/* microamps */
+#define DEFAULT_TOP_OFF_THRESHOLD_CURRENT	150000
+/* minutes */
+#define DEFAULT_TOP_OFF_TIMER			30
+/* microvolts */
+#define DEFAULT_CONSTANT_VOLT			4200000
+/* microvolts */
+#define DEFAULT_MIN_SYSTEM_VOLT			3600000
+/* celsius */
+#define DEFAULT_THERMAL_REGULATION_TEMP		100
+/* microamps */
+#define DEFAULT_BATTERY_OVERCURRENT		3500000
+/* microvolts */
+#define DEFAULT_CHARGER_INPUT_THRESHOLD_VOLT	4300000
+
+/* MAX77693_CHG_REG_CHG_INT_OK register */
+#define CHG_INT_OK_BYP_SHIFT		0
+#define CHG_INT_OK_BAT_SHIFT		3
+#define CHG_INT_OK_CHG_SHIFT		4
+#define CHG_INT_OK_CHGIN_SHIFT		6
+#define CHG_INT_OK_DETBAT_SHIFT		7
+#define CHG_INT_OK_BYP_MASK		BIT(CHG_INT_OK_BYP_SHIFT)
+#define CHG_INT_OK_BAT_MASK		BIT(CHG_INT_OK_BAT_SHIFT)
+#define CHG_INT_OK_CHG_MASK		BIT(CHG_INT_OK_CHG_SHIFT)
+#define CHG_INT_OK_CHGIN_MASK		BIT(CHG_INT_OK_CHGIN_SHIFT)
+#define CHG_INT_OK_DETBAT_MASK		BIT(CHG_INT_OK_DETBAT_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_DETAILS_00 register */
+#define CHG_DETAILS_00_CHGIN_SHIFT	5
+#define CHG_DETAILS_00_CHGIN_MASK	(0x3 << CHG_DETAILS_00_CHGIN_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_DETAILS_01 register */
+#define CHG_DETAILS_01_CHG_SHIFT	0
+#define CHG_DETAILS_01_BAT_SHIFT	4
+#define CHG_DETAILS_01_TREG_SHIFT	7
+#define CHG_DETAILS_01_CHG_MASK		(0xf << CHG_DETAILS_01_CHG_SHIFT)
+#define CHG_DETAILS_01_BAT_MASK		(0x7 << CHG_DETAILS_01_BAT_SHIFT)
+#define CHG_DETAILS_01_TREG_MASK	BIT(7)
+
+/* MAX77693_CHG_REG_CHG_DETAILS_01/CHG field */
+enum max77693_charger_charging_state {
+	MAX77693_CHARGING_PREQUALIFICATION	= 0x0,
+	MAX77693_CHARGING_FAST_CONST_CURRENT,
+	MAX77693_CHARGING_FAST_CONST_VOLTAGE,
+	MAX77693_CHARGING_TOP_OFF,
+	MAX77693_CHARGING_DONE,
+	MAX77693_CHARGING_HIGH_TEMP,
+	MAX77693_CHARGING_TIMER_EXPIRED,
+	MAX77693_CHARGING_THERMISTOR_SUSPEND,
+	MAX77693_CHARGING_OFF,
+	MAX77693_CHARGING_RESERVED,
+	MAX77693_CHARGING_OVER_TEMP,
+	MAX77693_CHARGING_WATCHDOG_EXPIRED,
+};
+
+/* MAX77693_CHG_REG_CHG_DETAILS_01/BAT field */
+enum max77693_charger_battery_state {
+	MAX77693_BATTERY_NOBAT			= 0x0,
+	/* Dead-battery or low-battery prequalification */
+	MAX77693_BATTERY_PREQUALIFICATION,
+	MAX77693_BATTERY_TIMER_EXPIRED,
+	MAX77693_BATTERY_GOOD,
+	MAX77693_BATTERY_LOWVOLTAGE,
+	MAX77693_BATTERY_OVERVOLTAGE,
+	MAX77693_BATTERY_OVERCURRENT,
+	MAX77693_BATTERY_RESERVED,
+};
+
+/* MAX77693_CHG_REG_CHG_DETAILS_02 register */
+#define CHG_DETAILS_02_BYP_SHIFT	0
+#define CHG_DETAILS_02_BYP_MASK		(0xf << CHG_DETAILS_02_BYP_SHIFT)
+
 /* MAX77693 CHG_CNFG_00 register */
 #define CHG_CNFG_00_CHG_MASK		0x1
 #define CHG_CNFG_00_BUCK_MASK		0x4
 
+/* MAX77693_CHG_REG_CHG_CNFG_01 register */
+#define CHG_CNFG_01_FCHGTIME_SHIFT	0
+#define CHG_CNFG_01_CHGRSTRT_SHIFT	4
+#define CHG_CNFG_01_PQEN_SHIFT		7
+#define CHG_CNFG_01_FCHGTIME_MASK	(0x7 << CHG_CNFG_01_FCHGTIME_SHIFT)
+#define CHG_CNFG_01_CHGRSTRT_MASK	(0x3 << CHG_CNFG_01_CHGRSTRT_SHIFT)
+#define CHG_CNFG_01_PQEN_MAKS		BIT(CHG_CNFG_01_PQEN_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_CNFG_03 register */
+#define CHG_CNFG_03_TOITH_SHIFT		0
+#define CHG_CNFG_03_TOTIME_SHIFT	3
+#define CHG_CNFG_03_TOITH_MASK		(0x7 << CHG_CNFG_03_TOITH_SHIFT)
+#define CHG_CNFG_03_TOTIME_MASK		(0x7 << CHG_CNFG_03_TOTIME_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_CNFG_04 register */
+#define CHG_CNFG_04_CHGCVPRM_SHIFT	0
+#define CHG_CNFG_04_MINVSYS_SHIFT	5
+#define CHG_CNFG_04_CHGCVPRM_MASK	(0x1f << CHG_CNFG_04_CHGCVPRM_SHIFT)
+#define CHG_CNFG_04_MINVSYS_MASK	(0x7 << CHG_CNFG_04_MINVSYS_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_CNFG_06 register */
+#define CHG_CNFG_06_CHGPROT_SHIFT	2
+#define CHG_CNFG_06_CHGPROT_MASK	(0x3 << CHG_CNFG_06_CHGPROT_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_CNFG_07 register */
+#define CHG_CNFG_07_REGTEMP_SHIFT	5
+#define CHG_CNFG_07_REGTEMP_MASK	(0x3 << CHG_CNFG_07_REGTEMP_SHIFT)
+
+/* MAX77693_CHG_REG_CHG_CNFG_12 register */
+#define CHG_CNFG_12_B2SOVRC_SHIFT	0
+#define CHG_CNFG_12_VCHGINREG_SHIFT	3
+#define CHG_CNFG_12_B2SOVRC_MASK	(0x7 << CHG_CNFG_12_B2SOVRC_SHIFT)
+#define CHG_CNFG_12_VCHGINREG_MASK	(0x3 << CHG_CNFG_12_VCHGINREG_SHIFT)
+
 /* MAX77693 CHG_CNFG_09 Register */
 #define CHG_CNFG_09_CHGIN_ILIM_MASK	0x7F
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v5 3/5] power: max77693: Add charger driver for Maxim 77693
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel, linux-pm, devicetree,
	linux-api
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski@samsung.com>

Add new driver for Maxim 77693 switch-mode charger (part of max77693
MFD driver) providing power supply class information to userspace.

The charger has +20V tolerant input. Current input can be set from 0 to
2.58 A. The charger can deliver up to 2.1 A to the battery or 3.5 A to
the system (when supplying additional current from battery to system).

The driver is configured through DTS (battery and system related
settings) and sysfs entries (timers and top-off charging threshold).

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
---
 drivers/power/Kconfig            |   6 +
 drivers/power/Makefile           |   1 +
 drivers/power/max77693_charger.c | 758 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 765 insertions(+)
 create mode 100644 drivers/power/max77693_charger.c

diff --git a/drivers/power/Kconfig b/drivers/power/Kconfig
index da6981f92697..110d4bc03483 100644
--- a/drivers/power/Kconfig
+++ b/drivers/power/Kconfig
@@ -332,6 +332,12 @@ config CHARGER_MAX14577
 	  Say Y to enable support for the battery charger control sysfs and
 	  platform data of MAX14577/77836 MUICs.
 
+config CHARGER_MAX77693
+	tristate "Maxim MAX77693 battery charger driver"
+	depends on MFD_MAX77693
+	help
+	  Say Y to enable support for the Maxim MAX77693 battery charger.
+
 config CHARGER_MAX8997
 	tristate "Maxim MAX8997/MAX8966 PMIC battery charger driver"
 	depends on MFD_MAX8997 && REGULATOR_MAX8997
diff --git a/drivers/power/Makefile b/drivers/power/Makefile
index b83a0c749781..31216cb7e8a1 100644
--- a/drivers/power/Makefile
+++ b/drivers/power/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_CHARGER_LP8788)	+= lp8788-charger.o
 obj-$(CONFIG_CHARGER_GPIO)	+= gpio-charger.o
 obj-$(CONFIG_CHARGER_MANAGER)	+= charger-manager.o
 obj-$(CONFIG_CHARGER_MAX14577)	+= max14577_charger.o
+obj-$(CONFIG_CHARGER_MAX77693)	+= max77693_charger.o
 obj-$(CONFIG_CHARGER_MAX8997)	+= max8997_charger.o
 obj-$(CONFIG_CHARGER_MAX8998)	+= max8998_charger.o
 obj-$(CONFIG_CHARGER_BQ2415X)	+= bq2415x_charger.o
diff --git a/drivers/power/max77693_charger.c b/drivers/power/max77693_charger.c
new file mode 100644
index 000000000000..56cf2177aad4
--- /dev/null
+++ b/drivers/power/max77693_charger.c
@@ -0,0 +1,758 @@
+/*
+ * max77693_charger.c - Battery charger driver for the Maxim 77693
+ *
+ * Copyright (C) 2014 Samsung Electronics
+ * Krzysztof Kozlowski <k.kozlowski@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/power_supply.h>
+#include <linux/regmap.h>
+#include <linux/mfd/max77693.h>
+#include <linux/mfd/max77693-private.h>
+
+static const char *max77693_charger_name		= "max77693-charger";
+static const char *max77693_charger_model		= "MAX77693";
+static const char *max77693_charger_manufacturer	= "Maxim Integrated";
+
+struct max77693_charger {
+	struct device		*dev;
+	struct max77693_dev	*max77693;
+	struct power_supply	charger;
+
+	u32 constant_volt;
+	u32 min_system_volt;
+	u32 thermal_regulation_temp;
+	u32 batttery_overcurrent;
+	u32 charge_input_threshold_volt;
+};
+
+static int max77693_get_charger_state(struct regmap *regmap)
+{
+	int state;
+	unsigned int data;
+
+	if (regmap_read(regmap, MAX77693_CHG_REG_CHG_DETAILS_01, &data) < 0)
+		return POWER_SUPPLY_STATUS_UNKNOWN;
+
+	data &= CHG_DETAILS_01_CHG_MASK;
+	data >>= CHG_DETAILS_01_CHG_SHIFT;
+
+	switch (data) {
+	case MAX77693_CHARGING_PREQUALIFICATION:
+	case MAX77693_CHARGING_FAST_CONST_CURRENT:
+	case MAX77693_CHARGING_FAST_CONST_VOLTAGE:
+	case MAX77693_CHARGING_TOP_OFF:
+	/* In high temp the charging current is reduced, but still charging */
+	case MAX77693_CHARGING_HIGH_TEMP:
+		state = POWER_SUPPLY_STATUS_CHARGING;
+		break;
+	case MAX77693_CHARGING_DONE:
+		state = POWER_SUPPLY_STATUS_FULL;
+		break;
+	case MAX77693_CHARGING_TIMER_EXPIRED:
+	case MAX77693_CHARGING_THERMISTOR_SUSPEND:
+		state = POWER_SUPPLY_STATUS_NOT_CHARGING;
+		break;
+	case MAX77693_CHARGING_OFF:
+	case MAX77693_CHARGING_OVER_TEMP:
+	case MAX77693_CHARGING_WATCHDOG_EXPIRED:
+		state = POWER_SUPPLY_STATUS_DISCHARGING;
+		break;
+	case MAX77693_CHARGING_RESERVED:
+	default:
+		state = POWER_SUPPLY_STATUS_UNKNOWN;
+	}
+
+	return state;
+}
+
+static int max77693_get_charge_type(struct regmap *regmap)
+{
+	int state;
+	unsigned int data;
+
+	if (regmap_read(regmap, MAX77693_CHG_REG_CHG_DETAILS_01, &data) < 0)
+		return POWER_SUPPLY_CHARGE_TYPE_UNKNOWN;
+
+	data &= CHG_DETAILS_01_CHG_MASK;
+	data >>= CHG_DETAILS_01_CHG_SHIFT;
+
+	switch (data) {
+	case MAX77693_CHARGING_PREQUALIFICATION:
+	/*
+	 * Top-off: trickle or fast? In top-off the current varies between
+	 * 100 and 250 mA. It is higher than prequalification current.
+	 */
+	case MAX77693_CHARGING_TOP_OFF:
+		state = POWER_SUPPLY_CHARGE_TYPE_TRICKLE;
+		break;
+	case MAX77693_CHARGING_FAST_CONST_CURRENT:
+	case MAX77693_CHARGING_FAST_CONST_VOLTAGE:
+	/* In high temp the charging current is reduced, but still charging */
+	case MAX77693_CHARGING_HIGH_TEMP:
+		state = POWER_SUPPLY_CHARGE_TYPE_FAST;
+		break;
+	case MAX77693_CHARGING_DONE:
+	case MAX77693_CHARGING_TIMER_EXPIRED:
+	case MAX77693_CHARGING_THERMISTOR_SUSPEND:
+	case MAX77693_CHARGING_OFF:
+	case MAX77693_CHARGING_OVER_TEMP:
+	case MAX77693_CHARGING_WATCHDOG_EXPIRED:
+		state = POWER_SUPPLY_CHARGE_TYPE_NONE;
+		break;
+	case MAX77693_CHARGING_RESERVED:
+	default:
+		state = POWER_SUPPLY_CHARGE_TYPE_UNKNOWN;
+	}
+
+	return state;
+}
+
+/*
+ * Supported health statuses:
+ *  - POWER_SUPPLY_HEALTH_DEAD
+ *  - POWER_SUPPLY_HEALTH_GOOD
+ *  - POWER_SUPPLY_HEALTH_OVERVOLTAGE
+ *  - POWER_SUPPLY_HEALTH_SAFETY_TIMER_EXPIRE
+ *  - POWER_SUPPLY_HEALTH_UNKNOWN
+ *  - POWER_SUPPLY_HEALTH_UNSPEC_FAILURE
+ */
+static int max77693_get_battery_health(struct regmap *regmap)
+{
+	int state;
+	unsigned int data;
+
+	if (regmap_read(regmap, MAX77693_CHG_REG_CHG_DETAILS_01, &data) < 0)
+		return POWER_SUPPLY_HEALTH_UNKNOWN;
+
+	data &= CHG_DETAILS_01_BAT_MASK;
+	data >>= CHG_DETAILS_01_BAT_SHIFT;
+
+	switch (data) {
+	case MAX77693_BATTERY_NOBAT:
+		state = POWER_SUPPLY_HEALTH_DEAD;
+		break;
+	case MAX77693_BATTERY_PREQUALIFICATION:
+	case MAX77693_BATTERY_GOOD:
+	case MAX77693_BATTERY_LOWVOLTAGE:
+		state = POWER_SUPPLY_HEALTH_GOOD;
+		break;
+	case MAX77693_BATTERY_TIMER_EXPIRED:
+		/*
+		 * Took longer to charge than expected, charging suspended.
+		 * Damaged battery?
+		 */
+		state = POWER_SUPPLY_HEALTH_SAFETY_TIMER_EXPIRE;
+		break;
+	case MAX77693_BATTERY_OVERVOLTAGE:
+		state = POWER_SUPPLY_HEALTH_OVERVOLTAGE;
+		break;
+	case MAX77693_BATTERY_OVERCURRENT:
+		state = POWER_SUPPLY_HEALTH_UNSPEC_FAILURE;
+		break;
+	case MAX77693_BATTERY_RESERVED:
+	default:
+		state = POWER_SUPPLY_HEALTH_UNKNOWN;
+		break;
+	}
+
+	return state;
+}
+
+static int max77693_get_present(struct regmap *regmap)
+{
+	unsigned int data;
+
+	/*
+	 * Read CHG_INT_OK register. High DETBAT bit here should be
+	 * equal to value 0x0 in CHG_DETAILS_01/BAT field.
+	 */
+	regmap_read(regmap, MAX77693_CHG_REG_CHG_INT_OK, &data);
+	if (data & CHG_INT_OK_DETBAT_MASK)
+		return 0;
+	return 1;
+}
+
+static int max77693_get_online(struct regmap *regmap)
+{
+	unsigned int data;
+
+	regmap_read(regmap, MAX77693_CHG_REG_CHG_INT_OK, &data);
+	if (data & CHG_INT_OK_CHGIN_MASK)
+		return 1;
+	return 0;
+}
+
+static enum power_supply_property max77693_charger_props[] = {
+	POWER_SUPPLY_PROP_STATUS,
+	POWER_SUPPLY_PROP_CHARGE_TYPE,
+	POWER_SUPPLY_PROP_HEALTH,
+	POWER_SUPPLY_PROP_PRESENT,
+	POWER_SUPPLY_PROP_ONLINE,
+	POWER_SUPPLY_PROP_MODEL_NAME,
+	POWER_SUPPLY_PROP_MANUFACTURER,
+};
+
+static int max77693_charger_get_property(struct power_supply *psy,
+			    enum power_supply_property psp,
+			    union power_supply_propval *val)
+{
+	struct max77693_charger *chg = container_of(psy,
+						  struct max77693_charger,
+						  charger);
+	struct regmap *regmap = chg->max77693->regmap;
+	int ret = 0;
+
+	switch (psp) {
+	case POWER_SUPPLY_PROP_STATUS:
+		val->intval = max77693_get_charger_state(regmap);
+		break;
+	case POWER_SUPPLY_PROP_CHARGE_TYPE:
+		val->intval = max77693_get_charge_type(regmap);
+		break;
+	case POWER_SUPPLY_PROP_HEALTH:
+		val->intval = max77693_get_battery_health(regmap);
+		break;
+	case POWER_SUPPLY_PROP_PRESENT:
+		val->intval = max77693_get_present(regmap);
+		break;
+	case POWER_SUPPLY_PROP_ONLINE:
+		val->intval = max77693_get_online(regmap);
+		break;
+	case POWER_SUPPLY_PROP_MODEL_NAME:
+		val->strval = max77693_charger_model;
+		break;
+	case POWER_SUPPLY_PROP_MANUFACTURER:
+		val->strval = max77693_charger_manufacturer;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+static ssize_t device_attr_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count,
+		int (*fn)(struct max77693_charger *, unsigned long))
+{
+	struct max77693_charger *chg = dev_get_drvdata(dev);
+	unsigned long val;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	ret = fn(chg, val);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static ssize_t fast_charge_timer_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct max77693_charger *chg = dev_get_drvdata(dev);
+	unsigned int data, val;
+	int ret;
+
+	ret = regmap_read(chg->max77693->regmap, MAX77693_CHG_REG_CHG_CNFG_01,
+			&data);
+	if (ret < 0)
+		return ret;
+
+	data &= CHG_CNFG_01_FCHGTIME_MASK;
+	data >>= CHG_CNFG_01_FCHGTIME_SHIFT;
+	switch (data) {
+	case 0x1 ... 0x7:
+		/* Starting from 4 hours, step by 2 hours */
+		val = 4 + (data - 1) * 2;
+		break;
+	case 0x0:
+	default:
+		val = 0;
+		break;
+	}
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", val);
+}
+
+static int max77693_set_fast_charge_timer(struct max77693_charger *chg,
+		unsigned long hours)
+{
+	unsigned int data;
+
+	/*
+	 * 0x00 - disable
+	 * 0x01 - 4h
+	 * 0x02 - 6h
+	 * ...
+	 * 0x07 - 16h
+	 * Round down odd values.
+	 */
+	switch (hours) {
+	case 4 ... 16:
+		data = (hours - 4) / 2 + 1;
+		break;
+	case 0:
+		/* Disable */
+		data = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+	data <<= CHG_CNFG_01_FCHGTIME_SHIFT;
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_01,
+			CHG_CNFG_01_FCHGTIME_MASK, data);
+}
+
+static ssize_t fast_charge_timer_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	return device_attr_store(dev, attr, buf, count,
+			max77693_set_fast_charge_timer);
+}
+
+static ssize_t top_off_threshold_current_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct max77693_charger *chg = dev_get_drvdata(dev);
+	unsigned int data, val;
+	int ret;
+
+	ret = regmap_read(chg->max77693->regmap, MAX77693_CHG_REG_CHG_CNFG_03,
+			&data);
+	if (ret < 0)
+		return ret;
+
+	data &= CHG_CNFG_03_TOITH_MASK;
+	data >>= CHG_CNFG_03_TOITH_SHIFT;
+
+	if (data <= 0x04)
+		val = 100000 + data * 25000;
+	else
+		val = data * 50000;
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", val);
+}
+
+static int max77693_set_top_off_threshold_current(struct max77693_charger *chg,
+		unsigned long uamp)
+{
+	unsigned int data;
+
+	if (uamp < 100000 || uamp > 350000)
+		return -EINVAL;
+
+	if (uamp <= 200000)
+		data = (uamp - 100000) / 25000;
+	else
+		/* (200000, 350000> */
+		data = uamp / 50000;
+
+	data <<= CHG_CNFG_03_TOITH_SHIFT;
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_03,
+			CHG_CNFG_03_TOITH_MASK, data);
+}
+
+static ssize_t top_off_threshold_current_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	return device_attr_store(dev, attr, buf, count,
+			max77693_set_top_off_threshold_current);
+}
+
+static ssize_t top_off_timer_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct max77693_charger *chg = dev_get_drvdata(dev);
+	unsigned int data, val;
+	int ret;
+
+	ret = regmap_read(chg->max77693->regmap, MAX77693_CHG_REG_CHG_CNFG_03,
+			&data);
+	if (ret < 0)
+		return ret;
+
+	data &= CHG_CNFG_03_TOTIME_MASK;
+	data >>= CHG_CNFG_03_TOTIME_SHIFT;
+
+	val = data * 10;
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", val);
+}
+
+static int max77693_set_top_off_timer(struct max77693_charger *chg,
+		unsigned long minutes)
+{
+	unsigned int data;
+
+	if (minutes > 70)
+		return -EINVAL;
+
+	data = minutes / 10;
+	data <<= CHG_CNFG_03_TOTIME_SHIFT;
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_03,
+			CHG_CNFG_03_TOTIME_MASK, data);
+}
+
+static ssize_t top_off_timer_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	return device_attr_store(dev, attr, buf, count,
+			max77693_set_top_off_timer);
+}
+
+static DEVICE_ATTR_RW(fast_charge_timer);
+static DEVICE_ATTR_RW(top_off_threshold_current);
+static DEVICE_ATTR_RW(top_off_timer);
+
+static int max77693_set_constant_volt(struct max77693_charger *chg,
+		unsigned int uvolt)
+{
+	unsigned int data;
+
+	/*
+	 * 0x00 - 3.650 V
+	 * 0x01 - 3.675 V
+	 * ...
+	 * 0x1b - 4.325 V
+	 * 0x1c - 4.340 V
+	 * 0x1d - 4.350 V
+	 * 0x1e - 4.375 V
+	 * 0x1f - 4.400 V
+	 */
+	if (uvolt >= 3650000 && uvolt < 4340000)
+		data = (uvolt - 3650000) / 25000;
+	else if (uvolt >= 4340000 && uvolt < 4350000)
+		data = 0x1c;
+	else if (uvolt >= 4350000 && uvolt <= 4400000)
+		data = 0x1d + (uvolt - 4350000) / 25000;
+	else {
+		dev_err(chg->dev, "Wrong value for charging constant voltage\n");
+		return -EINVAL;
+	}
+
+	data <<= CHG_CNFG_04_CHGCVPRM_SHIFT;
+
+	dev_dbg(chg->dev, "Charging constant voltage: %u (0x%x)\n", uvolt,
+			data);
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_04,
+			CHG_CNFG_04_CHGCVPRM_MASK, data);
+}
+
+static int max77693_set_min_system_volt(struct max77693_charger *chg,
+		unsigned int uvolt)
+{
+	unsigned int data;
+
+	if (uvolt < 3000000 || uvolt > 3700000) {
+		dev_err(chg->dev, "Wrong value for minimum system regulation voltage\n");
+		return -EINVAL;
+	}
+
+	data = (uvolt - 3000000) / 100000;
+
+	data <<= CHG_CNFG_04_MINVSYS_SHIFT;
+
+	dev_dbg(chg->dev, "Minimum system regulation voltage: %u (0x%x)\n",
+			uvolt, data);
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_04,
+			CHG_CNFG_04_MINVSYS_MASK, data);
+}
+
+static int max77693_set_thermal_regulation_temp(struct max77693_charger *chg,
+		unsigned int cels)
+{
+	unsigned int data;
+
+	switch (cels) {
+	case 70:
+	case 85:
+	case 100:
+	case 115:
+		data = (cels - 70) / 15;
+		break;
+	default:
+		dev_err(chg->dev, "Wrong value for thermal regulation loop temperature\n");
+		return -EINVAL;
+	}
+
+	data <<= CHG_CNFG_07_REGTEMP_SHIFT;
+
+	dev_dbg(chg->dev, "Thermal regulation loop temperature: %u (0x%x)\n",
+			cels, data);
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_07,
+			CHG_CNFG_07_REGTEMP_MASK, data);
+}
+
+static int max77693_set_batttery_overcurrent(struct max77693_charger *chg,
+		unsigned int uamp)
+{
+	unsigned int data;
+
+	if (uamp && (uamp < 2000000 || uamp > 3500000)) {
+		dev_err(chg->dev, "Wrong value for battery overcurrent\n");
+		return -EINVAL;
+	}
+
+	if (uamp)
+		data = ((uamp - 2000000) / 250000) + 1;
+	else
+		data = 0; /* disable */
+
+	data <<= CHG_CNFG_12_B2SOVRC_SHIFT;
+
+	dev_dbg(chg->dev, "Battery overcurrent: %u (0x%x)\n", uamp, data);
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_12,
+			CHG_CNFG_12_B2SOVRC_MASK, data);
+}
+
+static int max77693_set_charge_input_threshold_volt(struct max77693_charger *chg,
+		unsigned int uvolt)
+{
+	unsigned int data;
+
+	switch (uvolt) {
+	case 4300000:
+		data = 0x0;
+		break;
+	case 4700000:
+	case 4800000:
+	case 4900000:
+		data = (uvolt - 4700000) / 100000;
+	default:
+		dev_err(chg->dev, "Wrong value for charge input voltage regulation threshold\n");
+		return -EINVAL;
+	}
+
+	data <<= CHG_CNFG_12_VCHGINREG_SHIFT;
+
+	dev_dbg(chg->dev, "Charge input voltage regulation threshold: %u (0x%x)\n",
+			uvolt, data);
+
+	return regmap_update_bits(chg->max77693->regmap,
+			MAX77693_CHG_REG_CHG_CNFG_12,
+			CHG_CNFG_12_VCHGINREG_MASK, data);
+}
+
+/*
+ * Sets charger registers to proper and safe default values.
+ */
+static int max77693_reg_init(struct max77693_charger *chg)
+{
+	int ret;
+	unsigned int data;
+
+	/* Unlock charger register protection */
+	data = (0x3 << CHG_CNFG_06_CHGPROT_SHIFT);
+	ret = regmap_update_bits(chg->max77693->regmap,
+				MAX77693_CHG_REG_CHG_CNFG_06,
+				CHG_CNFG_06_CHGPROT_MASK, data);
+	if (ret) {
+		dev_err(chg->dev, "Error unlocking registers: %d\n", ret);
+		return ret;
+	}
+
+	ret = max77693_set_fast_charge_timer(chg, DEFAULT_FAST_CHARGE_TIMER);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_top_off_threshold_current(chg,
+			DEFAULT_TOP_OFF_THRESHOLD_CURRENT);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_top_off_timer(chg, DEFAULT_TOP_OFF_TIMER);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_constant_volt(chg, chg->constant_volt);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_min_system_volt(chg, chg->min_system_volt);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_thermal_regulation_temp(chg,
+			chg->thermal_regulation_temp);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_batttery_overcurrent(chg, chg->batttery_overcurrent);
+	if (ret)
+		return ret;
+
+	ret = max77693_set_charge_input_threshold_volt(chg,
+			chg->charge_input_threshold_volt);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+#ifdef CONFIG_OF
+static int max77693_dt_init(struct device *dev, struct max77693_charger *chg)
+{
+	struct device_node *np = dev->of_node;
+
+	if (!np) {
+		dev_err(dev, "no charger OF node\n");
+		return -EINVAL;
+	}
+
+	if (of_property_read_u32(np, "maxim,constant-microvolt",
+			&chg->constant_volt))
+		chg->constant_volt = DEFAULT_CONSTANT_VOLT;
+
+	if (of_property_read_u32(np, "maxim,min-system-microvolt",
+			&chg->min_system_volt))
+		chg->min_system_volt = DEFAULT_MIN_SYSTEM_VOLT;
+
+	if (of_property_read_u32(np, "maxim,thermal-regulation-celsius",
+			&chg->thermal_regulation_temp))
+		chg->thermal_regulation_temp = DEFAULT_THERMAL_REGULATION_TEMP;
+
+	if (of_property_read_u32(np, "maxim,battery-overcurrent-microamp",
+			&chg->batttery_overcurrent))
+		chg->batttery_overcurrent = DEFAULT_BATTERY_OVERCURRENT;
+
+	if (of_property_read_u32(np, "maxim,charge-input-threshold-microvolt",
+			&chg->charge_input_threshold_volt))
+		chg->charge_input_threshold_volt =
+			DEFAULT_CHARGER_INPUT_THRESHOLD_VOLT;
+
+	return 0;
+}
+#else /* CONFIG_OF */
+static int max77693_dt_init(struct device *dev, struct max77693_charger *chg)
+{
+	return 0;
+}
+#endif /* CONFIG_OF */
+
+static int max77693_charger_probe(struct platform_device *pdev)
+{
+	struct max77693_charger *chg;
+	struct max77693_dev *max77693 = dev_get_drvdata(pdev->dev.parent);
+	int ret;
+
+	chg = devm_kzalloc(&pdev->dev, sizeof(*chg), GFP_KERNEL);
+	if (!chg)
+		return -ENOMEM;
+
+	platform_set_drvdata(pdev, chg);
+	chg->dev = &pdev->dev;
+	chg->max77693 = max77693;
+
+	ret = max77693_dt_init(&pdev->dev, chg);
+	if (ret)
+		return ret;
+
+	ret = max77693_reg_init(chg);
+	if (ret)
+		return ret;
+
+	chg->charger.name = max77693_charger_name;
+	chg->charger.type = POWER_SUPPLY_TYPE_BATTERY;
+	chg->charger.properties = max77693_charger_props;
+	chg->charger.num_properties = ARRAY_SIZE(max77693_charger_props);
+	chg->charger.get_property = max77693_charger_get_property;
+
+	ret = device_create_file(&pdev->dev, &dev_attr_fast_charge_timer);
+	if (ret) {
+		dev_err(&pdev->dev, "failed: create fast charge timer sysfs entry\n");
+		goto err;
+	}
+
+	ret = device_create_file(&pdev->dev,
+			&dev_attr_top_off_threshold_current);
+	if (ret) {
+		dev_err(&pdev->dev, "failed: create top off current sysfs entry\n");
+		goto err;
+	}
+
+	ret = device_create_file(&pdev->dev, &dev_attr_top_off_timer);
+	if (ret) {
+		dev_err(&pdev->dev, "failed: create top off timer sysfs entry\n");
+		goto err;
+	}
+
+	ret = power_supply_register(&pdev->dev, &chg->charger);
+	if (ret) {
+		dev_err(&pdev->dev, "failed: power supply register\n");
+		goto err;
+	}
+
+	return 0;
+
+err:
+	device_remove_file(&pdev->dev, &dev_attr_top_off_timer);
+	device_remove_file(&pdev->dev, &dev_attr_top_off_threshold_current);
+	device_remove_file(&pdev->dev, &dev_attr_fast_charge_timer);
+
+	return ret;
+}
+
+static int max77693_charger_remove(struct platform_device *pdev)
+{
+	struct max77693_charger *chg = platform_get_drvdata(pdev);
+
+	device_remove_file(&pdev->dev, &dev_attr_top_off_timer);
+	device_remove_file(&pdev->dev, &dev_attr_top_off_threshold_current);
+	device_remove_file(&pdev->dev, &dev_attr_fast_charge_timer);
+
+	power_supply_unregister(&chg->charger);
+
+	return 0;
+}
+
+static const struct platform_device_id max77693_charger_id[] = {
+	{ "max77693-charger", 0, },
+	{ }
+};
+MODULE_DEVICE_TABLE(platform, max77693_charger_id);
+
+static struct platform_driver max77693_charger_driver = {
+	.driver = {
+		.owner	= THIS_MODULE,
+		.name	= "max77693-charger",
+	},
+	.probe		= max77693_charger_probe,
+	.remove		= max77693_charger_remove,
+	.id_table	= max77693_charger_id,
+};
+module_platform_driver(max77693_charger_driver);
+
+MODULE_AUTHOR("Krzysztof Kozlowski <k.kozlowski@samsung.com>");
+MODULE_DESCRIPTION("Maxim 77693 charger driver");
+MODULE_LICENSE("GPL");
-- 
1.9.1


^ permalink raw reply related

* [PATCH v5 4/5] Documentation: power: max77693-charger: Document exported sysfs entry
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel, linux-pm, devicetree,
	linux-api
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski@samsung.com>

Document the settings exported by max77693 charger driver through sysfs
entries:
 - fast_charge_timer
 - top_off_threshold_current
 - top_off_timer

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
---
 Documentation/ABI/testing/sysfs-class-power | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-power b/Documentation/ABI/testing/sysfs-class-power
index 909e7602c717..369d2a2d7d3e 100644
--- a/Documentation/ABI/testing/sysfs-class-power
+++ b/Documentation/ABI/testing/sysfs-class-power
@@ -32,3 +32,45 @@ Description:
 		Valid values:
 		- 5, 6 or 7 (hours),
 		- 0: disabled.
+
+What:		/sys/class/power_supply/max77693-charger/device/fast_charge_timer
+Date:		January 2015
+KernelVersion:	3.19.0
+Contact:	Krzysztof Kozlowski <k.kozlowski@samsung.com>
+Description:
+		This entry shows and sets the maximum time the max77693
+		charger operates in fast-charge mode. When the timer expires
+		the device will terminate fast-charge mode (charging current
+		will drop to 0 A) and will trigger interrupt.
+
+		Valid values:
+		- 4 - 16 (hours), step by 2 (rounded down)
+		- 0: disabled.
+
+What:		/sys/class/power_supply/max77693-charger/device/top_off_threshold_current
+Date:		January 2015
+KernelVersion:	3.19.0
+Contact:	Krzysztof Kozlowski <k.kozlowski@samsung.com>
+Description:
+		This entry shows and sets the charging current threshold for
+		entering top-off charging mode. When charging current in fast
+		charge mode drops below this value, the charger will trigger
+		interrupt and start top-off charging mode.
+
+		Valid values:
+		- 100000 - 200000 (microamps), step by 25000 (rounded down)
+		- 200000 - 350000 (microamps), step by 50000 (rounded down)
+		- 0: disabled.
+
+What:		/sys/class/power_supply/max77693-charger/device/top_off_timer
+Date:		January 2015
+KernelVersion:	3.19.0
+Contact:	Krzysztof Kozlowski <k.kozlowski@samsung.com>
+Description:
+		This entry shows and sets the maximum time the max77693
+		charger operates in top-off charge mode. When the timer expires
+		the device will terminate top-off charge mode (charging current
+		will drop to 0 A) and will trigger interrupt.
+
+		Valid values:
+		- 0 - 70 (minutes), step by 10 (rounded down)
-- 
1.9.1

^ permalink raw reply related

* [PATCH v5 5/5] MAINTAINERS: Add entry for Maxim chargers on Samsung boards
From: Krzysztof Kozlowski @ 2015-01-20 10:00 UTC (permalink / raw)
  To: Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	Samuel Ortiz, Lee Jones, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: Kyungmin Park, Bartlomiej Zolnierkiewicz, Krzysztof Kozlowski
In-Reply-To: <1421748056-25735-1-git-send-email-k.kozlowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>

Add myself as supporter to help in reviewing patches for Maxim 14577 and
77693 MUIC charger drivers. These are used on Exynos-based boards
(Trats 2, Gear 1 and Gear 2).

Signed-off-by: Krzysztof Kozlowski <k.kozlowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
Cc: Sebastian Reichel <sre-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Dmitry Eremin-Solenikov <dbaryshkov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: David Woodhouse <dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Acked-By: Sebastian Reichel <sre-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index dbc17b3b51d4..c6fbb8cb58de 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6152,6 +6152,13 @@ F:	Documentation/devicetree/bindings/i2c/max6697.txt
 F:	drivers/hwmon/max6697.c
 F:	include/linux/platform_data/max6697.h
 
+MAXIM MUIC CHARGER DRIVERS FOR EXYNOS BASED BOARDS
+M:	Krzysztof Kozlowski <k.kozlowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
+L:	linux-pm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+S:	Supported
+F:	drivers/power/max14577_charger.c
+F:	drivers/power/max77693_charger.c
+
 MAXIRADIO FM RADIO RECEIVER DRIVER
 M:	Hans Verkuil <hverkuil-qWit8jRvyhVmR6Xm/wNWPw@public.gmane.org>
 L:	linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH] tpm: fix format string error in tpm-chip.c
From: Jarkko Sakkinen @ 2015-01-20 10:03 UTC (permalink / raw)
  To: Peter Huewe, Ashley Lai, Marcel Selhorst
  Cc: tpmdd-devel, linux-kernel, josh, christophe.ricard,
	jason.gunthorpe, stefanb, linux-api, trousers-tech, jmorris,
	Jarkko Sakkinen

dev_set_name() takes three arguments where the second argument is
a format string. This patch fixes the call accordingly in tpm-chip.c

Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
 drivers/char/tpm/tpm-chip.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
index 6459af7..1d278cc 100644
--- a/drivers/char/tpm/tpm-chip.c
+++ b/drivers/char/tpm/tpm-chip.c
@@ -125,7 +125,7 @@ struct tpm_chip *tpmm_chip_alloc(struct device *dev,
 	else
 		chip->dev.devt = MKDEV(MAJOR(tpm_devt), chip->dev_num);
 
-	dev_set_name(&chip->dev, chip->devname);
+	dev_set_name(&chip->dev, "%s", chip->devname);
 
 	device_initialize(&chip->dev);
 
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Rasmus Villemoes @ 2015-01-20 10:37 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
	Peter Zijlstra, linux-fsdevel-u79uwXL29TbrhsbdSgBK9A
In-Reply-To: <1421747878-30744-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tue, Jan 20 2015, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> DESCRIPTION
>
>        The epoll_mod_wait() system call can be seen as an enhanced combination
>        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
>        call. It is superior in two cases:
>        
>        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
>        will save context switches between user mode and kernel mode;
>        
>        2) When you need higher precision than microsecond for wait timeout.

You probably want to say millisecond.

>            struct epoll_mod_cmd {
[...]
>            };


>            struct epoll_wait_spec {
[...]
>            } EPOLL_PACKED;

Either both or none of these should mention that EPOLL_PACKED is in fact
part of the actual definition. The changelog for 3/6 sorta mentions
that it's not really needed for epoll_mod_cmd. Why is it necessary for
either struct?

> RETURN VALUE
>
>        When successful, epoll_mod_wait() returns the number of file
>        descriptors ready for the requested I/O, or zero if no file descriptor
>        became ready during the requested timeout milliseconds.

And here, it doesn't make sense to mention a unit, since the new timeout
is given using struct timespec (this was the whole point, right?).

Rasmus

^ permalink raw reply

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Fam Zheng @ 2015-01-20 10:53 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
	Peter Zijlstra, linux-fsdevel-u79uwXL29TbrhsbdSgBK9A
In-Reply-To: <874mrl3fh9.fsf-qQsb+v5E8BnlAoU/VqSP6n9LOBIZ5rWg@public.gmane.org>

On Tue, 01/20 11:37, Rasmus Villemoes wrote:
> On Tue, Jan 20 2015, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > DESCRIPTION
> >
> >        The epoll_mod_wait() system call can be seen as an enhanced combination
> >        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> >        call. It is superior in two cases:
> >        
> >        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> >        will save context switches between user mode and kernel mode;
> >        
> >        2) When you need higher precision than microsecond for wait timeout.
> 
> You probably want to say millisecond.

Yes, you see that I just can't make this right. :)

> 
> >            struct epoll_mod_cmd {
> [...]
> >            };
> 
> 
> >            struct epoll_wait_spec {
> [...]
> >            } EPOLL_PACKED;
> 
> Either both or none of these should mention that EPOLL_PACKED is in fact
> part of the actual definition. The changelog for 3/6 sorta mentions
> that it's not really needed for epoll_mod_cmd. Why is it necessary for
> either struct?

Yeah. it's probably not really necessary.

> 
> > RETURN VALUE
> >
> >        When successful, epoll_mod_wait() returns the number of file
> >        descriptors ready for the requested I/O, or zero if no file descriptor
> >        became ready during the requested timeout milliseconds.
> 
> And here, it doesn't make sense to mention a unit, since the new timeout
> is given using struct timespec (this was the whole point, right?).

Right!

Thanks,
Fam

^ permalink raw reply

* Re: [PATCH v3 00/13] Add kdbus implementation
From: Johannes Stezenbach @ 2015-01-20 10:57 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: arnd, ebiederm, gnomes, teg, jkosina, luto, linux-api,
	linux-kernel, daniel, dh.herrmann, tixxdz
In-Reply-To: <20150120011359.GE865@kroah.com>

On Tue, Jan 20, 2015 at 09:13:59AM +0800, Greg Kroah-Hartman wrote:
> On Tue, Jan 20, 2015 at 12:38:12AM +0100, Johannes Stezenbach wrote:
> > Those automotive applications you
> > were talking about, what was the OS they were ported from
> > and what was the messaging API they used?
> 
> They were ported from QNX and I don't know the exact api, it is wrapped
> up in a library layer for them to use.  And typically, they run about
> 40 thousand messages in the first few seconds of startup time.  Or was
> it 400 thousand?  Something huge and crazy to be doing on tiny ARM
> chips, but that's the IVI industry for you :(

So I did some googling and found in QNX servers create a channel
to receive messages, and clients connect to this channel.
Multiple clients can connect to the channel.
But it is not a bus -- no multicast/broadcast, and no name
service or policy rules like D-Bus has.  To me it looks
to be similar in functionality to UNIX domain sockets.

My guess is that the people porting from QNX were just confused
and their use of D-Bus was in error.  Maybe they should've used
plain sockets, capnproto, ZeroMQ or whatever.

> > As I said before, I'm seeing about a dozen D-Bus messages per minute,
> > nothing that would justify adding kdbus to the kernel for
> > performance reasons.  Wrt security I'm also not aware of any
> > open issues with D-Bus.  Thus I doubt normal users of D-Bus
> > would see any benefit from kdbus.  I also think none of the
> > applications I can install from my distribution has any performance
> > issue with D-Bus.
> 
> That's because people have not done anything really needing performance
> on the desktop over D-Bus in the past due to how slow the current
> implementation is.  Now that this is being resolved, that can change,
> and there are demos out there of even streaming audio over kdbus with no
> problems.
> 
> But performance is not just the only reason we want this in the kernel,
> I listed a whole long range of them.  Sure, it's great to now be faster,
> cutting down the number of context switches and copies by a huge amount,
> but the other things are equally important for future development
> (namespaces, containers, security, early-boot, etc.)
> 
> > And this is the point where I ask myself if I missed something.
> 
> Don't focus purely on performance for your existing desktop system,
> that's not the only use case here.  There are lots of others, as I
> document, that can benefit and want this.
> 
> One "fun" thing I've been talking to someone about is the ability to
> even port binder to be on top of kdbus.  But that's just a research
> project, and requires some API changes on the userspace binder side, but
> it shows real promise, and would then mean that we could deprecate the
> old binder code and a few hundred million devices could then use kdbus
> instead.  But that's long-term goals, not really all that relevant here,
> but it shows that having a solid bus IPC mechanism is a powerful thing
> that we have been missing in the past from Linux.

Well, IMHO you got it backwards.  Before adding a complex new IPC
API to the kernel you should do the homework and gather some
evidence that there will be enough users to justify the addition.

But maybe I misunderstood the purpose of this thread and you're
just advertising it to find possible users instead of already
suggesting to merge it?  If someone has some convincing story
to share why kdbus would solve their IPC needs, I'm all ears.

(I'm sorry this implies your responses so far were not convincing:
not verifiable facts, no numbers, no testimonials etc.)

FWIW, my gut feeling was that the earlier attempts to add a new
IPC primitve like multicast UNIX domain sockets
http://thread.gmane.org/gmane.linux.kernel/1255575/focus=1257999
were a much saner approach.  But now I think the comments
from this old thread have not been addressed, instead the
new approach just made the thing more complex and
put in ipc/ instead of net/ to bypass the guards.

Thanks,
Johannes

^ permalink raw reply

* Re: [PATCH v3 00/13] Add kdbus implementation
From: Greg Kroah-Hartman @ 2015-01-20 11:26 UTC (permalink / raw)
  To: Johannes Stezenbach
  Cc: arnd-r2nGTMty4D4, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	daniel-cYrQPVfZoowdnm+yROfE0A, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <20150120105712.GA6260-FF7aIK3TAVNeoWH0uzbU5w@public.gmane.org>

On Tue, Jan 20, 2015 at 11:57:12AM +0100, Johannes Stezenbach wrote:
> On Tue, Jan 20, 2015 at 09:13:59AM +0800, Greg Kroah-Hartman wrote:
> > On Tue, Jan 20, 2015 at 12:38:12AM +0100, Johannes Stezenbach wrote:
> > > Those automotive applications you
> > > were talking about, what was the OS they were ported from
> > > and what was the messaging API they used?
> > 
> > They were ported from QNX and I don't know the exact api, it is wrapped
> > up in a library layer for them to use.  And typically, they run about
> > 40 thousand messages in the first few seconds of startup time.  Or was
> > it 400 thousand?  Something huge and crazy to be doing on tiny ARM
> > chips, but that's the IVI industry for you :(
> 
> So I did some googling and found in QNX servers create a channel
> to receive messages, and clients connect to this channel.
> Multiple clients can connect to the channel.

Hence, a bus :)

> But it is not a bus -- no multicast/broadcast, and no name
> service or policy rules like D-Bus has.  To me it looks
> to be similar in functionality to UNIX domain sockets.

It's not as complex as D-Bus, but it's still subscribing to things and
getting messages.

> My guess is that the people porting from QNX were just confused
> and their use of D-Bus was in error.  Maybe they should've used
> plain sockets, capnproto, ZeroMQ or whatever.

I tend to trust that they knew what they were doing, they wouldn't have
picked D-Bus for no good reason.

> Well, IMHO you got it backwards.  Before adding a complex new IPC
> API to the kernel you should do the homework and gather some
> evidence that there will be enough users to justify the addition.

systemd wants this today for early boot.  It will remove lots of code
and enable a lot of good things to happen.  The first email in this
thread describes this quite well, is that not sufficient?

> FWIW, my gut feeling was that the earlier attempts to add a new
> IPC primitve like multicast UNIX domain sockets
> http://thread.gmane.org/gmane.linux.kernel/1255575/focus=1257999
> were a much saner approach.  But now I think the comments
> from this old thread have not been addressed, instead the
> new approach just made the thing more complex and
> put in ipc/ instead of net/ to bypass the guards.

Not at all, the networking maintainers said that that proposal was not
acceptable to them at all and it should not be done in the networking
stack at all.  So this was solution was created instead, which provides
a lot more things than the old networking patches did, which shows that
the networking developers were right to reject it.

thanks,

greg k-h

^ permalink raw reply

* Re: Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
From: Masami Hiramatsu @ 2015-01-20 11:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, Linux API,
	Network Development, LKML, zhangwei(Jovi),
	yrl.pp-manager.tt@hitachi.com
In-Reply-To: <CAMEtUuwrpqRG4a=Hqnj3JBKuLbC4yV+trVAZhevKLbCsm_6U4Q@mail.gmail.com>

(2015/01/20 12:55), Alexei Starovoitov wrote:
> On Mon, Jan 19, 2015 at 6:58 PM, Masami Hiramatsu
> <masami.hiramatsu.pt@hitachi.com> wrote:
>>>
>>> it's done already... one can do the same skb->dev->name logic
>>> in kprobe attached program... so from bpf program point of view,
>>> tracepoints and kprobes feature-wise are exactly the same.
>>> Only input is different.
>>
>> No, I meant that the input should also be same, at least for the first step.
>> I guess it is easy to hook the ring buffer committing and fetch arguments
>> from the event entry.
> 
> No. That would be very slow. See my comment to Steven
> and more detailed numbers below.

Thank you for measuring the performance differences.
Indeed, the ring buffer looks slow.

> Allocating ring buffer takes too much time.
> 
>> And what I expected scenario was
>>
>> 1. setup kprobe traceevent with fd, buf, count by using perf-probe.
>> 2. load bpf module
>> 3. the module processes given event arguments.
> 
> from ring buffer? that's too slow.

Ok, BTW, would you think is it possible to use a reusable small scratchpad
memory for passing arguments? (just a thought)

> It's not usable for high frequency events which
> need this in-kernel aggregation.
> If events are rare, then just dumping everything
> into trace buffer is just fine. No in-kernel program is needed.

Hmm, let me ensure your point, the performance number is the reason why
we need to do it in the kernel, right? Not mainly for the flexibility but speed.

>> Hmm, it sounds making another systemtap on top of tracepoint and kprobes.
>> Why don't you just reuse the existing facilities (perftools and ftrace)
>> instead of co-exist?
> 
> hmm. I don't think we're on the same page yet...
> ring buffer and tracing interface is fully reused.
> programs are run as soon as event triggers.
> They can return non-zero and kernel will allocate ring
> buffer which user space will consume.
> Please take a look at tracex1

I see, this code itself is not a destructive change.

>>> Just look how ktap scripts look alike for kprobes and tracepoints.
>>
>> Ktap is a good example, it provides only a language parser and a runtime engine.
>> Actually, currently it lacks a feature to execute "perf-probe" helper from
>> script, but it is easy to add such feature.
> ...
>> For this usecase, I've made --output option for perf probe
>> https://lkml.org/lkml/2014/10/31/210
> 
> you're proposing to call perf binary from ktap binary?

Yes, that's right :)

> I think packaging headaches and error conditions
> will make such approach very hard to use.

No, I don't think so. perf can be a "buffer" from the kernel API
and command-line API. If you need to get clearer error, you also
can join the upstream development.

> it would be much cleaner to have ktap as part of perf
> generating bpf on the fly and feeding into kernel.
> 'perf probe' parsing and functions don't belong in kernel
> when userspace can generate them in more efficient way.

No, perf probe still be needed to users who don't choose "injecting
binary blob" tracing. Efficiency is NOT only one index.

- perf probe and kprobe-event gives us a complete understandable
 interface for what will be recorded at where.
 (we can see the event definitions via kprobe_events interface,
  without any tools)
- kprobe-event gives a completely same interface as other tracepoint
  events.
- it also doesn't require any build-binary parts :) nor special tools.
  We can play with ftrace on just a small busybox.

However, this does NOT interfere your patch upstreaming. I just said current
ftrace method is also meaningful for some reasons :)


By the way, I concern about that bpf compiler can become another systemtap,
especially if you build it on llvm. Would you plan to develop it on kernel
tree? or apart from the kernel-side development?
I think it is hard to sync the development if you do it out-of-tree.


Thank you,



-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply

* Re: [PATCH tip 4/9] samples: bpf: simple tracing example in eBPF assembler
From: Masami Hiramatsu @ 2015-01-20 11:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, linux-api,
	netdev, linux-kernel, yrl.pp-manager.tt@hitachi.com
In-Reply-To: <1421381770-4866-5-git-send-email-ast@plumgrid.com>

(2015/01/16 13:16), Alexei Starovoitov wrote:
> simple packet drop monitor:
> - in-kernel eBPF program attaches to kfree_skb() event and records number
>   of packet drops at given location
> - userspace iterates over the map every second and prints stats

Hmm, this eBPF assembly macros are very interesting. Maybe I can
replace current kprobe's argument "fetching methods" with this.

Thank you,

> 
> Usage:
> $ sudo dropmon
> location 0xffffffff81695995 count 1
> location 0xffffffff816d0da9 count 2
> 
> location 0xffffffff81695995 count 2
> location 0xffffffff816d0da9 count 2
> 
> location 0xffffffff81695995 count 3
> location 0xffffffff816d0da9 count 2
> 
> $ addr2line -ape ./bld_x64/vmlinux 0xffffffff81695995 0xffffffff816d0da9
> 0xffffffff81695995: ./bld_x64/../net/ipv4/icmp.c:1038
> 0xffffffff816d0da9: ./bld_x64/../net/unix/af_unix.c:1231
> 
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  samples/bpf/Makefile  |    2 +
>  samples/bpf/dropmon.c |  129 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 131 insertions(+)
>  create mode 100644 samples/bpf/dropmon.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index b5b3600dcdf5..789691374562 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -6,7 +6,9 @@ hostprogs-y := test_verifier test_maps
>  hostprogs-y += sock_example
>  hostprogs-y += sockex1
>  hostprogs-y += sockex2
> +hostprogs-y += dropmon
>  
> +dropmon-objs := dropmon.o libbpf.o
>  test_verifier-objs := test_verifier.o libbpf.o
>  test_maps-objs := test_maps.o libbpf.o
>  sock_example-objs := sock_example.o libbpf.o
> diff --git a/samples/bpf/dropmon.c b/samples/bpf/dropmon.c
> new file mode 100644
> index 000000000000..9a2cd3344d69
> --- /dev/null
> +++ b/samples/bpf/dropmon.c
> @@ -0,0 +1,129 @@
> +/* simple packet drop monitor:
> + * - in-kernel eBPF program attaches to kfree_skb() event and records number
> + *   of packet drops at given location
> + * - userspace iterates over the map every second and prints stats
> + */
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <linux/bpf.h>
> +#include <errno.h>
> +#include <linux/unistd.h>
> +#include <string.h>
> +#include <linux/filter.h>
> +#include <stdlib.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <stdbool.h>
> +#include "libbpf.h"
> +
> +#define TRACEPOINT "/sys/kernel/debug/tracing/events/skb/kfree_skb/"
> +
> +static int write_to_file(const char *file, const char *str, bool keep_open)
> +{
> +	int fd, err;
> +
> +	fd = open(file, O_WRONLY);
> +	err = write(fd, str, strlen(str));
> +	(void) err;
> +
> +	if (keep_open) {
> +		return fd;
> +	} else {
> +		close(fd);
> +		return -1;
> +	}
> +}
> +
> +static int dropmon(void)
> +{
> +	long long key, next_key, value = 0;
> +	int prog_fd, map_fd, i;
> +	char fmt[32];
> +
> +	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1024);
> +	if (map_fd < 0) {
> +		printf("failed to create map '%s'\n", strerror(errno));
> +		goto cleanup;
> +	}
> +
> +	/* the following eBPF program is equivalent to C:
> +	 * int filter(struct bpf_context *ctx)
> +	 * {
> +	 *   long loc = ctx->arg2;
> +	 *   long init_val = 1;
> +	 *   long *value;
> +	 *
> +	 *   value = bpf_map_lookup_elem(MAP_ID, &loc);
> +	 *   if (value) {
> +	 *      __sync_fetch_and_add(value, 1);
> +	 *   } else {
> +	 *      bpf_map_update_elem(MAP_ID, &loc, &init_val, BPF_ANY);
> +	 *   }
> +	 *   return 0;
> +	 * }
> +	 */
> +	struct bpf_insn prog[] = {
> +		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 *)(r1 + 8) */
> +		BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp - 8) = r2 */
> +		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
> +		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
> +		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
> +		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
> +		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
> +		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
> +		BPF_EXIT_INSN(),
> +		BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 1 */
> +		BPF_MOV64_IMM(BPF_REG_4, BPF_ANY),
> +		BPF_MOV64_REG(BPF_REG_3, BPF_REG_10),
> +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
> +		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
> +		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
> +		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
> +		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
> +		BPF_EXIT_INSN(),
> +	};
> +
> +	prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog,
> +				sizeof(prog), "GPL");
> +	if (prog_fd < 0) {
> +		printf("failed to load prog '%s'\n%s", strerror(errno), bpf_log_buf);
> +		return -1;
> +	}
> +
> +	sprintf(fmt, "bpf_%d", prog_fd);
> +
> +	write_to_file(TRACEPOINT "filter", fmt, true);
> +
> +	for (i = 0; i < 10; i++) {
> +		key = 0;
> +		while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
> +			bpf_lookup_elem(map_fd, &next_key, &value);
> +			printf("location 0x%llx count %lld\n", next_key, value);
> +			key = next_key;
> +		}
> +		if (key)
> +			printf("\n");
> +		sleep(1);
> +	}
> +
> +cleanup:
> +	/* maps, programs, tracepoint filters will auto cleanup on process exit */
> +
> +	return 0;
> +}
> +
> +int main(void)
> +{
> +	FILE *f;
> +
> +	/* start ping in the background to get some kfree_skb events */
> +	f = popen("ping -c5 localhost", "r");
> +	(void) f;
> +
> +	dropmon();
> +	return 0;
> +}
> 


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Michael Kerrisk (man-pages) @ 2015-01-20 12:48 UTC (permalink / raw)
  To: Fam Zheng, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
	Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
	David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
	Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra
In-Reply-To: <1421747878-30744-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Hello Fam Zheng,

I know this API has been through a number of iterations, and there were 
discussions about the design that led to it becoming more complex.
But, let us assume that someone has not seen those discussions,
or forgotten them, or is too lazy to go hunting list archives.

Then: this patch series should somewhere have an explanation of
why the API is what it is, ideally with links to previous relevant
discussions. I see that you do part of that in

    [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait

There are however no links to previous discussions in that mail (I guess
http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591 is most
relevant, nor is there any sort of change log in the commit message 
that explains the evolution of the API. Having those would ease the 
task of reviewers.

Coming back to THIS mail, this man page should also include an
explanation of why the API is what it is. That would include much
of the detail from the 5/6 patch, and probably more info besides.

Some specific points below.

On 01/20/2015 10:57 AM, Fam Zheng wrote:
> This adds a new system call, epoll_mod_wait. It's described as below:
> 
> NAME
>        epoll_mod_wait - modify and wait for I/O events on an epoll file
>                         descriptor
> 
> SYNOPSIS
> 
>        int epoll_mod_wait(int epfd, int flags,
>                           int ncmds, struct epoll_mod_cmd *cmds,
>                           struct epoll_wait_spec *spec);
> 
> DESCRIPTION
> 
>        The epoll_mod_wait() system call can be seen as an enhanced combination
>        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
>        call. It is superior in two cases:
>        
>        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
>        will save context switches between user mode and kernel mode;
>
>        2) When you need higher precision than microsecond for wait timeout.

s/microsecond/millisecond/

> 
>        The epoll_ctl(2) operations are embedded into this call by with ncmds
>        and cmds. The latter is an array of command structs:
> 
>            struct epoll_mod_cmd {
> 
>                   /* Reserved flags for future extension, must be 0 for now. */
>                   int flags;
> 
>                   /* The same as epoll_ctl() op parameter. */
>                   int op;
> 
>                   /* The same as epoll_ctl() fd parameter. */
>                   int fd;
> 
>                   /* The same as the "events" field in struct epoll_event. */
>                   uint32_t events;
> 
>                   /* The same as the "data" field in struct epoll_event. */
>                   uint64_t data;
> 
>                   /* Output field, will be set to the return code once this
>                    * command is executed by kernel */
>                   int error;
>            };
>        
>        There is no guartantee that all the commands are executed in order. Only

s/guartantee/guarantee/

I think the word "all" is not needed in this sentence.

Why is there no guarantee that the commands are run in order?
The order matters if there are operations on the same fd.

>        if all the commands are successfully executed (all the error fields are
>        set to 0), events are polled.

Does the operation execute all commands, or stop when it encounters the first 
error? In other words, when looping over the returned 'error' fields, what
is the termination condition for the user-space application?

(Yes, I know I can trivially inspect the patch 5/6 to answer this question, 
but the man page should explicitly state this so that I don't have to 
read the source, and also because it is only if you explicitly document 
the intended behavior that I can tell whether the actual implementation 
matches the intention.)

>        The last parameter "spec" is a pointer to struct epoll_wait_spec, which
>        contains the information about how to poll the events. If it's NULL, this
>        call will immediately return after running all the commands in cmds.
> 
>        The structure is defined as below:
> 
>            struct epoll_wait_spec {
> 
>                   /* The same as "maxevents" in epoll_pwait() */
>                   int maxevents;
> 
>                   /* The same as "events" in epoll_pwait() */
>                   struct epoll_event *events;
> 
>                   /* Which clock to use for timeout */
>                   int clockid;

Which clocks can be specified here?
CLOCK_MONOTONIC?
CLOCK_REALTIME?
CLOCK_PROCESS_CPUTIME_ID?
clock_getcpuclockid()?
others?

>                   /* Maximum time to wait if there is no event */
>                   struct timespec timeout;

Is this timeout relative or absolute?

>                   /* The same as "sigmask" in epoll_pwait() */
>                   sigset_t *sigmask;

I just want to confirm here that 'sigmask' can be NULL, meaning
that we degenerate to epoll_wait() functionality, right?

>                   /* The same as "sigsetsize" in epoll_pwait() */
>                   size_t sigsetsize;
>            } EPOLL_PACKED;

What is the "EPOLL_PACKED" here for?

> RETURN VALUE
> 
>        When any error occurs, epoll_mod_wait() returns -1 and errno is set
>        appropriately. All the "error" fields in cmds are unchanged before they
>        are executed, and if any cmds are executed, the "error" fields are set
>        to a return code accordingly. See also epoll_ctl for more details of the
>        return code.
> 
>        When successful, epoll_mod_wait() returns the number of file
>        descriptors ready for the requested I/O, or zero if no file descriptor
>        became ready during the requested timeout milliseconds.

s/milliseconds//

> 
>        If spec is NULL, it returns 0 if all the commands are successful, and -1
>        if an error occured.

s/occured/occurred/

> ERRORS
> 
>        These errors apply on either the return value of epoll_mod_wait or error
>        status for each command, respectively.
>
>        EBADF  epfd or fd is not a valid file descriptor.
> 
>        EFAULT The memory area pointed to by events is not accessible with write
>               permissions.
> 
>        EINTR  The call was interrupted by a signal handler before either any of
>               the requested events occurred or the timeout expired; see
>               signal(7).
> 
>        EINVAL epfd is not an epoll file descriptor, or maxevents is less than
>               or equal to zero, or fd is the same as epfd, or the requested
>               operation op is not supported by this interface.

Add: Or 'flags' is nonzero. Or a 'cmds.flags' field is nonzero.

>        EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
>               already registered with this epoll instance.
> 
>        ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
>               with this epoll instance.
> 
>        ENOMEM There was insufficient memory to handle the requested op control
>               operation.
> 
>        ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
>               encountered while trying to register (EPOLL_CTL_ADD) a new file
>               descriptor on an epoll instance.  See epoll(7) for further
>               details.
> 
>        EPERM  The target file fd does not support epoll.
> 
> CONFORMING TO
> 
>        epoll_mod_wait() is Linux-specific.
> 
> SEE ALSO
> 
>        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

Please add sigprocmask(2).

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
From: Michael Kerrisk (man-pages) @ 2015-01-20 12:50 UTC (permalink / raw)
  To: Fam Zheng, linux-kernel
  Cc: mtk.manpages, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
	Peter Zijlstra
In-Reply-To: <1421747878-30744-6-git-send-email-famz@redhat.com>

Hello Fam Zheng,

On 01/20/2015 10:57 AM, Fam Zheng wrote:
> This syscall is a sequence of
> 
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
> 
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications. 

Which applications? Could we have some specific examples? This is a 
complex API, and it needs good justification.

> For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
> 
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
         ^^^^^^^^^^^^^^ should be epoll_mod_wait

I think you mean to say:

    The ability to batch multiple "epoll_ctl" operations into a single call
    means that even when no wait events are requested (i.e., spec == NULL),
    poll_mod_wait() provides a performance optimization over using multiple
    epoll_ctl() calls.

Right? If yes, please amend the commit message, and this text should
also make its way into the revised man page under a heading "NOTES".

> The only complexity is returning the result of each operation.  For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
> 
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait. Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.

The "cmds' are not executed in a specified order plus the need to
initialize the 'errors' fields to a positive value feels a bit ugly.
And indeed the whole "command list was only partially run" case
is not pretty. Am I correct to understand that if an error is found
during execution of one of the "epoll_ctl" commands in 'cmds' then
the system call will return -1 with errno set, indicating an error,
even though the epoll interest list may have changed because some
of the earlier 'cmds' executed successfully? This all seems a bit of
a headache for user space.

I have a couple of questions:

Q1. I can see that batching "epoll_ctl" commands might be useful,
since it results in fewer systems calls. But, does it really
need to be bound together with the "epoll_pwait" functionality?
(Perhaps this point was covered in previous discussions, but
neither the message accompanying this patch nor the 0/6 man page
provide a compelling rationale for the need to bind these two
operations together.)

Yes, I realize you might save a system call, but it makes for a
cumbersome API that has the above headache, and also forces the 
need for double pointer indirection in the 'spec' argument (i.e., 
spec is a pointer to an array of structures where each element
in turn includes an 'events' pointer that points to another array).

Why not a simpler API with two syscalls such as:

epoll_ctl_batch(int epfd, int flags,
                int ncmds, struct epoll_mod_cmd *cmds);

epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
             struct timespec *timeout, int clock_id, 
             const sigset_t *sigmask, size_t sigsetsize);

This gives us much of the benefit of reducing system calls, but 
with greater simplicity. And epoll_ctl_batch() could simply return
the number of 'cmds' that were successfully executed.)

Q2. In the man page in 0/6 you said that the 'cmds' were not 
guaranteed to be executed in order. Why not? If you did provide
such a guarantee, then, when using your current epoll_mod_wait(),
user space could do the following:

1. Initialize the cmd.errors fields to zero.
2. Call epoll_ctl_mod()
3. Iterate through cmd.errors looking for the first nonzero 
   field.

> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision. 

Yes, that change seemed inevitable. It slightly puzzled me at the time when
Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
though pselect() already had demonstrated the need for higher precision.
I should have called it out way back then :-{.

> The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  			      sigmask ? &ksigmask : NULL);
>  }
>  
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;

To make the 'ret' change a little more obvious, maybe it's better to write

			ret = kcmds[i].error = -EINVAL;

> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

Likewise:
		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))

Cosmetic point: s/if(/if (/

> +			return -EFAULT;
> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);

If I understand correctly, the implementation means that the
'size_t sigsetsize' field will probably need to be exposed to 
applications. In the existing epoll_pwait() call (as in  ppoll()
and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
However, unless we expect glibc to do some structure copying to/from
a structure that hides this field, then we're going end up exposing
'size_t sigsetsize' to applications. (This could be avoided, if we
split the API as I suggest above. glibc would do the same thing 
in epoll_pwait1() that it currently does in epoll_pwait().)

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-20 12:54 UTC (permalink / raw)
  To: Daniel Mack, Florian Weimer, David Herrmann, Greg Kroah-Hartman
  Cc: mtk.manpages, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
	Linux API, linux-kernel, Djalal Harouni
In-Reply-To: <54BE1116.5060204@zonque.org>

On 01/20/2015 09:25 AM, Daniel Mack wrote:
> Hi Michael,
> 
> On 01/20/2015 09:09 AM, Michael Kerrisk (man-pages) wrote:
>> On 11/30/2014 06:23 PM, Florian Weimer wrote:
>>> * David Herrmann:
>>>
>>>> On Sun, Nov 30, 2014 at 10:02 AM, Florian Weimer <fw@deneb.enyo.de> wrote:
>>>>> * Greg Kroah-Hartman:
>>>>>
>>>>>> +7.4 Receiving messages
>>>
>>>>> What happens if this is not possible because the file descriptor limit
>>>>> of the processes would be exceeded?  EMFILE, and the message will not
>>>>> be received?
>>>>
>>>> The message is returned without installing the FDs. This is signaled
>>>> by EMFILE, but a valid pool offset.
>>>
>>> Oh.  This is really surprising, so it needs documentation.  But it's
>>> probably better than the alternative (return EMFILE and leave the
>>> message stuck, so that you receive it immediately again—this behavior
>>> makes non-blocking accept rather difficult to use correctly).
>>
>> So, was this point in the end explicitly documented? I not
>> obvious that it is documented in the revised kdbus.txt that
>> Greg K-H sent out 4 days ago.
> 
> No, we've revisited this point and changed the kernel behavior again in
> v3. We're no longer returning -EMFILE in this case, but rather set
> KDBUS_RECV_RETURN_INCOMPLETE_FDS in a new field in the receive ioctl
> struct called 'return_flags'. We believe that's a nicer way of signaling
> specific errors. The message will carry -1 for all FDs that failed to
> get installed, so the user can actually see which one is missing.
> 
> That's also documented in kdbus.txt, but we missed putting it into the
> Changelog - sorry for that.

Thanks for the info, Daniel.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH v3 00/13] Add kdbus implementation
From: Johannes Stezenbach @ 2015-01-20 13:24 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: arnd-r2nGTMty4D4, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	daniel-cYrQPVfZoowdnm+yROfE0A, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <20150120112609.GA17198-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

On Tue, Jan 20, 2015 at 07:26:09PM +0800, Greg Kroah-Hartman wrote:
> On Tue, Jan 20, 2015 at 11:57:12AM +0100, Johannes Stezenbach wrote:
> > 
> > So I did some googling and found in QNX servers create a channel
> > to receive messages, and clients connect to this channel.
> > Multiple clients can connect to the channel.
> 
> Hence, a bus :)
> 
> > But it is not a bus -- no multicast/broadcast, and no name
> > service or policy rules like D-Bus has.  To me it looks
> > to be similar in functionality to UNIX domain sockets.
> 
> It's not as complex as D-Bus, but it's still subscribing to things and
> getting messages.

Apparently you don't read what I write, probably you're not interested
in this discussion anymore...
QNX uses the term "channel" but it does not refer to a bus
or subscription facility, it is more like a socket in listening state.

> > My guess is that the people porting from QNX were just confused
> > and their use of D-Bus was in error.  Maybe they should've used
> > plain sockets, capnproto, ZeroMQ or whatever.
> 
> I tend to trust that they knew what they were doing, they wouldn't have
> picked D-Bus for no good reason.

The automotive developers I had the pleasure to work with would
use anything which is available via a mouse click in the
commercial Embedded Linux SDK IDE of their choice :)
Let's face it: QNX has a single IPC solution while Linux has
a confusing multitude of possibilities.

> > Well, IMHO you got it backwards.  Before adding a complex new IPC
> > API to the kernel you should do the homework and gather some
> > evidence that there will be enough users to justify the addition.
> 
> systemd wants this today for early boot.  It will remove lots of code
> and enable a lot of good things to happen.  The first email in this
> thread describes this quite well, is that not sufficient?

The first mail in this thread doesn't even mention systemd,
instead it has a lot of "marketing" buzzwords.
Of course it is no secret that systemd is the driving force
behind kdbus, but no public record exists to explain why
kdbus was chosen and designed the way it is, what alternatives
were considered and rejected etc.  (or if there is, please send a link)

> > FWIW, my gut feeling was that the earlier attempts to add a new
> > IPC primitve like multicast UNIX domain sockets
> > http://thread.gmane.org/gmane.linux.kernel/1255575/focus=1257999
> > were a much saner approach.  But now I think the comments
> > from this old thread have not been addressed, instead the
> > new approach just made the thing more complex and
> > put in ipc/ instead of net/ to bypass the guards.
> 
> Not at all, the networking maintainers said that that proposal was not
> acceptable to them at all and it should not be done in the networking
> stack at all.  So this was solution was created instead, which provides
> a lot more things than the old networking patches did, which shows that
> the networking developers were right to reject it.

Please read the gmane thread to the end.  It seems there were
several indications that D-Bus can be improved in userspace
using existing kernel facilities.  Havoc Pennington's mail
I quoted in my first response also contains some hints
about it.  I have no idea if any of this has ever been
pursued.  While adding complexity to critical net/ code paths
is probematic and a good reason to reject it, this was not
the only reason, the major one being "not neccessary".

Thanks,
Johannes

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Michael Kerrisk (man-pages) @ 2015-01-20 13:53 UTC (permalink / raw)
  To: Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	daniel-cYrQPVfZooxQFI55V6+gNQ, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A, Daniel Mack, Johannes Stezenbach
In-Reply-To: <1421435777-25306-2-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

On 01/16/2015 08:16 PM, Greg Kroah-Hartman wrote:
> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> 
> kdbus is a system for low-latency, low-overhead, easy to use
> interprocess communication (IPC).
> 
> The interface to all functions in this driver is implemented via ioctls
> on files exposed through a filesystem called 'kdbusfs'. The default
> mount point of kdbusfs is /sys/fs/kdbus. This patch adds detailed
> documentation about the kernel level API design.

I have some details feedback on the contents of this file, and some 
bigger questions. I'll split them out into separate mails.

So here, the bigger, general questions to start with. I've arrived late 
to this, so sorry if they've already been discussed, but the answers to 
some of the questions should actually be in this file, I would have 
expected.

This is an enormous and complex API. Why is the API ioctl() based,
rather than system-call-based? Have we learned nothing from the hydra
that the futex() multiplexing syscall became? (And kdbus is an order
of magnitude more complex, by the look of things.) At the very least,
a *good* justification of why the API is ioctl()-based should be part
of this documentation file.

An observation: The documentation below is substantial, but this API is 
enormous, so the documentation still feels rather thin. What would 
really help would be some example code in the doc. 

And on the subject of code examples... Are there any (prototype) 
working user-space applications that exercise the current kdbus 
implementation? That is, can I install these kdbus patches, and
then find a simple example application somewhere that does
something to exercise kdbus?

And then: is there any substantial real-world application (e.g., a 
full D-Bus port) that is being developed in tandem with this kernel
side patch? (I don't mean a user-space library; I mean a seriously
large application.) This is an incredibly complex API whose
failings are only going to become evident through real-world use.
Solidifying an API in the kernel and then discovering the API
problems later when writing real-world applications would make for
a sad story. A story something like that of inotify, an API which 
is an order of magnitude less complex than kdbus. (I can't help but
feel that many of inotify problems that I discuss at 
https://lwn.net/Articles/605128/ might have been fixed or mitigated 
if a few real-world applications had been implemented before the
API  was set in stone.)

> +For a kdbus specific userspace library implementation please refer to:
> +  http://cgit.freedesktop.org/systemd/systemd/tree/src/systemd/sd-bus.h

Is this library intended just for systemd? More generally, is there an 
intention to provide a general purpose library API for kdbus? Or is the
intention that each application will roll a library suitable to its
needs? I think an answer to that question would be useful in this 
Documentation file.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox