Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH] fanotify: Make wait for permission events interruptible
From: Jan Kara @ 2019-04-15  9:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Orion Poplawski, Vivek Trivedi, Amir Goldstein, linux-api, LKML,
	Eric W. Biederman, Jan Kara
In-Reply-To: <20190321151142.17104-1-jack@suse.cz>

On Thu 21-03-19 16:11:42, Jan Kara wrote:
> Switch waiting for response to fanotify permission events interruptible.
> This allows e.g. the system to be suspended while there are some
> fanotify permission events pending (which is reportedly pretty common
> when for example AV solution is in use). However just making the wait
> interruptible can result in e.g. open(2) returning -EINTR where
> previously such error code never happened in practice. To avoid
> confusion of userspace due to this error code, return -ERESTARTNOINTR
> instead.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/notify/fanotify/fanotify.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> Orion, can you give this patch some testing with your usecase? Also if anybody
> sees any issue with returning -ERESTARTNOINTR I have missed, please speak up.

Ping Orion? Did you have any chance to give this patch a try? Does it fix
hibernation issues you observe without causing issues with bash and other
programs? I'd like to queue this patch for the coming merge window but
I'd like to see some testing results showing that it actually helps
anything... Thanks!

								Honza

> 
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 6b9c27548997..eb790853844b 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -92,10 +92,17 @@ static int fanotify_get_response(struct fsnotify_group *group,
>  
>  	pr_debug("%s: group=%p event=%p\n", __func__, group, event);
>  
> -	ret = wait_event_killable(group->fanotify_data.access_waitq,
> -				  event->state == FAN_EVENT_ANSWERED);
> +	ret = wait_event_interruptible(group->fanotify_data.access_waitq,
> +				       event->state == FAN_EVENT_ANSWERED);
>  	/* Signal pending? */
>  	if (ret < 0) {
> +		/*
> +		 * Force restarting a syscall so that this is mostly invisible
> +		 * for userspace which is not prepared for handling EINTR e.g.
> +		 * from open(2).
> +		 */
> +		if (ret == -ERESTARTSYS)
> +			ret = -ERESTARTNOINTR;
>  		spin_lock(&group->notification_lock);
>  		/* Event reported to userspace and no answer yet? */
>  		if (event->state == FAN_EVENT_REPORTED) {
> -- 
> 2.16.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: bug#35275: spelling mistake manual pages
From: Paul Eggert @ 2019-04-15  5:34 UTC (permalink / raw)
  To: Vikas Talan, 35275-done, mtk.manpages, majordomo, linux-kernel,
	linux-kernel-announce, linux-api, netdev, libc-alpha-subscribe,
	libc-help-subscribe, manpages, listar, linux-man
In-Reply-To: <CABCHxwFMp+vFpv8V-c0=CeXnNmymEuvm_7adGnOnUfCUH7C5+w@mail.gmail.com>

That report appears to be about the glibc manual. Please report glibc bugs as 
described here:

https://sourceware.org/glibc/wiki/FilingBugs

Also, please use plain text rather than images to describe bugs in a text file.

^ permalink raw reply

* [PATCH 4/4] samples: show race-free pidfd metadata access
From: Christian Brauner @ 2019-04-14 20:14 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	oleg, cyphar, joel, dancol, Christian Brauner
In-Reply-To: <20190414201436.19502-1-christian@brauner.io>

This is a sample program showing userspace how to get race-free access to
process metadata from a pidfd. It is rather easy to do and userspace can
actually simply reuse code that currently parses a process's status file in
procfs.
The program can easily be extended into a generic helper suitable for
inclusion in a libc to make it even easier for userspace to gain metadata
access.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 samples/Makefile               |   2 +-
 samples/pidfd/Makefile         |   6 ++
 samples/pidfd/pidfd-metadata.c | 172 +++++++++++++++++++++++++++++++++
 3 files changed, 179 insertions(+), 1 deletion(-)
 create mode 100644 samples/pidfd/Makefile
 create mode 100644 samples/pidfd/pidfd-metadata.c

diff --git a/samples/Makefile b/samples/Makefile
index b1142a958811..fadadb1c3b05 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -3,4 +3,4 @@
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ trace_events/ livepatch/ \
 			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
 			   configfs/ connector/ v4l/ trace_printk/ \
-			   vfio-mdev/ statx/ qmi/ binderfs/
+			   vfio-mdev/ statx/ qmi/ binderfs/ pidfd/
diff --git a/samples/pidfd/Makefile b/samples/pidfd/Makefile
new file mode 100644
index 000000000000..0ff97784177a
--- /dev/null
+++ b/samples/pidfd/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+hostprogs-y := pidfd-metadata
+always := $(hostprogs-y)
+HOSTCFLAGS_pidfd-metadata.o += -I$(objtree)/usr/include
+all: pidfd-metadata
diff --git a/samples/pidfd/pidfd-metadata.c b/samples/pidfd/pidfd-metadata.c
new file mode 100644
index 000000000000..23a44e582ccb
--- /dev/null
+++ b/samples/pidfd/pidfd-metadata.c
@@ -0,0 +1,172 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#ifndef CLONE_PIDFD
+#define CLONE_PIDFD 0x00001000
+#endif
+
+static int raw_clone_pidfd(void)
+{
+	unsigned long flags = CLONE_PIDFD | SIGCHLD;
+
+#if defined(__s390x__) || defined(__s390__) || defined(__CRIS__)
+	/*
+	 * On s390/s390x and cris the order of the first and second arguments
+	 * of the system call is reversed.
+	 */
+	return (int)syscall(__NR_clone, NULL, flags);
+#elif defined(__sparc__) && defined(__arch64__)
+	{
+		/*
+		 * sparc64 always returns the other process id in %o0, and a
+		 * boolean flag whether this is the child or the parent in %o1.
+		 * Inline assembly is needed to get the flag returned in %o1.
+		 */
+		int child_pid, in_child;
+
+		asm volatile("mov %2, %%g1\n\t"
+			     "mov %3, %%o0\n\t"
+			     "mov 0 , %%o1\n\t"
+			     "t 0x6d\n\t"
+			     "mov %%o1, %0\n\t"
+			     "mov %%o0, %1"
+			     : "=r"(in_child), "=r"(child_pid)
+			     : "i"(__NR_clone), "r"(flags)
+			     : "%o1", "%o0", "%g1");
+
+		if (in_child)
+			return 0;
+		else
+			return child_pid;
+	}
+#elif defined(__ia64__)
+	/* On ia64 stack and stack size are passed as separate arguments. */
+	return (int)syscall(__NR_clone, flags, NULL, 0UL);
+#else
+	return (int)syscall(__NR_clone, flags, NULL);
+#endif
+}
+
+static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
+					unsigned int flags)
+{
+	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
+}
+
+static int pidfd_metadata_fd(int pidfd)
+{
+	int procfd, ret;
+	char path[100];
+	FILE *f;
+	size_t n = 0;
+	char *line = NULL;
+
+	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
+
+	f = fopen(path, "re");
+	if (!f)
+		return -1;
+
+	ret = 0;
+	while (getline(&line, &n, f) != -1) {
+		char *numstr;
+		size_t len;
+
+		if (strncmp(line, "Pid:\t", 5))
+			continue;
+
+		numstr = line + 5;
+		len = strlen(numstr);
+		if (len > 0 && numstr[len - 1] == '\n')
+			numstr[len - 1] = '\0';
+		ret = snprintf(path, sizeof(path), "/proc/%s", numstr);
+		break;
+	}
+	free(line);
+	fclose(f);
+
+	if (!ret) {
+		errno = ENOENT;
+		warn("Failed to parse pid from fdinfo\n");
+		return -1;
+	}
+
+	procfd = open(path, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
+	if (procfd < 0) {
+		warn("Failed to open %s\n", path);
+		return -1;
+	}
+
+	/*
+	 * Verify that the pid has not been recycled and our /proc/<pid> handle
+	 * is still valid.
+	 */
+	ret = sys_pidfd_send_signal(pidfd, 0, NULL, 0);
+	if (ret < 0) {
+		switch (errno) {
+		case EPERM:
+			/* Process exists, just not allowed to signal it. */
+			break;
+		default:
+			warn("Failed to signal process\n");
+			close(procfd);
+			procfd = -1;
+		}
+	}
+
+	return procfd;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = EXIT_FAILURE;
+	char buf[4096] = { 0 };
+	int pidfd, procfd, statusfd;
+	ssize_t bytes;
+
+	pidfd = raw_clone_pidfd();
+	if (pidfd < 0)
+		exit(ret);
+
+	if (pidfd == 0) {
+		printf("%d\n", getpid());
+		exit(EXIT_SUCCESS);
+	}
+
+	procfd = pidfd_metadata_fd(pidfd);
+	close(pidfd);
+	if (procfd < 0)
+		goto out;
+
+	statusfd = openat(procfd, "status", O_RDONLY | O_CLOEXEC);
+	close(procfd);
+	if (statusfd < 0)
+		goto out;
+
+	bytes = read(statusfd, buf, sizeof(buf));
+	if (bytes > 0)
+		bytes = write(STDOUT_FILENO, buf, bytes);
+	close(statusfd);
+	ret = EXIT_SUCCESS;
+
+out:
+	(void)wait(NULL);
+
+	exit(ret);
+}
-- 
2.21.0

^ permalink raw reply related

* [PATCH 3/4] signal: support CLONE_PIDFD with pidfd_send_signal
From: Christian Brauner @ 2019-04-14 20:14 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	oleg, cyphar, joel, dancol, Christian Brauner
In-Reply-To: <20190414201436.19502-1-christian@brauner.io>

Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
patch pidfd_send_signal() becomes independent of procfs. This fullfils the
request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to be
used by users without procfs mounted or even users without procfs support
compiled into the kernel.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 kernel/signal.c | 14 ++++++++++----
 kernel/sys_ni.c |  3 ---
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f98448cf2def..cd83cc376767 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3513,7 +3513,6 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
 	return kill_something_info(sig, &info, pid);
 }
 
-#ifdef CONFIG_PROC_FS
 /*
  * Verify that the signaler and signalee either are in the same pid namespace
  * or that the signaler's pid namespace is an ancestor of the signalee's pid
@@ -3550,6 +3549,14 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
 	return copy_siginfo_from_user(kinfo, info);
 }
 
+static struct pid *pidfd_to_pid(const struct file *file)
+{
+	if (file->f_op == &pidfd_fops)
+		return file->private_data;
+
+	return tgid_pidfd_to_pid(file);
+}
+
 /**
  * sys_pidfd_send_signal - send a signal to a process through a task file
  *                          descriptor
@@ -3581,12 +3588,12 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	if (flags)
 		return -EINVAL;
 
-	f = fdget_raw(pidfd);
+	f = fdget(pidfd);
 	if (!f.file)
 		return -EBADF;
 
 	/* Is this a pidfd? */
-	pid = tgid_pidfd_to_pid(f.file);
+	pid = pidfd_to_pid(f.file);
 	if (IS_ERR(pid)) {
 		ret = PTR_ERR(pid);
 		goto err;
@@ -3620,7 +3627,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	fdput(f);
 	return ret;
 }
-#endif /* CONFIG_PROC_FS */
 
 static int
 do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d21f4befaea4..4d9ae5ea6caf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -167,9 +167,6 @@ COND_SYSCALL(syslog);
 
 /* kernel/sched/core.c */
 
-/* kernel/signal.c */
-COND_SYSCALL(pidfd_send_signal);
-
 /* kernel/sys.c */
 COND_SYSCALL(setregid);
 COND_SYSCALL(setgid);
-- 
2.21.0

^ permalink raw reply related

* [PATCH 2/4] clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-14 20:14 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	oleg, cyphar, joel, dancol, Christian Brauner
In-Reply-To: <20190414201436.19502-1-christian@brauner.io>

As discussed this patchset makes it possible to retrieve pid file
descriptors at process creation time by introducing the new flag
CLONE_PIDFD to the clone() system call. Linus originally suggested to
implement this as a new flag to clone() instead of making it a separate
system call. As spotted by Linus, there is exactly one bit for clone()
left.

CLONE_PIDFD returns file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api. They serve as a simple opaque handle on pids. Logically, this
makes it possible to interpret a pidfd differently, narrowing or widening
the scope of various operations (e.g. signal sending). Thus, a pidfd cannot
just refer to a tgid, but also a tid, or in theory - given appropriate flag
arguments in relevant syscalls - a process group or session. A pidfd does
not represent a privilege. This does not imply it cannot ever be that way
but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs. The fdinfo file contains the pid of the process in the callers pid
namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

To remove worries about missing metadata access this patchset comes with a
sample program that illustrates how a combination of CLONE_PIDFD, fdinfo,
and pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be translated
into a helper that would be suitable for inclusion in libc so that users
don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/pid.h        |   2 +
 include/uapi/linux/sched.h |   1 +
 kernel/fork.c              | 117 +++++++++++++++++++++++++++++++++++--
 3 files changed, 115 insertions(+), 5 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
 
 extern struct pid init_struct_pid;
 
+extern const struct file_operations pidfd_fops;
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..06fa224d2c48 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
 #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
 #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
 #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD	0x00001000	/* set if a pidfd instead of a pid should be returned */
 #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
 #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
 #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..4825d5205604 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
  * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
  */
 
+#include <linux/anon_inodes.h>
 #include <linux/slab.h>
 #include <linux/sched/autogroup.h>
 #include <linux/sched/mm.h>
@@ -21,8 +22,10 @@
 #include <linux/sched/task.h>
 #include <linux/sched/task_stack.h>
 #include <linux/sched/cputime.h>
+#include <linux/seq_file.h>
 #include <linux/rtmutex.h>
 #include <linux/init.h>
+#include <linux/fsnotify.h>
 #include <linux/unistd.h>
 #include <linux/module.h>
 #include <linux/vmalloc.h>
@@ -1662,6 +1665,87 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_RCU */
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	file->private_data = NULL;
+	put_pid(pid);
+	return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
+	struct pid *pid = f->private_data;
+
+	seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+	seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+	.release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid:  struct pid that the pidfd will reference
+ * @file: struct file referencing @pid to return to caller
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note, that this function can only be called after the fd table has
+ * potentially been shared to avoid leaking the pidfd to the new process.
+ *
+ * File descriptor numbering for pidfds starts at 1. This allows users
+ * of CLONE_PIDFD to perform the same checks as for pids, i.e.:
+ * pid/pidfd <  0: error
+ * pid/pidfd == 0: child
+ * pid/pidfd >  0: parent
+ *
+ * Return: On success, a cloexec pidfd ready to be installed through
+ *         fd_install() will be returned. The corresponding file will be
+ *         returned through @file.
+ *         On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid, struct file **file)
+{
+	unsigned int flags = O_RDWR | O_CLOEXEC;
+	int error, fd;
+	struct file *f;
+
+	error = __alloc_fd(current->files, 1, rlimit(RLIMIT_NOFILE), flags);
+	if (error < 0)
+		return error;
+	fd = error;
+
+	f = anon_inode_getfile("pidfd", &pidfd_fops, get_pid(pid), flags);
+	if (IS_ERR(f)) {
+		put_pid(pid);
+		error = PTR_ERR(f);
+		goto err_put_unused_fd;
+	}
+
+	*file = f;
+	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+	return error;
+}
+
+static inline void pidfd_put(int fd, struct file *file)
+{
+	put_unused_fd(fd);
+	fput(file);
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -1678,11 +1762,12 @@ static __latent_entropy struct task_struct *copy_process(
 					struct pid *pid,
 					int trace,
 					unsigned long tls,
-					int node)
+					int node, int *pidfd)
 {
 	int retval;
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
+	struct file *pidfdf = NULL;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1936,6 +2021,18 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
+	/*
+	 * This has to happen after we've potentially unshared the file
+	 * descriptor table (so that the pidfd doesn't leak into the child if
+	 * the fd table isn't shared).
+	 */
+	if (clone_flags & CLONE_PIDFD) {
+		retval = pidfd_create(pid, &pidfdf);
+		if (retval < 0)
+			goto bad_fork_free_pid;
+		*pidfd = retval;
+	}
+
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
@@ -1996,7 +2093,7 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	retval = cgroup_can_fork(p);
 	if (retval)
-		goto bad_fork_free_pid;
+		goto bad_fork_put_pidfd;
 
 	/*
 	 * From this point on we must avoid any synchronous user-space
@@ -2097,6 +2194,9 @@ static __latent_entropy struct task_struct *copy_process(
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+	if (clone_flags & CLONE_PIDFD)
+		fd_install(*pidfd, pidfdf);
+
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
 	cgroup_threadgroup_change_end(current);
@@ -2111,6 +2211,9 @@ static __latent_entropy struct task_struct *copy_process(
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+	if (clone_flags & CLONE_PIDFD)
+		pidfd_put(*pidfd, pidfdf);
 bad_fork_free_pid:
 	cgroup_threadgroup_change_end(current);
 	if (pid != &init_struct_pid)
@@ -2177,7 +2280,7 @@ struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
 	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
-			    cpu_to_node(cpu));
+			    cpu_to_node(cpu), NULL);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
 		init_idle(task, cpu);
@@ -2202,7 +2305,7 @@ long _do_fork(unsigned long clone_flags,
 	struct completion vfork;
 	struct pid *pid;
 	struct task_struct *p;
-	int trace = 0;
+	int pidfd, trace = 0;
 	long nr;
 
 	/*
@@ -2224,7 +2327,7 @@ long _do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE, &pidfd);
 	add_latent_entropy();
 
 	if (IS_ERR(p))
@@ -2260,6 +2363,10 @@ long _do_fork(unsigned long clone_flags,
 	}
 
 	put_pid(pid);
+
+	if (clone_flags & CLONE_PIDFD)
+		nr = pidfd;
+
 	return nr;
 }
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH 1/4] Make anon_inodes unconditional
From: Christian Brauner @ 2019-04-14 20:14 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	oleg, cyphar, joel, dancol, Christian Brauner
In-Reply-To: <20190414201436.19502-1-christian@brauner.io>

From: David Howells <dhowells@redhat.com>

Make the anon_inodes facility unconditional so that it can be used by core
VFS code and pidfd code.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[christian@brauner.io: adapt commit message to mention pidfds]
Signed-off-by: Christian Brauner <christian@brauner.io>
---
 arch/arm/kvm/Kconfig       |  1 -
 arch/arm64/kvm/Kconfig     |  1 -
 arch/mips/kvm/Kconfig      |  1 -
 arch/powerpc/kvm/Kconfig   |  1 -
 arch/s390/kvm/Kconfig      |  1 -
 arch/x86/Kconfig           |  1 -
 arch/x86/kvm/Kconfig       |  1 -
 drivers/base/Kconfig       |  1 -
 drivers/char/tpm/Kconfig   |  1 -
 drivers/dma-buf/Kconfig    |  1 -
 drivers/gpio/Kconfig       |  1 -
 drivers/iio/Kconfig        |  1 -
 drivers/infiniband/Kconfig |  1 -
 drivers/vfio/Kconfig       |  1 -
 fs/Makefile                |  2 +-
 fs/notify/fanotify/Kconfig |  1 -
 fs/notify/inotify/Kconfig  |  1 -
 init/Kconfig               | 10 ----------
 18 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index 3f5320f46de2..f591026347a5 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	bool "Kernel-based Virtual Machine (KVM) support"
 	depends on MMU && OF
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select ARM_GIC
 	select ARM_GIC_V3
 	select ARM_GIC_V3_ITS
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a3f85624313e..a67121d419a2 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
 	depends on OF
 	select MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_ARCH_TLB_FLUSH_ALL
 	select KVM_MMIO
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index 4528bc9c3cb1..eac25aef21e0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
 	depends on MIPS_FP_SUPPORT
 	select EXPORT_UASM
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_MMIO
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index bfdde04e4905..f53997a8ca62 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
 config KVM
 	bool
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select SRCU
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 767453faacfc..1816ee48eadd 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
 	prompt "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select HAVE_KVM_EVENTFD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ad92419be19..7a70fb58b2d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,7 +44,6 @@ config X86
 	#
 	select ACPI_LEGACY_TABLES_LOOKUP	if ACPI
 	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
-	select ANON_INODES
 	select ARCH_32BIT_OFF_T			if X86_32
 	select ARCH_CLOCKSOURCE_DATA
 	select ARCH_CLOCKSOURCE_INIT
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..fc042419e670 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,7 +27,6 @@ config KVM
 	depends on X86_LOCAL_APIC
 	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
-	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQFD
 	select IRQ_BYPASS_MANAGER
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 059700ea3521..03f067da12ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,7 +174,6 @@ source "drivers/base/regmap/Kconfig"
 config DMA_SHARED_BUFFER
 	bool
 	default n
-	select ANON_INODES
 	select IRQ_WORK
 	help
 	  This option enables the framework for buffer-sharing between
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 536e55d3919f..f3e4bc490cf0 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -157,7 +157,6 @@ config TCG_CRB
 config TCG_VTPM_PROXY
 	tristate "VTPM Proxy Interface"
 	depends on TCG_TPM
-	select ANON_INODES
 	---help---
 	  This driver proxies for an emulated TPM (vTPM) running in userspace.
 	  A device /dev/vtpmx is provided that creates a device pair
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index 2e5a0faa2cb1..3fc9c2efc583 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -3,7 +3,6 @@ menu "DMABUF options"
 config SYNC_FILE
 	bool "Explicit Synchronization Framework"
 	default n
-	select ANON_INODES
 	select DMA_SHARED_BUFFER
 	---help---
 	  The Sync File Framework adds explicit syncronization via
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 3f50526a771f..0f91600c27ae 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -12,7 +12,6 @@ config ARCH_HAVE_CUSTOM_GPIO_H
 
 menuconfig GPIOLIB
 	bool "GPIO Support"
-	select ANON_INODES
 	help
 	  This enables GPIO support through the generic GPIO library.
 	  You only need to enable this, if you also want to enable
diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
index d08aeb41cd07..1dec0fecb6ef 100644
--- a/drivers/iio/Kconfig
+++ b/drivers/iio/Kconfig
@@ -4,7 +4,6 @@
 
 menuconfig IIO
 	tristate "Industrial I/O support"
-	select ANON_INODES
 	help
 	  The industrial I/O subsystem provides a unified framework for
 	  drivers for many different types of embedded sensors using a
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..d318bab25860 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -25,7 +25,6 @@ config INFINIBAND_USER_MAD
 
 config INFINIBAND_USER_ACCESS
 	tristate "InfiniBand userspace access (verbs and CM)"
-	select ANON_INODES
 	depends on MMU
 	---help---
 	  Userspace InfiniBand access support.  This enables the
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9de5ed38da83..3798d77d131c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,6 @@ menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
-	select ANON_INODES
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..35945f8139e6 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o
 
 obj-y				+= notify/
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
-obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
+obj-y				+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 735bfb2e9190..521dc91d2cb5 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -1,7 +1,6 @@
 config FANOTIFY
 	bool "Filesystem wide access notification"
 	select FSNOTIFY
-	select ANON_INODES
 	select EXPORTFS
 	default n
 	---help---
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index b981fc0c8379..0161c74e76e2 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,6 +1,5 @@
 config INOTIFY_USER
 	bool "Inotify support for userspace"
-	select ANON_INODES
 	select FSNOTIFY
 	default y
 	---help---
diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..be8f97e37a76 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1171,9 +1171,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
 config SYSCTL
 	bool
 
-config ANON_INODES
-	bool
-
 config HAVE_UID16
 	bool
 
@@ -1378,14 +1375,12 @@ config HAVE_FUTEX_CMPXCHG
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y
-	select ANON_INODES
 	help
 	  Disabling this option will cause the kernel to be built without
 	  support for epoll family of system calls.
 
 config SIGNALFD
 	bool "Enable signalfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the signalfd() system call that allows to receive signals
@@ -1395,7 +1390,6 @@ config SIGNALFD
 
 config TIMERFD
 	bool "Enable timerfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the timerfd() system call that allows to receive timer
@@ -1405,7 +1399,6 @@ config TIMERFD
 
 config EVENTFD
 	bool "Enable eventfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the eventfd() system call that allows to receive both
@@ -1516,7 +1509,6 @@ config KALLSYMS_BASE_RELATIVE
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call"
-	select ANON_INODES
 	select BPF
 	select IRQ_WORK
 	default n
@@ -1533,7 +1525,6 @@ config BPF_JIT_ALWAYS_ON
 
 config USERFAULTFD
 	bool "Enable userfaultfd() system call"
-	select ANON_INODES
 	depends on MMU
 	help
 	  Enable the userfaultfd() system call that allows to intercept and
@@ -1600,7 +1591,6 @@ config PERF_EVENTS
 	bool "Kernel performance events and counters"
 	default y if PROFILING
 	depends on HAVE_PERF_EVENTS
-	select ANON_INODES
 	select IRQ_WORK
 	select SRCU
 	help
-- 
2.21.0

^ permalink raw reply related

* [PATCH 0/4] clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-14 20:14 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	oleg, cyphar, joel, dancol, Christian Brauner

Hey Linus,

This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call as previously discussed.

As decided last week [1] Jann and I have refined the implementation of
pidfds as anonymous inodes. Based on last weeks RFC we have only tweaked
documentation and naming, as well as making the sample program how to
get easy metadata access from a pidfd a little cleaner and more paranoid
when checking for errors.
The sample program can also serve as a test for the patchset.

When clone is called with CLONE_PIDFD a pidfd instead of a pid will be
returned. To make it possible for users of CLONE_PIDFD to apply standard
error checking that is common all across userspace, file descriptor
numbering for pidfds starts at 1 and not 0. This has the major advantage
that users can do:

int pidfd = clone(CLONE_PIDFD);
if (pidfd < 0) {
       /* handle error */
       exit(EXIT_FAILURE):
}

if (pidfd == 0) {
       /* child */
       exit(EXIT_SUCCESS);
}

/* parent */
exit(EXIT_SUCCESS);

We have also taken care that pidfds are created *after* the fd table has
been unshared to not leak pidfds into child processes.
pidfd creation during clone is split into two distinct steps:
1. preparing both an fd and a file referencing struct pid for fd_install()
2. fd_install()ing the pidfd
Step 1. is performed before clone's point of no return and especially
before write_lock_irq(&tasklist_lock) is taken.
Performing 1. before clone's point of no return ensures that we don't
need to fail a process that is already visible to userspace when pidfd
creation fails. Step 2. happens after attach_pid() is performed and the
process is visible to userspace.
Technically, we could have also performed step 1. and 2. together before
clone's point of no return and then calling close on the file descriptor
on failure. This would slightly increase code-locality but it is
semantically more correct and clean to bring the pidfd into existence
once the process is fully attached and not before.

The actual code for CLONE_PIDFD in patch 2 is completely confined to
fork.c (apart from the CLONE_PIDFD definition of course) and is rather
small and hopefully good to review. 

The additional changes listed under David's name in the diffstat below
are here to make anon_inodes available unconditionally. They are needed
for the new mount api and thus for core vfs code in addition to pidfds.
David knows this and he has informed Al that this patch is sent out
here. The changes themselves are rather automatic.

As promised I have also contacted Joel who has sent a patchset to make
pidfds pollable. He has been informed and is happy to port his patchset
once we have moved forward [2].
Jann and I currently plan to target this patchset for inclusion in the 5.2
merge window.

Thanks!
Jann & Christian

[1]: https://lore.kernel.org/lkml/CAHk-=wifyY+XGNW=ZC4MyTHD14w81F8JjQNH-GaGAm2RxZ_S8Q@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/20190411200059.GA75190@google.com/

Christian Brauner (3):
  clone: add CLONE_PIDFD
  signal: support CLONE_PIDFD with pidfd_send_signal
  samples: show race-free pidfd metadata access

David Howells (1):
  Make anon_inodes unconditional

 arch/arm/kvm/Kconfig           |   1 -
 arch/arm64/kvm/Kconfig         |   1 -
 arch/mips/kvm/Kconfig          |   1 -
 arch/powerpc/kvm/Kconfig       |   1 -
 arch/s390/kvm/Kconfig          |   1 -
 arch/x86/Kconfig               |   1 -
 arch/x86/kvm/Kconfig           |   1 -
 drivers/base/Kconfig           |   1 -
 drivers/char/tpm/Kconfig       |   1 -
 drivers/dma-buf/Kconfig        |   1 -
 drivers/gpio/Kconfig           |   1 -
 drivers/iio/Kconfig            |   1 -
 drivers/infiniband/Kconfig     |   1 -
 drivers/vfio/Kconfig           |   1 -
 fs/Makefile                    |   2 +-
 fs/notify/fanotify/Kconfig     |   1 -
 fs/notify/inotify/Kconfig      |   1 -
 include/linux/pid.h            |   2 +
 include/uapi/linux/sched.h     |   1 +
 init/Kconfig                   |  10 --
 kernel/fork.c                  | 117 +++++++++++++++++++++-
 kernel/signal.c                |  14 ++-
 kernel/sys_ni.c                |   3 -
 samples/Makefile               |   2 +-
 samples/pidfd/Makefile         |   6 ++
 samples/pidfd/pidfd-metadata.c | 172 +++++++++++++++++++++++++++++++++
 26 files changed, 305 insertions(+), 40 deletions(-)
 create mode 100644 samples/pidfd/Makefile
 create mode 100644 samples/pidfd/pidfd-metadata.c

-- 
2.21.0

^ permalink raw reply

* Re: [PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
From: Amir Goldstein @ 2019-04-14  7:10 UTC (permalink / raw)
  To: Shawn Landden; +Cc: linux-api, linux-fsdevel, linux-kernel, Darrick J. Wong
In-Reply-To: <20190414010254.GA7408@magnolia>

On Sun, Apr 14, 2019 at 4:04 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Sat, Apr 13, 2019 at 03:54:39PM -0500, Shawn Landden wrote:
>
> /me pulls out his close-reading glasses and the copy_file_range manpage...
>
> > If flags includes COPY_FILE_RANGE_FILESIZE then the length
> > copied is the length of the file. off_in and off_out are
> > ignored.  len must be 0 or the file size.
>
> They're ignored?  As in the copy operation reads the number of bytes in
> the file referenced by fd_in from fd_in at its current position and is
> writes that out to fd_out at its current position?  I don't see why I
> would want such an operation...
>
> ...but I can see how people could make use of a CFR_ENTIRE_FILE that
> would check that both file descriptors are actually regular files, and
> if so copy the entire contents of the fd_in file into the same position
> in the fd_out file, and then set the fd_out file's length to match.  If
> @off_in or @off_out are non-NULL then they'll be updated to the new EOFs
> if the copy completes succesfully and @len can be anything.
>

IDGI. In what way would that be helpful?
Would the syscall fail if it cannot copy entire file (like clone_file_range)
or return bytes copied?
If latter, then user will have to call syscall again until getting 0
return value.
User can already call copy_file_range with len=SSIZE_MAX and get almost
the same thing.
Unless the idea is to optimize for less syscalls for copying very large files??
In that case, MAX_RW_COUNT limit for this syscall would need to be relaxed.

While on the subject, something that has been discussed in the past is that
copy_file_range() and sendfile() of a large file are not killable, so that is
that should be fixed, especially if the interface is going to be used to copy
more data in-kernel.

IOW, the motivation of the patch is not clear to me:

> This implementation saves a call to stat() in the common case

What is the real life workload where this micro optimization would
have any affect?

> It does not fix any race conditions, but that is possible in the future
> with this interface.

Then please present a plan or an implementation of how that interface
can solve race conditions and if that is the only motivation for the
interface than I do not see why we should merge the interface before
the implementation.

Please let me know if I am missing something.

Thanks,
Amir.

^ permalink raw reply

* Re: [PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
From: Darrick J. Wong @ 2019-04-14  1:02 UTC (permalink / raw)
  To: Shawn Landden; +Cc: linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190413205439.9623-1-shawn@git.icu>

On Sat, Apr 13, 2019 at 03:54:39PM -0500, Shawn Landden wrote:

/me pulls out his close-reading glasses and the copy_file_range manpage...

> If flags includes COPY_FILE_RANGE_FILESIZE then the length
> copied is the length of the file. off_in and off_out are
> ignored.  len must be 0 or the file size.

They're ignored?  As in the copy operation reads the number of bytes in
the file referenced by fd_in from fd_in at its current position and is
writes that out to fd_out at its current position?  I don't see why I
would want such an operation...

...but I can see how people could make use of a CFR_ENTIRE_FILE that
would check that both file descriptors are actually regular files, and
if so copy the entire contents of the fd_in file into the same position
in the fd_out file, and then set the fd_out file's length to match.  If
@off_in or @off_out are non-NULL then they'll be updated to the new EOFs
if the copy completes succesfully and @len can be anything.

Also: please update the manual page and the xfstests regression test for
this syscall.

> This implementation saves a call to stat() in the common case
> of copying files. It does not fix any race conditions, but that
> is possible in the future with this interface.
> 
> EAGAIN: If COPY_FILE_RANGE_FILESIZE was passed and len is not 0
> or the file size.

The values are invalid, so why would we tell userspace to try again
instead of the EINVAL that we usually use?

> Signed-off-by: Shawn Landden <shawn@git.icu>
> CC: <linux-api@vger.kernel.org>
> ---
>  fs/read_write.c           | 14 +++++++++++++-
>  include/uapi/linux/stat.h |  4 ++++
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 61b43ad7608e..6d06361f0856 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>  	struct inode *inode_out = file_inode(file_out);
>  	ssize_t ret;
>  
> -	if (flags != 0)
> +	if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0)

FWIW you might as well shorten the prefix to "CFR_" since nobody else is
using it.

--D

>  		return -EINVAL;
>  
>  	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> @@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>  	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
>  		return -EINVAL;
>  
> +	if (flags & COPY_FILE_RANGE_FILESIZE) {
> +		struct kstat stat;
> +		int error;
> +		error = vfs_getattr(&file_in->f_path, &stat,
> +				    STATX_SIZE, 0);
> +		if (error < 0)
> +			return error;
> +		if (!(len == 0 || len == stat.size))
> +			return -EAGAIN;
> +		len = stat.size;
> +	}
> +
>  	ret = rw_verify_area(READ, file_in, &pos_in, len);
>  	if (unlikely(ret))
>  		return ret;
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7b35e98d3c58..1075aa4666ef 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -170,5 +170,9 @@ struct statx {
>  
>  #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
>  
> +/*
> + * Flags for copy_file_range()
> + */
> +#define COPY_FILE_RANGE_FILESIZE	0x00000001 /* Copy the full length of the input file */
>  
>  #endif /* _UAPI_LINUX_STAT_H */
> -- 
> 2.20.1
> 

^ permalink raw reply

* Re: [PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
From: Andy Lutomirski @ 2019-04-13 21:48 UTC (permalink / raw)
  To: Shawn Landden; +Cc: linux-api, linux-fsdevel, linux-kernel, linux-api
In-Reply-To: <20190413204947.9394-1-shawn@git.icu>



> On Apr 13, 2019, at 1:49 PM, Shawn Landden <shawn@git.icu> wrote:
> 
> If flags includes COPY_FILE_RANGE_FILESIZE then the length
> copied is the length of the file. off_in and off_out are
> ignored. len must be 0 or the file size.
> 
> This implementation saves a call to stat() in the common case
> of copying files. It does not fix any race conditions, but that
> is possible in the future with this interface.
> 
> EAGAIN: If COPY_FILE_RANGE_FILESIZE was passed and len is not 0
> or the file size.

I think you’re asking for trouble here. I assume you have some kind of race prevention in mind here.  The trouble is that passing zero means copy the whole thing, but the size-checking behavior is only available for nonzero sizes. This means that anyone who passes their idea of the size needs to account for inconsistent behavior if the size is zero.

Also, what happens if the file size changes mid copy?  I assume the result is more or less arbitrary, but you should document what behavior is allowed.  The docs should cover the case where you race against an O_APPEND write.

> 
> Signed-off-by: Shawn Landden <shawn@git.icu>
> CC: <linux-api@vger.kernel.org>
> ---
> fs/read_write.c           | 14 +++++++++++++-
> include/uapi/linux/stat.h |  4 ++++
> 2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 61b43ad7608e..6d06361f0856 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>    struct inode *inode_out = file_inode(file_out);
>    ssize_t ret;
> 
> -    if (flags != 0)
> +    if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0)
>        return -EINVAL;
> 
>    if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> @@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>    if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
>        return -EINVAL;
> 
> +    if (flags & COPY_FILE_RANGE_FILESIZE) {
> +        struct kstat stat;
> +        int error;
> +        error = vfs_getattr(&file_in->f_path, &stat,
> +                    STATX_SIZE, 0);
> +        if (error < 0)
> +            return error;
> +        if (!(len == 0 || len == stat.size))
> +            return -EAGAIN;
> +        len = stat.size;
> +    }
> +
>    ret = rw_verify_area(READ, file_in, &pos_in, len);
>    if (unlikely(ret))
>        return ret;
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7b35e98d3c58..1075aa4666ef 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -170,5 +170,9 @@ struct statx {
> 
> #define STATX_ATTR_AUTOMOUNT        0x00001000 /* Dir: Automount trigger */
> 
> +/*
> + * Flags for copy_file_range()
> + */
> +#define COPY_FILE_RANGE_FILESIZE    0x00000001 /* Copy the full length of the input file */
> 
> #endif /* _UAPI_LINUX_STAT_H */
> -- 
> 2.20.1
> 

^ permalink raw reply

* [PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
From: Shawn Landden @ 2019-04-13 20:54 UTC (permalink / raw)
  Cc: linux-api, linux-fsdevel, linux-kernel, Shawn Landden

If flags includes COPY_FILE_RANGE_FILESIZE then the length
copied is the length of the file. off_in and off_out are
ignored. len must be 0 or the file size.

This implementation saves a call to stat() in the common case
of copying files. It does not fix any race conditions, but that
is possible in the future with this interface.

EAGAIN: If COPY_FILE_RANGE_FILESIZE was passed and len is not 0
or the file size.

Signed-off-by: Shawn Landden <shawn@git.icu>
CC: <linux-api@vger.kernel.org>
---
 fs/read_write.c           | 14 +++++++++++++-
 include/uapi/linux/stat.h |  4 ++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..6d06361f0856 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	struct inode *inode_out = file_inode(file_out);
 	ssize_t ret;
 
-	if (flags != 0)
+	if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0)
 		return -EINVAL;
 
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
@@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
+	if (flags & COPY_FILE_RANGE_FILESIZE) {
+		struct kstat stat;
+		int error;
+		error = vfs_getattr(&file_in->f_path, &stat,
+				    STATX_SIZE, 0);
+		if (error < 0)
+			return error;
+		if (!(len == 0 || len == stat.size))
+			return -EAGAIN;
+		len = stat.size;
+	}
+
 	ret = rw_verify_area(READ, file_in, &pos_in, len);
 	if (unlikely(ret))
 		return ret;
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7b35e98d3c58..1075aa4666ef 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -170,5 +170,9 @@ struct statx {
 
 #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
 
+/*
+ * Flags for copy_file_range()
+ */
+#define COPY_FILE_RANGE_FILESIZE	0x00000001 /* Copy the full length of the input file */
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.20.1

^ permalink raw reply related

* [PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
From: Shawn Landden @ 2019-04-13 20:49 UTC (permalink / raw)
  Cc: linux-api, linux-fsdevel, linux-kernel, Shawn Landden, linux-api

If flags includes COPY_FILE_RANGE_FILESIZE then the length
copied is the length of the file. off_in and off_out are
ignored. len must be 0 or the file size.

This implementation saves a call to stat() in the common case
of copying files. It does not fix any race conditions, but that
is possible in the future with this interface.

EAGAIN: If COPY_FILE_RANGE_FILESIZE was passed and len is not 0
or the file size.

Signed-off-by: Shawn Landden <shawn@git.icu>
CC: <linux-api@vger.kernel.org>
---
 fs/read_write.c           | 14 +++++++++++++-
 include/uapi/linux/stat.h |  4 ++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..6d06361f0856 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	struct inode *inode_out = file_inode(file_out);
 	ssize_t ret;
 
-	if (flags != 0)
+	if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0)
 		return -EINVAL;
 
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
@@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
+	if (flags & COPY_FILE_RANGE_FILESIZE) {
+		struct kstat stat;
+		int error;
+		error = vfs_getattr(&file_in->f_path, &stat,
+				    STATX_SIZE, 0);
+		if (error < 0)
+			return error;
+		if (!(len == 0 || len == stat.size))
+			return -EAGAIN;
+		len = stat.size;
+	}
+
 	ret = rw_verify_area(READ, file_in, &pos_in, len);
 	if (unlikely(ret))
 		return ret;
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7b35e98d3c58..1075aa4666ef 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -170,5 +170,9 @@ struct statx {
 
 #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
 
+/*
+ * Flags for copy_file_range()
+ */
+#define COPY_FILE_RANGE_FILESIZE	0x00000001 /* Copy the full length of the input file */
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.20.1

^ permalink raw reply related

* Re: [PATCH 15/17] fpga: dfl: fme: add power management support
From: Moritz Fischer @ 2019-04-12 21:05 UTC (permalink / raw)
  To: Alan Tull
  Cc: Wu Hao, linux-fpga, linux-kernel, linux-api, Luwei Kang, Xu Yilun
In-Reply-To: <CANk1AXTji9qm0Nfbme+2=nR-PNavnv__GwzDRazhH78hYkWN6g@mail.gmail.com>

Hi Hao,

this looks suspiciously like a hwmon driver ;-)

https://www.kernel.org/doc/Documentation/hwmon/hwmon-kernel-api.txt

Cheers,
Moritz


On Thu, Apr 11, 2019 at 1:08 PM Alan Tull <atull@kernel.org> wrote:
>
> On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:
>
> Hi Hao,
>
> >
> > This patch adds support for power management private feature under
> > FPGA Management Engine (FME), sysfs interfaces are introduced for
> > different power management functions, users could use these sysfs
> > interface to get current number of consumed power, throttling
>
> How about
> s/number/measurement/
> ?
>
> > thresholds, threshold status and other information, and configure
> > different value for throttling thresholds too.
> >
> > Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > ---
> >  Documentation/ABI/testing/sysfs-platform-dfl-fme |  56 +++++
> >  drivers/fpga/dfl-fme-main.c                      | 257 +++++++++++++++++++++++
> >  2 files changed, 313 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > index d3aeb88..4b6448f 100644
> > --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > @@ -100,3 +100,59 @@ Description:       Read-only. Read this file to get the policy of temperature
> >                 threshold1. It only supports two value (policy):
> >                     0 - AP2 state (90% throttling)
> >                     1 - AP1 state (50% throttling)
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/consumed
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns current power consumed by FPGA.
>
> What are the units?
>
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Write. Read/Write this file to get/set current power
> > +               threshold1 in Watts.
>
> Perhaps document error codes here and for threshold2 below.
>
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Write. Read/Write this file to get/set current power
> > +               threshold2 in Watts.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1_status
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns 1 if power consumption reaches the
> > +               threshold1, otherwise 0.
>
> I'm used to things like this requiring user to reset the status, so it
> may be worth making it explicit that it will return to zero if
> consumption drops below threshold if that's what's happening here.
> If it's correct, perhaps could just say something like 'returns 1 if
> power consumption is currently at or above threshold1, otherwise 0'
>
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2_status
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns 1 if power consumption reaches the
> > +               threshold2, otherwise 0.
>
> Same here.
>
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/ltr
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get current Latency Tolerance
> > +               Reporting (ltr) value, it's only valid for integrated
> > +               solution as it blocks CPU on low power state.
>
> If we're not on the integrated solution, it returns a value but it is
> not really real?
>
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/xeon_limit
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get power limit for xeon, it
> > +               is only valid for integrated solution.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/fpga_limit
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get power limit for fpga, it
> > +               is only valid for integrated solution.
> > diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> > index 449a17d..dafa6580 100644
> > --- a/drivers/fpga/dfl-fme-main.c
> > +++ b/drivers/fpga/dfl-fme-main.c
> > @@ -415,6 +415,259 @@ static const struct dfl_feature_ops fme_thermal_mgmt_ops = {
> >         .uinit = fme_thermal_mgmt_uinit,
> >  };
> >
> > +#define FME_PWR_STATUS         0x8
> > +#define FME_LATENCY_TOLERANCE  BIT_ULL(18)
> > +#define PWR_CONSUMED           GENMASK_ULL(17, 0)
> > +
> > +#define FME_PWR_THRESHOLD      0x10
> > +#define PWR_THRESHOLD1         GENMASK_ULL(6, 0)       /* in Watts */
> > +#define PWR_THRESHOLD2         GENMASK_ULL(14, 8)      /* in Watts */
> > +#define PWR_THRESHOLD_MAX      0x7f
> > +#define PWR_THRESHOLD1_STATUS  BIT_ULL(16)
> > +#define PWR_THRESHOLD2_STATUS  BIT_ULL(17)
> > +
> > +#define FME_PWR_XEON_LIMIT     0x18
> > +#define XEON_PWR_LIMIT         GENMASK_ULL(14, 0)
> > +#define XEON_PWR_EN            BIT_ULL(15)
> > +#define FME_PWR_FPGA_LIMIT     0x20
> > +#define FPGA_PWR_LIMIT         GENMASK_ULL(14, 0)
> > +#define FPGA_PWR_EN            BIT_ULL(15)
> > +
> > +#define POWER_ATTR(_name, _mode, _show, _store)        \
> > +struct device_attribute power_attr_##_name =           \
> > +       __ATTR(_name, _mode, _show, _store)
> > +
> > +#define POWER_ATTR_RO(_name, _show)                    \
> > +       POWER_ATTR(_name, 0444, _show, NULL)
> > +
> > +#define POWER_ATTR_RW(_name, _show, _store)            \
> > +       POWER_ATTR(_name, 0644, _show, _store)
>
> Are these #defines necessary?  Seems like you could just use DEVICE_ATTR*
>
> > +
> > +static ssize_t pwr_consumed_show(struct device *dev,
> > +                                struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_STATUS);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(PWR_CONSUMED, v));
> > +}
> > +static POWER_ATTR_RO(consumed, pwr_consumed_show);
> > +
> > +static ssize_t pwr_threshold1_show(struct device *dev,
> > +                                  struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD1, v));
> > +}
> > +
> > +static ssize_t pwr_threshold1_store(struct device *dev,
> > +                                   struct device_attribute *attr,
> > +                                   const char *buf, size_t count)
> > +{
> > +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +       void __iomem *base;
> > +       u8 threshold;
> > +       int ret;
> > +       u64 v;
> > +
> > +       ret = kstrtou8(buf, 0, &threshold);
> > +       if (ret)
> > +               return ret;
> > +
> > +       if (threshold > PWR_THRESHOLD_MAX)
> > +               return -EINVAL;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       mutex_lock(&pdata->lock);
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +       v &= ~PWR_THRESHOLD1;
> > +       v |= FIELD_PREP(PWR_THRESHOLD1, threshold);
> > +       writeq(v, base + FME_PWR_THRESHOLD);
> > +       mutex_unlock(&pdata->lock);
> > +
> > +       return count;
> > +}
> > +static POWER_ATTR_RW(threshold1, pwr_threshold1_show, pwr_threshold1_store);
> > +
> > +static ssize_t pwr_threshold2_show(struct device *dev,
> > +                                  struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD2, v));
> > +}
> > +
> > +static ssize_t pwr_threshold2_store(struct device *dev,
> > +                                   struct device_attribute *attr,
> > +                                   const char *buf, size_t count)
> > +{
> > +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +       void __iomem *base;
> > +       u8 threshold;
> > +       int ret;
> > +       u64 v;
> > +
> > +       ret = kstrtou8(buf, 0, &threshold);
> > +       if (ret)
> > +               return ret;
> > +
> > +       if (threshold > PWR_THRESHOLD_MAX)
> > +               return -EINVAL;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       mutex_lock(&pdata->lock);
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +       v &= ~PWR_THRESHOLD2;
> > +       v |= FIELD_PREP(PWR_THRESHOLD2, threshold);
> > +       writeq(v, base + FME_PWR_THRESHOLD);
> > +       mutex_unlock(&pdata->lock);
> > +
> > +       return count;
> > +}
> > +static POWER_ATTR_RW(threshold2, pwr_threshold2_show, pwr_threshold2_store);
> > +
> > +static ssize_t pwr_threshold1_status_show(struct device *dev,
> > +                                         struct device_attribute *attr,
> > +                                         char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD1_STATUS, v));
> > +}
> > +static POWER_ATTR_RO(threshold1_status, pwr_threshold1_status_show);
> > +
> > +static ssize_t pwr_threshold2_status_show(struct device *dev,
> > +                                         struct device_attribute *attr,
> > +                                         char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD2_STATUS, v));
> > +}
> > +static POWER_ATTR_RO(threshold2_status, pwr_threshold2_status_show);
> > +
> > +static ssize_t ltr_show(struct device *dev,
> > +                       struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_STATUS);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
> > +}
> > +static POWER_ATTR_RO(ltr, ltr_show);
> > +
> > +static ssize_t xeon_limit_show(struct device *dev,
> > +                              struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u16 xeon_limit = 0;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_XEON_LIMIT);
> > +
> > +       if (FIELD_GET(XEON_PWR_EN, v))
> > +               xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n", xeon_limit);
> > +}
> > +static POWER_ATTR_RO(xeon_limit, xeon_limit_show);
> > +
> > +static ssize_t fpga_limit_show(struct device *dev,
> > +                              struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u16 fpga_limit = 0;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_FPGA_LIMIT);
> > +
> > +       if (FIELD_GET(FPGA_PWR_EN, v))
> > +               fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n", fpga_limit);
> > +}
> > +static POWER_ATTR_RO(fpga_limit, fpga_limit_show);
> > +
> > +static struct attribute *power_mgmt_attrs[] = {
> > +       &power_attr_consumed.attr,
> > +       &power_attr_threshold1.attr,
> > +       &power_attr_threshold2.attr,
> > +       &power_attr_threshold1_status.attr,
> > +       &power_attr_threshold2_status.attr,
> > +       &power_attr_xeon_limit.attr,
> > +       &power_attr_fpga_limit.attr,
> > +       &power_attr_ltr.attr,
>
> This is a nit, but I would expect to see these listed in the same
> order as their show/store functions above.  So ltr_attr would come
> between threshold2_status_attr and xeon_limit_attr.
>
> > +       NULL,
> > +};
> > +
> > +static struct attribute_group power_mgmt_attr_group = {
> > +       .attrs  = power_mgmt_attrs,
> > +       .name   = "power_mgmt",
> > +};
> > +
> > +static int fme_power_mgmt_init(struct platform_device *pdev,
> > +                              struct dfl_feature *feature)
> > +{
> > +       return sysfs_create_group(&pdev->dev.kobj, &power_mgmt_attr_group);
> > +}
> > +
> > +static void fme_power_mgmt_uinit(struct platform_device *pdev,
> > +                                struct dfl_feature *feature)
> > +{
> > +       sysfs_remove_group(&pdev->dev.kobj, &power_mgmt_attr_group);
> > +}
> > +
> > +static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
> > +       {.id = FME_FEATURE_ID_POWER_MGMT,},
> > +       {0,}
> > +};
> > +
> > +static const struct dfl_feature_ops fme_power_mgmt_ops = {
> > +       .init = fme_power_mgmt_init,
> > +       .uinit = fme_power_mgmt_uinit,
> > +};
> > +
> >  static struct dfl_feature_driver fme_feature_drvs[] = {
> >         {
> >                 .id_table = fme_hdr_id_table,
> > @@ -429,6 +682,10 @@ static struct dfl_feature_driver fme_feature_drvs[] = {
> >                 .ops = &fme_thermal_mgmt_ops,
> >         },
> >         {
> > +               .id_table = fme_power_mgmt_id_table,
> > +               .ops = &fme_power_mgmt_ops,
> > +       },
> > +       {
> >                 .ops = NULL,
> >         },
> >  };
> > --
> > 2.7.4
> >
>
> Thanks,
> Alan

^ permalink raw reply

* Re: [PATCH v2 10/11] platform/x86: asus-wmi: Switch fan boost mode
From: Yurii Pavlovskyi @ 2019-04-12 20:50 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Corentin Chary, Darren Hart, Andy Shevchenko, acpi4asus-user,
	Platform Driver, Linux Kernel, linux-api,
	Endless Linux Upstreaming Team
In-Reply-To: <CAD8Lp45ivFJiW0FZo+ZOyUdpV1JFwPg=o7rx7eyp_YdSAbRRkg@mail.gmail.com>

On 12.04.19 10:03, Daniel Drake wrote:
> On Thu, Apr 11, 2019 at 1:47 PM Yurii Pavlovskyi
> <yurii.pavlovskyi@gmail.com> wrote:
>> * 0x00 - is normal,
>> * 0x01 - is obviously turbo by the amount of noise, might be useful to
>> avoid CPU frequency throttling on high load,
>> * 0x02 - the meaning is unknown at the time as modes are not named
>> in the vendor documentation, but it does look like a quiet mode as CPU
>> temperature does increase about 10 degrees on maximum load.
> 
> I'm curious which vendor documentation you're working with here.

That would be user manual for FX505 series, which is pretty minimalistic.
It says it has a fan mode switch key and that's it. Following up on your
comment, I've searched more and actually did found their names hidden in
marketing web page on the website. You're right, they call them balanced,
overboost and silent there.

> From the spec,
> 0 = normal
> 1 = overboost
> 2 = silent

> Also you can use DSTS on the 0x00110018 device to check the exact
> capabilities supported, which you should use to refine which modes can
> be cycled through.
> Bit 0 = overboost supported
> Bit 1 = silent supported
> 
> Thanks
> Daniel
> 

Thanks! I was guessing that the 3 in DSTS must've meant something.

Appreciate your comments! Will definitely implement them. I'm going to post
the v3 patch series approximately middle of next week.

Best regards,
Yurii

^ permalink raw reply

* Re: [PATCH v2 10/11] platform/x86: asus-wmi: Switch fan boost mode
From: Daniel Drake @ 2019-04-12  8:03 UTC (permalink / raw)
  To: Yurii Pavlovskyi
  Cc: Corentin Chary, Darren Hart, Andy Shevchenko, acpi4asus-user,
	Platform Driver, Linux Kernel, linux-api,
	Endless Linux Upstreaming Team
In-Reply-To: <fecb4870-7875-a524-bf6d-3108e25af888@gmail.com>

On Thu, Apr 11, 2019 at 1:47 PM Yurii Pavlovskyi
<yurii.pavlovskyi@gmail.com> wrote:
> * 0x00 - is normal,
> * 0x01 - is obviously turbo by the amount of noise, might be useful to
> avoid CPU frequency throttling on high load,
> * 0x02 - the meaning is unknown at the time as modes are not named
> in the vendor documentation, but it does look like a quiet mode as CPU
> temperature does increase about 10 degrees on maximum load.

I'm curious which vendor documentation you're working with here.

>From the spec,
0 = normal
1 = overboost
2 = silent

Also you can use DSTS on the 0x00110018 device to check the exact
capabilities supported, which you should use to refine which modes can
be cycled through.
Bit 0 = overboost supported
Bit 1 = silent supported

Thanks
Daniel

^ permalink raw reply

* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: Florian Weimer @ 2019-04-12  7:50 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, hpa, linux-api, linuxppc-dev
In-Reply-To: <db468a0f-4a7b-d251-601f-428885275d08@linaro.org>

* Adhemerval Zanella:

> On 11/04/2019 08:07, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> This allows us to adjust the baud rates to non-standard values using termios
>>> interfaces without to resorting to add new headers and use a different API
>>> (ioctl).
>> 
>> How much symbol versioning will be required for this change?
>
> I think all interfaces that have termios as input for sparc and mips 
> (tcgetattr, tcsetattr, cfmakeraw, cfgetispeed, cfgetospeed, cfsetispeed,
> cfsetospeed, cfsetspeed).
>
> Alpha will also need to use termios1 for pre-4.20 kernels.

So only new symbol versions there?  Hmm.

>>> As Peter Anvin has indicated, he create a POC [1] with the aforementioned
>>> new interfaces.  It has not been rebased against master, more specially against
>>> my termios refactor to simplify the multiple architecture header definitions,
>>> but I intend to use as a base.
>> 
>> Reference [1] is still missing. 8-(
>
> Oops... it is https://git.zytor.com/users/hpa/glibc/termbaud.git/log/?h=wip.termbaud

This doesn't really illuminate things.  “Drop explicit baud setting
interfaces in favor of cfenc|decspeed()” removes the new symbol version
for the cf* functions.

My gut feeling is that it's safer to add new interfaces, based on the
actual kernel/userspace interface, rather than trying to fix up existing
interfaces with symbol versioning.  The main reason is that code
involving serial interfaces is difficult to test, so it will take years
until we find the last application broken by the glibc interface bump.

I don't feel strongly about this.  This came out of a request for
enabling TCGETS2 support downstream.  If I can't fix this upstream, I
will just reject that request.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH 15/17] fpga: dfl: fme: add power management support
From: Wu Hao @ 2019-04-12  2:50 UTC (permalink / raw)
  To: Alan Tull
  Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Luwei Kang,
	Xu Yilun
In-Reply-To: <CANk1AXTji9qm0Nfbme+2=nR-PNavnv__GwzDRazhH78hYkWN6g@mail.gmail.com>

On Thu, Apr 11, 2019 at 03:07:35PM -0500, Alan Tull wrote:
> On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:
> 
> Hi Hao,
> 
> >
> > This patch adds support for power management private feature under
> > FPGA Management Engine (FME), sysfs interfaces are introduced for
> > different power management functions, users could use these sysfs
> > interface to get current number of consumed power, throttling
> 
> How about
> s/number/measurement/
> ?

Sounds better. : )

> 
> > thresholds, threshold status and other information, and configure
> > different value for throttling thresholds too.
> >
> > Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > ---
> >  Documentation/ABI/testing/sysfs-platform-dfl-fme |  56 +++++
> >  drivers/fpga/dfl-fme-main.c                      | 257 +++++++++++++++++++++++
> >  2 files changed, 313 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > index d3aeb88..4b6448f 100644
> > --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > @@ -100,3 +100,59 @@ Description:       Read-only. Read this file to get the policy of temperature
> >                 threshold1. It only supports two value (policy):
> >                     0 - AP2 state (90% throttling)
> >                     1 - AP1 state (50% throttling)
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/consumed
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns current power consumed by FPGA.
> 
> What are the units?
> 
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Write. Read/Write this file to get/set current power
> > +               threshold1 in Watts.
> 
> Perhaps document error codes here and for threshold2 below.
> 
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Write. Read/Write this file to get/set current power
> > +               threshold2 in Watts.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1_status
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns 1 if power consumption reaches the
> > +               threshold1, otherwise 0.
> 
> I'm used to things like this requiring user to reset the status, so it
> may be worth making it explicit that it will return to zero if
> consumption drops below threshold if that's what's happening here.
> If it's correct, perhaps could just say something like 'returns 1 if
> power consumption is currently at or above threshold1, otherwise 0'
> 
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2_status
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns 1 if power consumption reaches the
> > +               threshold2, otherwise 0.
> 
> Same here.

Sure, will fix all above comments in this sysfs doc.

> 
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/ltr
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get current Latency Tolerance
> > +               Reporting (ltr) value, it's only valid for integrated
> > +               solution as it blocks CPU on low power state.
> 
> If we're not on the integrated solution, it returns a value but it is
> not really real?

Currently only integrated solution is implementing this private feature, other
devices e.g. Intel PAC card is not using this private feature, so user will
not see these sysfs interfaces at all.

If in the future, other devices want

> 
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/xeon_limit
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get power limit for xeon, it
> > +               is only valid for integrated solution.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/fpga_limit
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get power limit for fpga, it
> > +               is only valid for integrated solution.
> > diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> > index 449a17d..dafa6580 100644
> > --- a/drivers/fpga/dfl-fme-main.c
> > +++ b/drivers/fpga/dfl-fme-main.c
> > @@ -415,6 +415,259 @@ static const struct dfl_feature_ops fme_thermal_mgmt_ops = {
> >         .uinit = fme_thermal_mgmt_uinit,
> >  };
> >
> > +#define FME_PWR_STATUS         0x8
> > +#define FME_LATENCY_TOLERANCE  BIT_ULL(18)
> > +#define PWR_CONSUMED           GENMASK_ULL(17, 0)
> > +
> > +#define FME_PWR_THRESHOLD      0x10
> > +#define PWR_THRESHOLD1         GENMASK_ULL(6, 0)       /* in Watts */
> > +#define PWR_THRESHOLD2         GENMASK_ULL(14, 8)      /* in Watts */
> > +#define PWR_THRESHOLD_MAX      0x7f
> > +#define PWR_THRESHOLD1_STATUS  BIT_ULL(16)
> > +#define PWR_THRESHOLD2_STATUS  BIT_ULL(17)
> > +
> > +#define FME_PWR_XEON_LIMIT     0x18
> > +#define XEON_PWR_LIMIT         GENMASK_ULL(14, 0)
> > +#define XEON_PWR_EN            BIT_ULL(15)
> > +#define FME_PWR_FPGA_LIMIT     0x20
> > +#define FPGA_PWR_LIMIT         GENMASK_ULL(14, 0)
> > +#define FPGA_PWR_EN            BIT_ULL(15)
> > +
> > +#define POWER_ATTR(_name, _mode, _show, _store)        \
> > +struct device_attribute power_attr_##_name =           \
> > +       __ATTR(_name, _mode, _show, _store)
> > +
> > +#define POWER_ATTR_RO(_name, _show)                    \
> > +       POWER_ATTR(_name, 0444, _show, NULL)
> > +
> > +#define POWER_ATTR_RW(_name, _show, _store)            \
> > +       POWER_ATTR(_name, 0644, _show, _store)
> 
> Are these #defines necessary?  Seems like you could just use DEVICE_ATTR*

Actually it adds a prefix power_attr_xxx there to avoid name conflicts with
other ones from different private features, e.g. for the thermal threshold.

> 
> > +
> > +static ssize_t pwr_consumed_show(struct device *dev,
> > +                                struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_STATUS);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(PWR_CONSUMED, v));
> > +}
> > +static POWER_ATTR_RO(consumed, pwr_consumed_show);
> > +
> > +static ssize_t pwr_threshold1_show(struct device *dev,
> > +                                  struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD1, v));
> > +}
> > +
> > +static ssize_t pwr_threshold1_store(struct device *dev,
> > +                                   struct device_attribute *attr,
> > +                                   const char *buf, size_t count)
> > +{
> > +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +       void __iomem *base;
> > +       u8 threshold;
> > +       int ret;
> > +       u64 v;
> > +
> > +       ret = kstrtou8(buf, 0, &threshold);
> > +       if (ret)
> > +               return ret;
> > +
> > +       if (threshold > PWR_THRESHOLD_MAX)
> > +               return -EINVAL;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       mutex_lock(&pdata->lock);
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +       v &= ~PWR_THRESHOLD1;
> > +       v |= FIELD_PREP(PWR_THRESHOLD1, threshold);
> > +       writeq(v, base + FME_PWR_THRESHOLD);
> > +       mutex_unlock(&pdata->lock);
> > +
> > +       return count;
> > +}
> > +static POWER_ATTR_RW(threshold1, pwr_threshold1_show, pwr_threshold1_store);
> > +
> > +static ssize_t pwr_threshold2_show(struct device *dev,
> > +                                  struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD2, v));
> > +}
> > +
> > +static ssize_t pwr_threshold2_store(struct device *dev,
> > +                                   struct device_attribute *attr,
> > +                                   const char *buf, size_t count)
> > +{
> > +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +       void __iomem *base;
> > +       u8 threshold;
> > +       int ret;
> > +       u64 v;
> > +
> > +       ret = kstrtou8(buf, 0, &threshold);
> > +       if (ret)
> > +               return ret;
> > +
> > +       if (threshold > PWR_THRESHOLD_MAX)
> > +               return -EINVAL;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       mutex_lock(&pdata->lock);
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +       v &= ~PWR_THRESHOLD2;
> > +       v |= FIELD_PREP(PWR_THRESHOLD2, threshold);
> > +       writeq(v, base + FME_PWR_THRESHOLD);
> > +       mutex_unlock(&pdata->lock);
> > +
> > +       return count;
> > +}
> > +static POWER_ATTR_RW(threshold2, pwr_threshold2_show, pwr_threshold2_store);
> > +
> > +static ssize_t pwr_threshold1_status_show(struct device *dev,
> > +                                         struct device_attribute *attr,
> > +                                         char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD1_STATUS, v));
> > +}
> > +static POWER_ATTR_RO(threshold1_status, pwr_threshold1_status_show);
> > +
> > +static ssize_t pwr_threshold2_status_show(struct device *dev,
> > +                                         struct device_attribute *attr,
> > +                                         char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD2_STATUS, v));
> > +}
> > +static POWER_ATTR_RO(threshold2_status, pwr_threshold2_status_show);
> > +
> > +static ssize_t ltr_show(struct device *dev,
> > +                       struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_STATUS);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
> > +}
> > +static POWER_ATTR_RO(ltr, ltr_show);
> > +
> > +static ssize_t xeon_limit_show(struct device *dev,
> > +                              struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u16 xeon_limit = 0;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_XEON_LIMIT);
> > +
> > +       if (FIELD_GET(XEON_PWR_EN, v))
> > +               xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n", xeon_limit);
> > +}
> > +static POWER_ATTR_RO(xeon_limit, xeon_limit_show);
> > +
> > +static ssize_t fpga_limit_show(struct device *dev,
> > +                              struct device_attribute *attr, char *buf)
> > +{
> > +       void __iomem *base;
> > +       u16 fpga_limit = 0;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > +
> > +       v = readq(base + FME_PWR_FPGA_LIMIT);
> > +
> > +       if (FIELD_GET(FPGA_PWR_EN, v))
> > +               fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n", fpga_limit);
> > +}
> > +static POWER_ATTR_RO(fpga_limit, fpga_limit_show);
> > +
> > +static struct attribute *power_mgmt_attrs[] = {
> > +       &power_attr_consumed.attr,
> > +       &power_attr_threshold1.attr,
> > +       &power_attr_threshold2.attr,
> > +       &power_attr_threshold1_status.attr,
> > +       &power_attr_threshold2_status.attr,
> > +       &power_attr_xeon_limit.attr,
> > +       &power_attr_fpga_limit.attr,
> > +       &power_attr_ltr.attr,
> 
> This is a nit, but I would expect to see these listed in the same
> order as their show/store functions above.  So ltr_attr would come
> between threshold2_status_attr and xeon_limit_attr.

Sure, it does make sense.

> 
> > +       NULL,
> > +};
> > +
> > +static struct attribute_group power_mgmt_attr_group = {
> > +       .attrs  = power_mgmt_attrs,
> > +       .name   = "power_mgmt",
> > +};
> > +
> > +static int fme_power_mgmt_init(struct platform_device *pdev,
> > +                              struct dfl_feature *feature)
> > +{
> > +       return sysfs_create_group(&pdev->dev.kobj, &power_mgmt_attr_group);
> > +}
> > +
> > +static void fme_power_mgmt_uinit(struct platform_device *pdev,
> > +                                struct dfl_feature *feature)
> > +{
> > +       sysfs_remove_group(&pdev->dev.kobj, &power_mgmt_attr_group);
> > +}
> > +
> > +static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
> > +       {.id = FME_FEATURE_ID_POWER_MGMT,},
> > +       {0,}
> > +};
> > +
> > +static const struct dfl_feature_ops fme_power_mgmt_ops = {
> > +       .init = fme_power_mgmt_init,
> > +       .uinit = fme_power_mgmt_uinit,
> > +};
> > +
> >  static struct dfl_feature_driver fme_feature_drvs[] = {
> >         {
> >                 .id_table = fme_hdr_id_table,
> > @@ -429,6 +682,10 @@ static struct dfl_feature_driver fme_feature_drvs[] = {
> >                 .ops = &fme_thermal_mgmt_ops,
> >         },
> >         {
> > +               .id_table = fme_power_mgmt_id_table,
> > +               .ops = &fme_power_mgmt_ops,
> > +       },
> > +       {
> >                 .ops = NULL,
> >         },
> >  };
> > --
> > 2.7.4
> >

Thanks a lot for the review and comments. :)

Hao

> 
> Thanks,
> Alan

^ permalink raw reply

* Re: [PATCH v14 1/3] /proc/pid/status: Add support for architecture specific output
From: Li, Aubrey @ 2019-04-12  0:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Andi Kleen, Tim Chen, Dave Hansen, Arjan van de Ven,
	Alexey Dobriyan, Andrew Morton, aubrey.li, Linux API, LKML
In-Reply-To: <3c2a8a1a-ed4b-348e-f06f-f71dbf64bbba@linux.intel.com>

On 2019/4/11 9:02, Li, Aubrey wrote:
> On 2019/4/10 22:54, Andy Lutomirski wrote:
>> On Tue, Apr 9, 2019 at 8:40 PM Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>>
>>> On 2019/4/10 10:36, Li, Aubrey wrote:
>>>> On 2019/4/10 10:25, Andy Lutomirski wrote:
>>>>> On Tue, Apr 9, 2019 at 7:20 PM Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>>>>>
>>>>>> On 2019/4/10 9:58, Andy Lutomirski wrote:
>>>>>>> On Tue, Apr 9, 2019 at 6:55 PM Aubrey Li <aubrey.li@linux.intel.com> wrote:
>>>>>>>>
>>>>>>>> The architecture specific information of the running processes could
>>>>>>>> be useful to the userland. Add support to examine process architecture
>>>>>>>> specific information externally.
>>>>>>>>
>>>>>>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>>>>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>>>>>> Cc: Andi Kleen <ak@linux.intel.com>
>>>>>>>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>>>>>>>> Cc: Dave Hansen <dave.hansen@intel.com>
>>>>>>>> Cc: Arjan van de Ven <arjan@linux.intel.com>
>>>>>>>> Cc: Linux API <linux-api@vger.kernel.org>
>>>>>>>> Cc: Alexey Dobriyan <adobriyan@gmail.com>
>>>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>>>>> ---
>>>>>>>>  fs/proc/array.c         | 5 +++++
>>>>>>>>  include/linux/proc_fs.h | 2 ++
>>>>>>>>  2 files changed, 7 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>>>>>>>> index 2edbb657f859..331592a61718 100644
>>>>>>>> --- a/fs/proc/array.c
>>>>>>>> +++ b/fs/proc/array.c
>>>>>>>> @@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>>>>>>>>         seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
>>>>>>>> +{
>>>>>>>> +}
>>>>>>>
>>>>>>> This pointlessly bloats other architectures.  Do this instead in an
>>>>>>> appropriate header:
>>>>>>>
>>>>>>> #ifndef arch_proc_pid_status
>>>>>>> static inline void arch_proc_pid_status(...)
>>>>>>> {
>>>>>>> }
>>>>>>> #endif
>>>>>>>
>>>>>>
>>>>>> I saw a bunch of similar weak functions, is it not acceptable?
>>>>>>
>>>>>> fs/proc$ grep weak *.c
>>>>>> cpuinfo.c:__weak void arch_freq_prepare_all(void)
>>>>>> meminfo.c:void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
>>>>>> vmcore.c:int __weak elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
>>>>>> vmcore.c:void __weak elfcorehdr_free(unsigned long long addr)
>>>>>> vmcore.c:ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos)
>>>>>> vmcore.c:ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
>>>>>> vmcore.c:int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
>>>>>> vmcore.c:ssize_t __weak
>>>>>
>>>>> I think they're acceptable, but I don't personally like them.
>>>>>
>>>>
>>>> okay, let me try to see if I can refine it in an appropriate way.
>>>
>>> Hi Andy,
>>>
>>> Is this what you want?
>>>
>>> Thanks,
>>> -Aubrey
>>>
>>> ====================================================================
>>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>>> index 2bb3a648fc12..82d77d3aefff 100644
>>> --- a/arch/x86/include/asm/processor.h
>>> +++ b/arch/x86/include/asm/processor.h
>>> @@ -990,5 +990,8 @@ enum l1tf_mitigations {
>>>  };
>>>
>>>  extern enum l1tf_mitigations l1tf_mitigation;
>>> +/* Add support for architecture specific output in /proc/pid/status */
>>> +void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
>>> +#define arch_proc_pid_status arch_proc_pid_status
>>>
>>>  #endif /* _ASM_X86_PROCESSOR_H */
>>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>>> index 2edbb657f859..fd65a6ba2864 100644
>>> --- a/fs/proc/array.c
>>> +++ b/fs/proc/array.c
>>> @@ -401,6 +401,11 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>>>         seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>>>  }
>>>
>>> +/* Add support for architecture specific output in /proc/pid/status */
>>> +#ifndef arch_proc_pid_status
>>> +#define arch_proc_pid_status(m, task)
>>> +#endif
>>> +
>>>  int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
>>>                         struct pid *pid, struct task_struct *task)
>>>  {
>>> @@ -424,6 +429,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
>>>         task_cpus_allowed(m, task);
>>>         cpuset_task_status_allowed(m, task);
>>>         task_context_switch_counts(m, task);
>>> +       arch_proc_pid_status(m, task);
>>>         return 0;
>>>  }
>>>
>>
>> Yes.  But I still think it would be nicer to separate the arch stuff
>> into its own file.  Others might reasonably disagree with me.
>>
> I like arch_status, I proposed but no other arch shows interesting in it.
> 
> I think the problem is similar for x86_status, it does not make sense for
> those x86 platform without AVX512 to have an empty arch file. I personally
> don't like [arch]_status because the code may become unclean if more arches
> added in future.
> 
> Maybe it's too early to have a separated arch staff file for now.

Hi Andy,

Is it acceptable to you if I make the above change and post v15?

Thanks,
-Aubrey

^ permalink raw reply

* Re: [PATCH 09/17] fpga: dfl: add id_table for dfl private feature driver
From: Alan Tull @ 2019-04-11 20:55 UTC (permalink / raw)
  To: Moritz Fischer
  Cc: Wu Hao, Alan Tull, linux-fpga, linux-kernel, linux-api, Xu Yilun
In-Reply-To: <20190402150956.GC15773@archbook>

On Tue, Apr 2, 2019 at 10:09 AM Moritz Fischer <mdf@kernel.org> wrote:
>
> Hi Wu,
>
> On Mon, Mar 25, 2019 at 11:07:36AM +0800, Wu Hao wrote:
> > This patch adds id_table for each dfl private feature driver,
> > it allows to reuse same private feature driver to match and support
> > multiple dfl private features.
> >
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Moritz Fischer <mdf@kernel.org>
Acked-by: Alan Tull <atull@kernel.org>

Thanks,
Alan

^ permalink raw reply

* Re: [PATCH 10/17] fpga: dfl: afu: export __port_enable/disable function.
From: Alan Tull @ 2019-04-11 20:45 UTC (permalink / raw)
  To: Moritz Fischer
  Cc: Wu Hao, Alan Tull, linux-fpga, linux-kernel, linux-api, Xu Yilun
In-Reply-To: <20190402155018.GA1453@archbook>

On Tue, Apr 2, 2019 at 10:50 AM Moritz Fischer <mdf@kernel.org> wrote:

Hi Hao,

>
> On Mon, Mar 25, 2019 at 11:07:37AM +0800, Wu Hao wrote:
> > As these two functions are used by other private features. e.g.
> > in error reporting private feature, it requires to check port status
> > and reset port for error clearing.
> >
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Moritz Fischer <mdf@kernel.org>

Acked-by: Alan Tull <atull@kernel.org>

Thanks,
Alan

^ permalink raw reply

* Re: [PATCH 12/17] fpga: dfl: afu: add STP (SignalTap) support
From: Alan Tull @ 2019-04-11 20:41 UTC (permalink / raw)
  To: Moritz Fischer
  Cc: Wu Hao, Alan Tull, linux-fpga, linux-kernel, linux-api, Xu Yilun
In-Reply-To: <20190402150719.GB15773@archbook>

On Tue, Apr 2, 2019 at 10:07 AM Moritz Fischer <mdf@kernel.org> wrote:
>
> Hi Wu,
>
> On Mon, Mar 25, 2019 at 11:07:39AM +0800, Wu Hao wrote:
> > STP (SignalTap) is one of the private features under the port for
> > debugging. This patch adds private feature driver support for it
> > to allow userspace applications to mmap related mmio region and
> > provide STP service.
> >
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Moritz Fischer <mdf@kernel.org>

Acked-by: Alan Tull <atull@kernel.org>

Thanks,
Alan

^ permalink raw reply

* Re: [PATCH 15/17] fpga: dfl: fme: add power management support
From: Alan Tull @ 2019-04-11 20:07 UTC (permalink / raw)
  To: Wu Hao
  Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Luwei Kang,
	Xu Yilun
In-Reply-To: <1553483264-5379-16-git-send-email-hao.wu@intel.com>

On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:

Hi Hao,

>
> This patch adds support for power management private feature under
> FPGA Management Engine (FME), sysfs interfaces are introduced for
> different power management functions, users could use these sysfs
> interface to get current number of consumed power, throttling

How about
s/number/measurement/
?

> thresholds, threshold status and other information, and configure
> different value for throttling thresholds too.
>
> Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-fme |  56 +++++
>  drivers/fpga/dfl-fme-main.c                      | 257 +++++++++++++++++++++++
>  2 files changed, 313 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> index d3aeb88..4b6448f 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> @@ -100,3 +100,59 @@ Description:       Read-only. Read this file to get the policy of temperature
>                 threshold1. It only supports two value (policy):
>                     0 - AP2 state (90% throttling)
>                     1 - AP1 state (50% throttling)
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/consumed
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns current power consumed by FPGA.

What are the units?

> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Write. Read/Write this file to get/set current power
> +               threshold1 in Watts.

Perhaps document error codes here and for threshold2 below.

> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Write. Read/Write this file to get/set current power
> +               threshold2 in Watts.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1_status
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns 1 if power consumption reaches the
> +               threshold1, otherwise 0.

I'm used to things like this requiring user to reset the status, so it
may be worth making it explicit that it will return to zero if
consumption drops below threshold if that's what's happening here.
If it's correct, perhaps could just say something like 'returns 1 if
power consumption is currently at or above threshold1, otherwise 0'

> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2_status
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns 1 if power consumption reaches the
> +               threshold2, otherwise 0.

Same here.

> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/ltr
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. Read this file to get current Latency Tolerance
> +               Reporting (ltr) value, it's only valid for integrated
> +               solution as it blocks CPU on low power state.

If we're not on the integrated solution, it returns a value but it is
not really real?

> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/xeon_limit
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. Read this file to get power limit for xeon, it
> +               is only valid for integrated solution.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/fpga_limit
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. Read this file to get power limit for fpga, it
> +               is only valid for integrated solution.
> diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> index 449a17d..dafa6580 100644
> --- a/drivers/fpga/dfl-fme-main.c
> +++ b/drivers/fpga/dfl-fme-main.c
> @@ -415,6 +415,259 @@ static const struct dfl_feature_ops fme_thermal_mgmt_ops = {
>         .uinit = fme_thermal_mgmt_uinit,
>  };
>
> +#define FME_PWR_STATUS         0x8
> +#define FME_LATENCY_TOLERANCE  BIT_ULL(18)
> +#define PWR_CONSUMED           GENMASK_ULL(17, 0)
> +
> +#define FME_PWR_THRESHOLD      0x10
> +#define PWR_THRESHOLD1         GENMASK_ULL(6, 0)       /* in Watts */
> +#define PWR_THRESHOLD2         GENMASK_ULL(14, 8)      /* in Watts */
> +#define PWR_THRESHOLD_MAX      0x7f
> +#define PWR_THRESHOLD1_STATUS  BIT_ULL(16)
> +#define PWR_THRESHOLD2_STATUS  BIT_ULL(17)
> +
> +#define FME_PWR_XEON_LIMIT     0x18
> +#define XEON_PWR_LIMIT         GENMASK_ULL(14, 0)
> +#define XEON_PWR_EN            BIT_ULL(15)
> +#define FME_PWR_FPGA_LIMIT     0x20
> +#define FPGA_PWR_LIMIT         GENMASK_ULL(14, 0)
> +#define FPGA_PWR_EN            BIT_ULL(15)
> +
> +#define POWER_ATTR(_name, _mode, _show, _store)        \
> +struct device_attribute power_attr_##_name =           \
> +       __ATTR(_name, _mode, _show, _store)
> +
> +#define POWER_ATTR_RO(_name, _show)                    \
> +       POWER_ATTR(_name, 0444, _show, NULL)
> +
> +#define POWER_ATTR_RW(_name, _show, _store)            \
> +       POWER_ATTR(_name, 0644, _show, _store)

Are these #defines necessary?  Seems like you could just use DEVICE_ATTR*

> +
> +static ssize_t pwr_consumed_show(struct device *dev,
> +                                struct device_attribute *attr, char *buf)
> +{
> +       void __iomem *base;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> +
> +       v = readq(base + FME_PWR_STATUS);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(PWR_CONSUMED, v));
> +}
> +static POWER_ATTR_RO(consumed, pwr_consumed_show);
> +
> +static ssize_t pwr_threshold1_show(struct device *dev,
> +                                  struct device_attribute *attr, char *buf)
> +{
> +       void __iomem *base;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> +
> +       v = readq(base + FME_PWR_THRESHOLD);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(PWR_THRESHOLD1, v));
> +}
> +
> +static ssize_t pwr_threshold1_store(struct device *dev,
> +                                   struct device_attribute *attr,
> +                                   const char *buf, size_t count)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +       void __iomem *base;
> +       u8 threshold;
> +       int ret;
> +       u64 v;
> +
> +       ret = kstrtou8(buf, 0, &threshold);
> +       if (ret)
> +               return ret;
> +
> +       if (threshold > PWR_THRESHOLD_MAX)
> +               return -EINVAL;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> +
> +       mutex_lock(&pdata->lock);
> +       v = readq(base + FME_PWR_THRESHOLD);
> +       v &= ~PWR_THRESHOLD1;
> +       v |= FIELD_PREP(PWR_THRESHOLD1, threshold);
> +       writeq(v, base + FME_PWR_THRESHOLD);
> +       mutex_unlock(&pdata->lock);
> +
> +       return count;
> +}
> +static POWER_ATTR_RW(threshold1, pwr_threshold1_show, pwr_threshold1_store);
> +
> +static ssize_t pwr_threshold2_show(struct device *dev,
> +                                  struct device_attribute *attr, char *buf)
> +{
> +       void __iomem *base;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> +
> +       v = readq(base + FME_PWR_THRESHOLD);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(PWR_THRESHOLD2, v));
> +}
> +
> +static ssize_t pwr_threshold2_store(struct device *dev,
> +                                   struct device_attribute *attr,
> +                                   const char *buf, size_t count)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +       void __iomem *base;
> +       u8 threshold;
> +       int ret;
> +       u64 v;
> +
> +       ret = kstrtou8(buf, 0, &threshold);
> +       if (ret)
> +               return ret;
> +
> +       if (threshold > PWR_THRESHOLD_MAX)
> +               return -EINVAL;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> +
> +       mutex_lock(&pdata->lock);
> +       v = readq(base + FME_PWR_THRESHOLD);
> +       v &= ~PWR_THRESHOLD2;
> +       v |= FIELD_PREP(PWR_THRESHOLD2, threshold);
> +       writeq(v, base + FME_PWR_THRESHOLD);
> +       mutex_unlock(&pdata->lock);
> +
> +       return count;
> +}
> +static POWER_ATTR_RW(threshold2, pwr_threshold2_show, pwr_threshold2_store);
> +
> +static ssize_t pwr_threshold1_status_show(struct device *dev,
> +                                         struct device_attribute *attr,
> +                                         char *buf)
> +{
> +       void __iomem *base;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> +
> +       v = readq(base + FME_PWR_THRESHOLD);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(PWR_THRESHOLD1_STATUS, v));
> +}
> +static POWER_ATTR_RO(threshold1_status, pwr_threshold1_status_show);
> +
> +static ssize_t pwr_threshold2_status_show(struct device *dev,
> +                                         struct device_attribute *attr,
> +                                         char *buf)
> +{
> +       void __iomem *base;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> +
> +       v = readq(base + FME_PWR_THRESHOLD);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(PWR_THRESHOLD2_STATUS, v));
> +}
> +static POWER_ATTR_RO(threshold2_status, pwr_threshold2_status_show);
> +
> +static ssize_t ltr_show(struct device *dev,
> +                       struct device_attribute *attr, char *buf)
> +{
> +       void __iomem *base;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> +
> +       v = readq(base + FME_PWR_STATUS);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
> +}
> +static POWER_ATTR_RO(ltr, ltr_show);
> +
> +static ssize_t xeon_limit_show(struct device *dev,
> +                              struct device_attribute *attr, char *buf)
> +{
> +       void __iomem *base;
> +       u16 xeon_limit = 0;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> +
> +       v = readq(base + FME_PWR_XEON_LIMIT);
> +
> +       if (FIELD_GET(XEON_PWR_EN, v))
> +               xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n", xeon_limit);
> +}
> +static POWER_ATTR_RO(xeon_limit, xeon_limit_show);
> +
> +static ssize_t fpga_limit_show(struct device *dev,
> +                              struct device_attribute *attr, char *buf)
> +{
> +       void __iomem *base;
> +       u16 fpga_limit = 0;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> +
> +       v = readq(base + FME_PWR_FPGA_LIMIT);
> +
> +       if (FIELD_GET(FPGA_PWR_EN, v))
> +               fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n", fpga_limit);
> +}
> +static POWER_ATTR_RO(fpga_limit, fpga_limit_show);
> +
> +static struct attribute *power_mgmt_attrs[] = {
> +       &power_attr_consumed.attr,
> +       &power_attr_threshold1.attr,
> +       &power_attr_threshold2.attr,
> +       &power_attr_threshold1_status.attr,
> +       &power_attr_threshold2_status.attr,
> +       &power_attr_xeon_limit.attr,
> +       &power_attr_fpga_limit.attr,
> +       &power_attr_ltr.attr,

This is a nit, but I would expect to see these listed in the same
order as their show/store functions above.  So ltr_attr would come
between threshold2_status_attr and xeon_limit_attr.

> +       NULL,
> +};
> +
> +static struct attribute_group power_mgmt_attr_group = {
> +       .attrs  = power_mgmt_attrs,
> +       .name   = "power_mgmt",
> +};
> +
> +static int fme_power_mgmt_init(struct platform_device *pdev,
> +                              struct dfl_feature *feature)
> +{
> +       return sysfs_create_group(&pdev->dev.kobj, &power_mgmt_attr_group);
> +}
> +
> +static void fme_power_mgmt_uinit(struct platform_device *pdev,
> +                                struct dfl_feature *feature)
> +{
> +       sysfs_remove_group(&pdev->dev.kobj, &power_mgmt_attr_group);
> +}
> +
> +static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
> +       {.id = FME_FEATURE_ID_POWER_MGMT,},
> +       {0,}
> +};
> +
> +static const struct dfl_feature_ops fme_power_mgmt_ops = {
> +       .init = fme_power_mgmt_init,
> +       .uinit = fme_power_mgmt_uinit,
> +};
> +
>  static struct dfl_feature_driver fme_feature_drvs[] = {
>         {
>                 .id_table = fme_hdr_id_table,
> @@ -429,6 +682,10 @@ static struct dfl_feature_driver fme_feature_drvs[] = {
>                 .ops = &fme_thermal_mgmt_ops,
>         },
>         {
> +               .id_table = fme_power_mgmt_id_table,
> +               .ops = &fme_power_mgmt_ops,
> +       },
> +       {
>                 .ops = NULL,
>         },
>  };
> --
> 2.7.4
>

Thanks,
Alan

^ permalink raw reply

* Re: [RFC PATCH] fork: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-11 18:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Jann Horn, David Howells, Linux API,
	Linux List Kernel Mailing, Serge E. Hallyn, Andrew Lutomirski,
	Arnd Bergmann, Eric W. Biederman, Kees Cook, Alexey Dobriyan,
	Thomas Gleixner, Michael Kerrisk-manpages, Jonathan Kowalski,
	Dmitry V. Levin, Andrew Morton, Oleg Nesterov, Aleksa Sarai,
	Joel Fernandes, Daniel
In-Reply-To: <CAHk-=wifyY+XGNW=ZC4MyTHD14w81F8JjQNH-GaGAm2RxZ_S8Q@mail.gmail.com>

On April 11, 2019 6:50:47 PM GMT+02:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Wed, Apr 10, 2019 at 4:43 PM Christian Brauner
><christian@brauner.io> wrote:
>>
>> RFC-2:
>> This alternative patchset uses anonymous file descriptors instead of
>> file descriptors from /proc/<pid>.
>
>I think I prefer this one. Your diffstat makes it look bigger, but
>that's because you added the example code to use this to that rfc..

Jann and I think this is the correct way to proceed as well.
And this will be way more acceptable for Al too we think.
I have spoken to Joel just now and he's
happy to change RFC patchsets related to
pidfds he has to the anon inode approach as well.
I'll take care to do the necessary coordination before anything
gets sent your way before the next merge window.

>
>Plus making anon_inodes unconditional makes sense anyway.  They are
>always selected in practice to begin with.

Yes, indeed. I coordinated this with David and Al.
They need anonymous inodes for the mount API as well
and thus need anon inodes useable in core vfs functions.

Christian

^ permalink raw reply

* [PATCH v3] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
From: Amir Goldstein @ 2019-04-11 17:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Al Viro, linux-mm, linux-api,
	linux-fsdevel, linux-kernel

Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
writeback") claims that sync_file_range(2) syscall was "created for
userspace to be able to issue background writeout and so waiting for
in-flight IO is undesirable there" and changes the writeback (back) to
WB_SYNC_NONE.

This claim is only partially true. It is true for users that use the flag
SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
the reason for changing to WB_SYNC_NONE writeback.

However, that claim is not true for users that use that flag combination
SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.
Those users explicitly requested to wait for in-flight IO as well as to
writeback of dirty pages.

Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT
and use the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
writeback, to perform the full range sync request.

Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
Acked-by: Jan Kara <jack@suse.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Andrew,

Oops found a braino in my patch. Here is v3.

You may find the discussion on V1 here:
https://lore.kernel.org/lkml/20190409114922.30095-1-amir73il@gmail.com/

Thanks,
Amir.

Changes since v2:
- Return after filemap_write_and_wait_range()

Changes since v1:
- Remove non-guaranties of the API from commit message
- Added ACK by Jan

 fs/sync.c               | 24 +++++++++++++++++-------
 include/uapi/linux/fs.h |  3 +++
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index b54e0541ad89..3a923652d720 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -18,8 +18,8 @@
 #include <linux/backing-dev.h>
 #include "internal.h"
 
-#define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \
-			SYNC_FILE_RANGE_WAIT_AFTER)
+#define VALID_FLAGS (SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WRITE_AND_WAIT | \
+		     SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WAIT_AFTER)
 
 /*
  * Do the filesystem syncing work. For simple filesystems
@@ -235,9 +235,9 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
 }
 
 /*
- * sys_sync_file_range() permits finely controlled syncing over a segment of
+ * ksys_sync_file_range() permits finely controlled syncing over a segment of
  * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
- * zero then sys_sync_file_range() will operate from offset out to EOF.
+ * zero then ksys_sync_file_range() will operate from offset out to EOF.
  *
  * The flag bits are:
  *
@@ -254,7 +254,7 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
  * Useful combinations of the flag bits are:
  *
  * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages
- * in the range which were dirty on entry to sys_sync_file_range() are placed
+ * in the range which were dirty on entry to ksys_sync_file_range() are placed
  * under writeout.  This is a start-write-for-data-integrity operation.
  *
  * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
@@ -266,10 +266,13 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
  * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait
  * for that operation to complete and to return the result.
  *
- * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER:
+ * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
+ * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT):
  * a traditional sync() operation.  This is a write-for-data-integrity operation
  * which will ensure that all pages in the range which were dirty on entry to
- * sys_sync_file_range() are committed to disk.
+ * ksys_sync_file_range() are written to disk.  It should be noted that disk
+ * caches are not flushed by this call, so there are no guarantees here that the
+ * data will be available on disk after a crash.
  *
  *
  * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
@@ -338,6 +341,13 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
 
 	mapping = f.file->f_mapping;
 	ret = 0;
+	if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
+		     SYNC_FILE_RANGE_WRITE_AND_WAIT) {
+		/* Unlike SYNC_FILE_RANGE_WRITE alone uses WB_SYNC_ALL */
+		ret = filemap_write_and_wait_range(mapping, offset, endbyte);
+		goto out_put;
+	}
+
 	if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
 		ret = file_fdatawait_range(f.file, offset, endbyte);
 		if (ret < 0)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 121e82ce296b..59c71fa8c553 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -320,6 +320,9 @@ struct fscrypt_key {
 #define SYNC_FILE_RANGE_WAIT_BEFORE	1
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
+#define SYNC_FILE_RANGE_WRITE_AND_WAIT	(SYNC_FILE_RANGE_WRITE | \
+					 SYNC_FILE_RANGE_WAIT_BEFORE | \
+					 SYNC_FILE_RANGE_WAIT_AFTER)
 
 /*
  * Flags for preadv2/pwritev2:
-- 
2.17.1

^ permalink raw reply related

* Re: [RFC PATCH] fork: add CLONE_PIDFD
From: Linus Torvalds @ 2019-04-11 16:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Jann Horn, David Howells, Linux API,
	Linux List Kernel Mailing, Serge E. Hallyn, Andrew Lutomirski,
	Arnd Bergmann, Eric W. Biederman, Kees Cook, Alexey Dobriyan,
	Thomas Gleixner, Michael Kerrisk-manpages, Jonathan Kowalski,
	Dmitry V. Levin, Andrew Morton, Oleg Nesterov, Aleksa Sarai,
	Joel Fernandes, Daniel
In-Reply-To: <20190410234045.29846-1-christian@brauner.io>

On Wed, Apr 10, 2019 at 4:43 PM Christian Brauner <christian@brauner.io> wrote:
>
> RFC-2:
> This alternative patchset uses anonymous file descriptors instead of
> file descriptors from /proc/<pid>.

I think I prefer this one. Your diffstat makes it look bigger, but
that's because you added the example code to use this to that rfc..

Plus making anon_inodes unconditional makes sense anyway.  They are
always selected in practice to begin with.

               Linus

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox