Linux userland API discussions
 help / color / mirror / Atom feed
* [PATCH v6 4/6] namei: aggressively check for nd->root escape on ".." resolution
From: Aleksa Sarai @ 2019-05-06 16:54 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells
  Cc: Aleksa Sarai, Jann Horn, Kees Cook, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov,
	Christian Brauner, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Aleksa Sarai, Linus Torvalds, containers,
	linux-fsdevel, linux-api, linux-kernel, linux-arch
In-Reply-To: <20190506165439.9155-1-cyphar@cyphar.com>

This patch allows for O_BENEATH and O_THISROOT to safely permit ".."
resolution (in the case of O_BENEATH the resolution will still fail if
".." resolution would resolve a path outside of the root -- while
O_THISROOT will chroot(2)-style scope it). "magic link" jumps are still
disallowed entirely because now they could result in inconsistent
behaviour if resolution encounters a subsequent "..".

The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
O_THISROOT and O_BENEATH) where a rename(2) of a path can be used to
"skip over" nd->root and thus escape to the filesystem above nd->root.

  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat(dirb, "b/c/../../etc/shadow", O_THISROOT);

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or "magic link"). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of O_THISROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).

In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
have occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.

The use of path_is_under() here might seem suspect, but on further
inspection of the most important race (a path was *inside* the root but
is now *outside*), there appears to be no attack potential. If
path_is_under() occurs before the rename, then the path will be resolved
but since the path was originally inside the root there is no escape.
Subsequent ".." jumps are guaranteed to check path_is_under() (by
construction, &rename_lock or &mount_lock must have been taken by the
attacker after path_is_under() returned in the victim), and thus will
not be able to escape from the previously-inside-root path. Walking down
is still safe since the entire subtree was moved (either by rename(2) or
MS_MOVE) and because (as discussed above) walking down is safe.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 3a3cba593b85..2b6a1bf4e745 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
 	struct path	root;
 	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
-	unsigned	seq, m_seq;
+	unsigned	seq, m_seq, r_seq;
 	int		last_type;
 	unsigned	depth;
 	int		total_link_count;
@@ -1739,19 +1739,35 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
-		/*
-		 * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
-		 * cause our parent to have moved outside of the root and us to skip
-		 * over it.
-		 */
-		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
-			return -EXDEV;
+		int error = 0;
+
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (nd->flags & LOOKUP_RCU) {
-			return follow_dotdot_rcu(nd);
-		} else
-			return follow_dotdot(nd);
+		if (nd->flags & LOOKUP_RCU)
+			error = follow_dotdot_rcu(nd);
+		else
+			error = follow_dotdot(nd);
+		if (error)
+			return error;
+
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
+			bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+			bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+			/*
+			 * Don't bother checking unless there's a racing
+			 * rename(2) or MS_MOVE.
+			 */
+			if (likely(!m_retry && !r_retry))
+				return 0;
+
+			if (m_retry && !(nd->flags & LOOKUP_RCU))
+				nd->m_seq = read_seqbegin(&mount_lock);
+			if (r_retry)
+				nd->r_seq = read_seqbegin(&rename_lock);
+			if (!path_is_under(&nd->path, &nd->root))
+				return -EXDEV;
+		}
 	}
 	return 0;
 }
@@ -2272,6 +2288,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
+
+	nd->m_seq = read_seqbegin(&mount_lock);
+	if (unlikely(flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
+		nd->r_seq = read_seqbegin(&rename_lock);
+
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -2282,7 +2303,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 		if (flags & LOOKUP_RCU) {
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 			nd->root_seq = nd->seq;
-			nd->m_seq = read_seqbegin(&mount_lock);
 		} else {
 			path_get(&nd->path);
 		}
@@ -2293,8 +2313,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 
-	nd->m_seq = read_seqbegin(&mount_lock);
-
 	if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
-- 
2.21.0

^ permalink raw reply related

* [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Aleksa Sarai @ 2019-05-06 16:54 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Christian Brauner,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linus Torvalds, containers, linux-fsdevel,
	linux-api, linux-kernel, linux-arch
In-Reply-To: <20190506165439.9155-1-cyphar@cyphar.com>

The need to be able to scope path resolution of interpreters became
clear with one of the possible vectors used in CVE-2019-5736 (which
most major container runtimes were vulnerable to).

Naively, it might seem that openat(2) -- which supports path scoping --
can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
binary being executed. Unfortunately, a "bad binary" (usually a symlink)
could be written as a #!-style script with the symlink target as the
interpreter -- which would be completely missed by just scoping the
openat(2). An example of this being exploitable is CVE-2019-5736.

In order to get around this, we need to pass down to each binfmt_*
implementation the scoping flags requested in execveat(2). In order to
maintain backwards-compatibility we only pass the scoping AT_* flags.

To avoid breaking userspace (in the exceptionally rare cases where you
have #!-scripts with a relative path being execveat(2)-ed with dfd !=
AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
set in execveat(2).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/binfmt_elf.c            |  2 +-
 fs/binfmt_elf_fdpic.c      |  2 +-
 fs/binfmt_em86.c           |  4 ++--
 fs/binfmt_misc.c           |  2 +-
 fs/binfmt_script.c         |  2 +-
 fs/exec.c                  | 26 ++++++++++++++++++++++----
 include/linux/binfmts.h    |  1 +
 include/linux/fs.h         |  9 +++++++--
 include/uapi/linux/fcntl.h |  6 ++++++
 9 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7d09d125f148..473420eed477 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -772,7 +772,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			if (elf_interpreter[elf_ppnt->p_filesz - 1] != '\0')
 				goto out_free_interp;
 
-			interpreter = open_exec(elf_interpreter);
+			interpreter = openat_exec(bprm->dfd, elf_interpreter, bprm->flags);
 			retval = PTR_ERR(interpreter);
 			if (IS_ERR(interpreter))
 				goto out_free_interp;
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index b53bb3729ac1..c463c6428f77 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -263,7 +263,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 			kdebug("Using ELF interpreter %s", interpreter_name);
 
 			/* replace the program with the interpreter */
-			interpreter = open_exec(interpreter_name);
+			interpreter = openat_exec(bprm->dfd, interpreter_name, bprm->flags);
 			retval = PTR_ERR(interpreter);
 			if (IS_ERR(interpreter)) {
 				interpreter = NULL;
diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c
index dd2d3f0cd55d..3ee46b0dc0d4 100644
--- a/fs/binfmt_em86.c
+++ b/fs/binfmt_em86.c
@@ -81,10 +81,10 @@ static int load_em86(struct linux_binprm *bprm)
 
 	/*
 	 * OK, now restart the process with the interpreter's inode.
-	 * Note that we use open_exec() as the name is now in kernel
+	 * Note that we use openat_exec() as the name is now in kernel
 	 * space, and we don't need to copy it.
 	 */
-	file = open_exec(interp);
+	file = openat_exec(binprm->dfd, interp, binprm->flags);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index aa4a7a23ff99..573ef06ff5a1 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -209,7 +209,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
 		if (!IS_ERR(interp_file))
 			deny_write_access(interp_file);
 	} else {
-		interp_file = open_exec(fmt->interpreter);
+		interp_file = openat_exec(bprm->dfd, fmt->interpreter, bprm->flags);
 	}
 	retval = PTR_ERR(interp_file);
 	if (IS_ERR(interp_file))
diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
index e996174cbfc0..aaff12d12bfd 100644
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -137,7 +137,7 @@ static int load_script(struct linux_binprm *bprm)
 	/*
 	 * OK, now restart the process with the interpreter's dentry.
 	 */
-	file = open_exec(i_name);
+	file = openat_exec(bprm->dfd, i_name, bprm->flags);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
diff --git a/fs/exec.c b/fs/exec.c
index 2e0033348d8e..d0f244d9a541 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -846,12 +846,24 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 		.lookup_flags = LOOKUP_FOLLOW,
 	};
 
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH | AT_BENEATH |
+		       AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
+		       AT_THIS_ROOT)) != 0)
 		return ERR_PTR(-EINVAL);
 	if (flags & AT_SYMLINK_NOFOLLOW)
 		open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
 	if (flags & AT_EMPTY_PATH)
 		open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
+	if (flags & AT_BENEATH)
+		open_exec_flags.lookup_flags |= LOOKUP_BENEATH;
+	if (flags & AT_XDEV)
+		open_exec_flags.lookup_flags |= LOOKUP_XDEV;
+	if (flags & AT_NO_MAGICLINKS)
+		open_exec_flags.lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	if (flags & AT_NO_SYMLINKS)
+		open_exec_flags.lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (flags & AT_THIS_ROOT)
+		open_exec_flags.lookup_flags |= LOOKUP_IN_ROOT;
 
 	file = do_filp_open(fd, name, &open_exec_flags);
 	if (IS_ERR(file))
@@ -879,18 +891,18 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 	return ERR_PTR(err);
 }
 
-struct file *open_exec(const char *name)
+struct file *openat_exec(int dfd, const char *name, int flags)
 {
 	struct filename *filename = getname_kernel(name);
 	struct file *f = ERR_CAST(filename);
 
 	if (!IS_ERR(filename)) {
-		f = do_open_execat(AT_FDCWD, filename, 0);
+		f = do_open_execat(dfd, filename, flags);
 		putname(filename);
 	}
 	return f;
 }
-EXPORT_SYMBOL(open_exec);
+EXPORT_SYMBOL(openat_exec);
 
 int kernel_read_file(struct file *file, void **buf, loff_t *size,
 		     loff_t max_size, enum kernel_read_file_id id)
@@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
 
 	sched_exec();
 
+	bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
+			       AT_THIS_ROOT);
+	bprm->dfd = AT_FDCWD;
+	if (bprm->flags)
+		bprm->dfd = fd;
+
 	bprm->file = file;
 	if (!filename) {
 		bprm->filename = "none";
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 688ab0de7810..e4da2d36e97f 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -50,6 +50,7 @@ struct linux_binprm {
 	unsigned int taso:1;
 #endif
 	unsigned int recursion_depth; /* only for search_binary_handler() */
+	int dfd, flags;		/* passed down to execat_open() */
 	struct file * file;
 	struct cred *cred;	/* new credentials */
 	int unsafe;		/* how unsafe this exec is (mask of LSM_UNSAFE_*) */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fe94c48481a6..cdcd2e021101 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2945,8 +2945,13 @@ extern int kernel_read_file_from_fd(int, void **, loff_t *, loff_t,
 extern ssize_t kernel_read(struct file *, void *, size_t, loff_t *);
 extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *);
 extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *);
-extern struct file * open_exec(const char *);
- 
+
+extern struct file *openat_exec(int, const char *, int);
+static inline struct file *open_exec(const char *name)
+{
+	return openat_exec(AT_FDCWD, name, 0);
+}
+
 /* fs/dcache.c -- generic fs support functions */
 extern bool is_subdir(struct dentry *, struct dentry *);
 extern bool path_is_under(const struct path *, const struct path *);
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..bfca7f87cd2a 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -93,5 +93,11 @@
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
+#define AT_RESOLUTION_TYPE	0x1F0000 /* Type of path-resolution scoping we are applying. */
+#define AT_BENEATH		0x010000 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */
+#define AT_XDEV			0x020000 /* - Block mount-point crossings (includes bind-mounts). */
+#define AT_NO_MAGICLINKS	0x040000 /* - Block procfs-style "magic" symlinks. */
+#define AT_NO_SYMLINKS		0x080000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
+#define AT_THIS_ROOT		0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.21.0

^ permalink raw reply related

* [PATCH v6 6/6] namei: resolveat(2) syscall
From: Aleksa Sarai @ 2019-05-06 16:54 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells
  Cc: Aleksa Sarai, Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linus Torvalds, containers, linux-fsdevel,
	linux-api, linux-kernel, linux-arch
In-Reply-To: <20190506165439.9155-1-cyphar@cyphar.com>

The most obvious syscall to add support for the new LOOKUP_* scoping
flags would be openat(2) (along with the required execveat(2) change
included in this series). However, there are a few reasons to not do
this:

 * The new LOOKUP_* flags are intended to be security features, and
   openat(2) will silently ignore all unknown flags. This means that
   users would need to avoid foot-gunning themselves constantly when
   using this interface if it were part of openat(2).

 * Resolution scoping feels like a different operation to the existing
   O_* flags. And since openat(2) has limited flag space, it seems to be
   quite wasteful to clutter it with 5 flags that are all
   resolution-related. Arguably O_NOFOLLOw is also a resolution flag but
   its entire purpose is to error out if you encounter a trailing
   symlink not to scope resolution.

 * Other systems would be able to reimplement this syscall allowing for
   cross-OS standardisation rather than being hidden amongst O_* flags
   which may result in it not being used by all the parties that might
   want to use it (file servers, web servers, container runtimes, etc).

 * It gives us the opportunity to iterate on the O_PATH interface in the
   future. There are some potential security improvements that can be
   made to O_PATH (handling /proc/self/fd re-opening of file descriptors
   much more sanely) which could be made even better with some other
   bits (such as ACC_MODE bits which work for O_PATH).

To this end, we introduce the resolveat(2) syscall. At the moment it's
effectively another way of getting a bog-standard O_PATH descriptor but
with the ability to use the new LOOKUP_* flags.

Because resolveat(2) only provides the ability to get O_PATH
descriptors, users will need to get creative with /proc/self/fd in order
to get a usable file descriptor for other uses. However, in future we
can add O_EMPTYPATH support to openat(2) which would allow for
re-opening without procfs (though as mentioned above there are some
security improvements that should be made to the interfaces).

NOTE: This patch adds the syscall to all architectures using the new
      unified syscall numbering, but several architectures are missing
      newer (nr > 423) syscalls -- hence the uneven gaps in the syscall
      tables.

Cc: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 fs/namei.c                                  | 46 +++++++++++++++++++++
 include/uapi/linux/fcntl.h                  | 12 ++++++
 18 files changed, 74 insertions(+)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 63ed39cbd3bd..72f431b1dc9c 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -461,5 +461,6 @@
 530	common	getegid				sys_getegid
 531	common	geteuid				sys_geteuid
 532	common	getppid				sys_getppid
+533	common	resolveat			sys_resolveat
 # all other architectures have common numbers for new syscall, alpha
 # is the exception.
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 9016f4081bb9..1bc0282a67f7 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -437,3 +437,4 @@
 421	common	rt_sigtimedwait_time64		sys_rt_sigtimedwait
 422	common	futex_time64			sys_futex
 423	common	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index ab9cda5f6136..d3ae73ffaf48 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -344,3 +344,4 @@
 332	common	pkey_free			sys_pkey_free
 333	common	rseq				sys_rseq
 # 334 through 423 are reserved to sync up with other architectures
+428	common	resolveat			sys_resolveat
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 125c14178979..81b7389e9e58 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -423,3 +423,4 @@
 421	common	rt_sigtimedwait_time64		sys_rt_sigtimedwait
 422	common	futex_time64			sys_futex
 423	common	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 8ee3a8c18498..626aed10e2b5 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -429,3 +429,4 @@
 421	common	rt_sigtimedwait_time64		sys_rt_sigtimedwait
 422	common	futex_time64			sys_futex
 423	common	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 15f4117900ee..8cbd6032d6bf 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -362,3 +362,4 @@
 421	n32	rt_sigtimedwait_time64		compat_sys_rt_sigtimedwait_time64
 422	n32	futex_time64			sys_futex
 423	n32	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	n32	resolveat			sys_resolveat
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index c85502e67b44..234923a1fc88 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -338,3 +338,4 @@
 327	n64	rseq				sys_rseq
 328	n64	io_pgetevents			sys_io_pgetevents
 # 329 through 423 are reserved to sync up with other architectures
+428	n64	resolveat			sys_resolveat
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 2e063d0f837e..7b4586acf35d 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -411,3 +411,4 @@
 421	o32	rt_sigtimedwait_time64		sys_rt_sigtimedwait		compat_sys_rt_sigtimedwait_time64
 422	o32	futex_time64			sys_futex			sys_futex
 423	o32	sched_rr_get_interval_time64	sys_sched_rr_get_interval	sys_sched_rr_get_interval
+428	o32	resolveat			sys_resolveat			sys_resolveat
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index b26766c6647d..19a9a92dc5f8 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -420,3 +420,4 @@
 421	32	rt_sigtimedwait_time64		sys_rt_sigtimedwait		compat_sys_rt_sigtimedwait_time64
 422	32	futex_time64			sys_futex			sys_futex
 423	32	sched_rr_get_interval_time64	sys_sched_rr_get_interval	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat			sys_resolveat
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index b18abb0c3dae..bfcd75b928de 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -505,3 +505,4 @@
 421	32	rt_sigtimedwait_time64		sys_rt_sigtimedwait		compat_sys_rt_sigtimedwait_time64
 422	32	futex_time64			sys_futex			sys_futex
 423	32	sched_rr_get_interval_time64	sys_sched_rr_get_interval	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat			sys_resolveat
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 02579f95f391..084e51f02e65 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -426,3 +426,4 @@
 421	32	rt_sigtimedwait_time64	-				compat_sys_rt_sigtimedwait_time64
 422	32	futex_time64		-				sys_futex
 423	32	sched_rr_get_interval_time64	-			sys_sched_rr_get_interval
+428	common	resolveat		sys_resolveat			-
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index bfda678576e4..e9115c5cec72 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -426,3 +426,4 @@
 421	common	rt_sigtimedwait_time64		sys_rt_sigtimedwait
 422	common	futex_time64			sys_futex
 423	common	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index b9a5a04b2d2c..2d3fdd913d89 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
 421	32	rt_sigtimedwait_time64		sys_rt_sigtimedwait		compat_sys_rt_sigtimedwait_time64
 422	32	futex_time64			sys_futex			sys_futex
 423	32	sched_rr_get_interval_time64	sys_sched_rr_get_interval	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat			sys_resolveat
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..101feef22473 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	resolveat		sys_resolveat			__ia32_sys_resolveat
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..c7c197bf07bc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	resolveat		__x64_sys_resolveat
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 6af49929de85..86c44f15dfa4 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -394,3 +394,4 @@
 421	common	rt_sigtimedwait_time64		sys_rt_sigtimedwait
 422	common	futex_time64			sys_futex
 423	common	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat
diff --git a/fs/namei.c b/fs/namei.c
index 2b6a1bf4e745..30db5833aac3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3656,6 +3656,52 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
 	return filp;
 }
 
+SYSCALL_DEFINE3(resolveat, int, dfd, const char __user *, path,
+		unsigned long, flags)
+{
+	int fd;
+	struct filename *tmp;
+	struct open_flags op = {
+		.open_flag = O_PATH,
+	};
+
+	if (flags & ~VALID_RESOLVE_FLAGS)
+		return -EINVAL;
+
+	if (flags & RESOLVE_CLOEXEC)
+		op.open_flag |= O_CLOEXEC;
+	if (!(flags & RESOLVE_NOFOLLOW))
+		op.lookup_flags |= LOOKUP_FOLLOW;
+	if (flags & RESOLVE_BENEATH)
+		op.lookup_flags |= LOOKUP_BENEATH;
+	if (flags & RESOLVE_XDEV)
+		op.lookup_flags |= LOOKUP_XDEV;
+	if (flags & RESOLVE_NO_MAGICLINKS)
+		op.lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	if (flags & RESOLVE_NO_SYMLINKS)
+		op.lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (flags & RESOLVE_THIS_ROOT)
+		op.lookup_flags |= LOOKUP_IN_ROOT;
+
+	tmp = getname(path);
+	if (IS_ERR(tmp))
+		return PTR_ERR(tmp);
+
+	fd = get_unused_fd_flags(op.open_flag);
+	if (fd >= 0) {
+		struct file *f = do_filp_open(dfd, tmp, &op);
+		if (IS_ERR(f)) {
+			put_unused_fd(fd);
+			fd = PTR_ERR(f);
+		} else {
+			fsnotify_open(f);
+			fd_install(fd, f);
+		}
+	}
+	putname(tmp);
+	return fd;
+}
+
 struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 		const char *name, const struct open_flags *op)
 {
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index bfca7f87cd2a..d5a2a97929c6 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -100,4 +100,16 @@
 #define AT_NO_SYMLINKS		0x080000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
 #define AT_THIS_ROOT		0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
 
+/* First two bits of RESOLVE_* are reserved for future ACC_MODE extensions. */
+#define RESOLVE_CLOEXEC		0x004 /* Set O_CLOEXEC on the returned fd. */
+#define RESOLVE_NOFOLLOW	0x008 /* Don't follow trailing symlinks. */
+#define RESOLVE_RESOLUTION_TYPE	0x1F0 /* Type of path-resolution scoping we are applying. */
+#define RESOLVE_BENEATH		0x010 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */
+#define RESOLVE_XDEV		0x020 /* - Block mount-point crossings (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS	0x040 /* - Block procfs-style "magic" symlinks. */
+#define RESOLVE_NO_SYMLINKS	0x080 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
+#define RESOLVE_THIS_ROOT	0x100 /* - Scope ".." resolution to dirfd (like chroot(2)). */
+
+#define VALID_RESOLVE_FLAGS	(RESOLVE_CLOEXEC | RESOLVE_NOFOLLOW | RESOLVE_RESOLUTION_TYPE)
+
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Jann Horn @ 2019-05-06 18:37 UTC (permalink / raw)
  To: Aleksa Sarai, Andy Lutomirski
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Eric Biederman, Andrew Morton, Alexei Starovoitov,
	Kees Cook, Christian Brauner, Tycho Andersen, David Drysdale,
	Chanho Min, Oleg Nesterov, Aleksa Sarai, Linus Torvalds,
	containers, linux-fsdevel, Linux API, kernel list
In-Reply-To: <20190506165439.9155-6-cyphar@cyphar.com>

On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> The need to be able to scope path resolution of interpreters became
> clear with one of the possible vectors used in CVE-2019-5736 (which
> most major container runtimes were vulnerable to).
>
> Naively, it might seem that openat(2) -- which supports path scoping --
> can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
> binary being executed. Unfortunately, a "bad binary" (usually a symlink)
> could be written as a #!-style script with the symlink target as the
> interpreter -- which would be completely missed by just scoping the
> openat(2). An example of this being exploitable is CVE-2019-5736.
>
> In order to get around this, we need to pass down to each binfmt_*
> implementation the scoping flags requested in execveat(2). In order to
> maintain backwards-compatibility we only pass the scoping AT_* flags.
>
> To avoid breaking userspace (in the exceptionally rare cases where you
> have #!-scripts with a relative path being execveat(2)-ed with dfd !=
> AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
> set in execveat(2).

This seems extremely dangerous. I like the overall series, but not this patch.

> @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
>
>         sched_exec();
>
> +       bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
> +                              AT_THIS_ROOT);
[...]
> +#define AT_THIS_ROOT           0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */

So now what happens if there is a setuid root ELF binary with program
interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
going to let the unprivileged user decide which interpreter the
setuid-root process should use? From a high-level perspective, opening
the interpreter should be controlled by the program that is being
loaded, not by the program that invoked it.


In my opinion, CVE-2019-5736 points out two different problems:

The big problem: The __ptrace_may_access() logic has a special-case
short-circuit for "introspection" that you can't opt out of; this
makes it possible to open things in procfs that are related to the
current process even if the credentials of the process wouldn't permit
accessing another process like it. I think the proper fix to deal with
this would be to add a prctl() flag for "set whether introspection is
allowed for this process", and if userspace has manually un-set that
flag, any introspection special-case logic would be skipped.

An additional problem: /proc/*/exe can be used to open a file for
writing; I think it may have been Andy Lutomirski who pointed out some
time ago that it would be nice if you couldn't use /proc/*/fd/* to
re-open files with more privileges, which is sort of the same thing.

^ permalink raw reply

* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Aleksa Sarai @ 2019-05-06 19:17 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Al Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, David Howells, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Christian Brauner, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, containers, linux-fsdevel, Linux API
In-Reply-To: <CAG48ez0-CiODf6UBHWTaog97prx=VAd3HgHvEjdGNz344m1xKw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4411 bytes --]

On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > The need to be able to scope path resolution of interpreters became
> > clear with one of the possible vectors used in CVE-2019-5736 (which
> > most major container runtimes were vulnerable to).
> >
> > Naively, it might seem that openat(2) -- which supports path scoping --
> > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
> > binary being executed. Unfortunately, a "bad binary" (usually a symlink)
> > could be written as a #!-style script with the symlink target as the
> > interpreter -- which would be completely missed by just scoping the
> > openat(2). An example of this being exploitable is CVE-2019-5736.
> >
> > In order to get around this, we need to pass down to each binfmt_*
> > implementation the scoping flags requested in execveat(2). In order to
> > maintain backwards-compatibility we only pass the scoping AT_* flags.
> >
> > To avoid breaking userspace (in the exceptionally rare cases where you
> > have #!-scripts with a relative path being execveat(2)-ed with dfd !=
> > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
> > set in execveat(2).
> 
> This seems extremely dangerous. I like the overall series, but not this patch.
> 
> > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
> >
> >         sched_exec();
> >
> > +       bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
> > +                              AT_THIS_ROOT);
> [...]
> > +#define AT_THIS_ROOT           0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
> 
> So now what happens if there is a setuid root ELF binary with program
> interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
> unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
> going to let the unprivileged user decide which interpreter the
> setuid-root process should use? From a high-level perspective, opening
> the interpreter should be controlled by the program that is being
> loaded, not by the program that invoked it.

I went a bit nuts with openat_exec(), and I did end up adding it to the
ELF interpreter lookup (and you're completely right that this is a bad
idea -- I will drop it from this patch if it's included in the next
series).

The proposed solutions you give below are much nicer than this patch so
I can drop it and work on fixing those issues separately.

> In my opinion, CVE-2019-5736 points out two different problems:
>
> The big problem: The __ptrace_may_access() logic has a special-case
> short-circuit for "introspection" that you can't opt out of; this
> makes it possible to open things in procfs that are related to the
> current process even if the credentials of the process wouldn't permit
> accessing another process like it. I think the proper fix to deal with
> this would be to add a prctl() flag for "set whether introspection is
> allowed for this process", and if userspace has manually un-set that
> flag, any introspection special-case logic would be skipped.

We could do PR_SET_DUMPABLE=3 for this, I guess?

> An additional problem: /proc/*/exe can be used to open a file for
> writing; I think it may have been Andy Lutomirski who pointed out some
> time ago that it would be nice if you couldn't use /proc/*/fd/* to
> re-open files with more privileges, which is sort of the same thing.

This is something I'm currently working on a series for, which would
boil down to some restrictions on how re-opening of file descriptors
works through procfs.

However, execveat() of a procfs magiclink is a bit hard to block --
there is no way for userspace to to represent a file being "open for
execute" so they are all "open for execute" by default and blocking it
outright seems a bit extreme (though I actually hope to eventually add
the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence
why I've reserved some bits).

(Thinking more about it, there is an argument that I should include the
above patch into this series so that we can block re-opening of fds
opened through resolveat(2) without explicit flags from the outset.)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Andy Lutomirski @ 2019-05-06 23:41 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jann Horn, Andy Lutomirski, Al Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, David Howells, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Christian Brauner, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, containers, linux-fsdevel, Linux
In-Reply-To: <20190506191735.nmzf7kwfh7b6e2tf@yavin>



> On May 6, 2019, at 12:17 PM, Aleksa Sarai <cyphar@cyphar.com> wrote:
> 
>> On 2019-05-06, Jann Horn <jannh@google.com> wrote:
>>> On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>>> The need to be able to scope path resolution of interpreters became
>>> clear with one of the possible vectors used in CVE-2019-5736 (which
>>> most major container runtimes were vulnerable to).
>>> 
>>> Naively, it might seem that openat(2) -- which supports path scoping --
>>> can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
>>> binary being executed. Unfortunately, a "bad binary" (usually a symlink)
>>> could be written as a #!-style script with the symlink target as the
>>> interpreter -- which would be completely missed by just scoping the
>>> openat(2). An example of this being exploitable is CVE-2019-5736.
>>> 
>>> In order to get around this, we need to pass down to each binfmt_*
>>> implementation the scoping flags requested in execveat(2). In order to
>>> maintain backwards-compatibility we only pass the scoping AT_* flags.
>>> 
>>> To avoid breaking userspace (in the exceptionally rare cases where you
>>> have #!-scripts with a relative path being execveat(2)-ed with dfd !=
>>> AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
>>> set in execveat(2).
>> 
>> This seems extremely dangerous. I like the overall series, but not this patch.
>> 
>>> @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
>>> 
>>>        sched_exec();
>>> 
>>> +       bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
>>> +                              AT_THIS_ROOT);
>> [...]
>>> +#define AT_THIS_ROOT           0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
>> 
>> So now what happens if there is a setuid root ELF binary with program
>> interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
>> unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
>> going to let the unprivileged user decide which interpreter the
>> setuid-root process should use? From a high-level perspective, opening
>> the interpreter should be controlled by the program that is being
>> loaded, not by the program that invoked it.
> 
> I went a bit nuts with openat_exec(), and I did end up adding it to the
> ELF interpreter lookup (and you're completely right that this is a bad
> idea -- I will drop it from this patch if it's included in the next
> series).
> 
> The proposed solutions you give below are much nicer than this patch so
> I can drop it and work on fixing those issues separately.
> 
>> In my opinion, CVE-2019-5736 points out two different problems:
>> 
>> The big problem: The __ptrace_may_access() logic has a special-case
>> short-circuit for "introspection" that you can't opt out of; this
>> makes it possible to open things in procfs that are related to the
>> current process even if the credentials of the process wouldn't permit
>> accessing another process like it. I think the proper fix to deal with
>> this would be to add a prctl() flag for "set whether introspection is
>> allowed for this process", and if userspace has manually un-set that
>> flag, any introspection special-case logic would be skipped.
> 
> We could do PR_SET_DUMPABLE=3 for this, I guess?
> 
>> An additional problem: /proc/*/exe can be used to open a file for
>> writing; I think it may have been Andy Lutomirski who pointed out some
>> time ago that it would be nice if you couldn't use /proc/*/fd/* to
>> re-open files with more privileges, which is sort of the same thing.
> 
> This is something I'm currently working on a series for, which would
> boil down to some restrictions on how re-opening of file descriptors
> works through procfs.
> 
> However, execveat() of a procfs magiclink is a bit hard to block --
> there is no way for userspace to to represent a file being "open for
> execute" so they are all "open for execute" by default and blocking it
> outright seems a bit extreme (though I actually hope to eventually add
> the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence
> why I've reserved some bits).

There’s an O_MAYEXEC series floating around.

> 
> (Thinking more about it, there is an argument that I should include the
> above patch into this series so that we can block re-opening of fds
> opened through resolveat(2) without explicit flags from the outset.)
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>

^ permalink raw reply

* Re: [PATCH v8 03/16] sched/core: uclamp: Enforce last task's UCLAMP_MAX
From: Patrick Bellasi @ 2019-05-07 10:10 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <CAJuCfpHN4kMBScdEdJodtmbHQ2qhVDnXrJKFDdaSYyjWH0JH5Q@mail.gmail.com>

On 17-Apr 13:36, Suren Baghdasaryan wrote:
>  Hi Patrick,
> 
> On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
> > When a task sleeps it removes its max utilization clamp from its CPU.
> > However, the blocked utilization on that CPU can be higher than the max
> > clamp value enforced while the task was running. This allows undesired
> > CPU frequency increases while a CPU is idle, for example, when another
> > CPU on the same frequency domain triggers a frequency update, since
> > schedutil can now see the full not clamped blocked utilization of the
> > idle CPU.
> >
> > Fix this by using
> >   uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
> >     uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)
> > to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
> > condition.
> >
> 
> If I understand the intent correctly, you are trying to exclude idle
> CPUs from affecting calculations of rq UCLAMP_MAX value. If that is
> true I think description can be simplified a bit :)

That's not entirely correct. What I want to avoid is an OPP increase
because of an idle CPU. Maybe an example can explain it better,
consider this sequence:

 1. A task is running unconstrained on a CPUx and it generates a 100%
    utilization
 2. The task is now constrained by setting util_max=20
 3. We now select an OPP which provides 20% capacity on CPUx

In this scenario the task is still running flat out on that CPUx which
will keep it's util_avg to 1024. Note that after Vincet's PELT rewrite
we don't converge down to the current capacity.

 4. The task sleep, it's removed from CPUx but the "blocked
    utilization" is still 1024

After this point: the CPU is idle, its "blocked utilization" starts
to "slowly" decay but we _already_ removed the 20% util_max constraint
on that CPU since there are no RUNNABLE tasks (i.e no active buckets).

At this point in time, if there is a schedutil update requested from
another CPU of the same frequency domain, by looking at CPUx we will
see its full "blocked utilization" signal, which can be above 20%.

> In particular it took me some time to understand what "blocked
> utilization" means, however if it's a widely accepted term then feel
> free to ignore my input.

Yes, "blocked utilization" is a commonly used term to refer to the
utilization generated by tasks executed on a CPU.

[...]

> > +static inline unsigned int
> > +uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
> > +{
> > +       /*
> > +        * Avoid blocked utilization pushing up the frequency when we go
> > +        * idle (which drops the max-clamp) by retaining the last known
> > +        * max-clamp.
> > +        */
> > +       if (clamp_id == UCLAMP_MAX) {
> > +               rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
> > +               return clamp_value;
> > +       }
> > +
> > +       return uclamp_none(UCLAMP_MIN);
> > +}
> > +
> > +static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
> > +                                    unsigned int clamp_value)
> > +{
> > +       /* Reset max-clamp retention only on idle exit */
> > +       if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
> > +               return;
> > +
> > +       WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
> > +}
> > +
> >  static inline
> > -unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> > +unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
> > +                                unsigned int clamp_value)
> 
> IMHO the name of uclamp_rq_max_value() is a bit misleading because:

That's very similar to what you proposed in:

   https://lore.kernel.org/lkml/20190314122256.7wb3ydswpkfmntvf@e110439-lin/

> 1. It does not imply that it has to be called only when there are no
> more runnable tasks on a CPU. This is currently the case because it's
> called only from uclamp_rq_dec_id() and only when bucket->tasks==0 but
> nothing in the name of this function indicates that it can't be called
> from other places.
> 2. It does not imply that it marks rq UCLAMP_FLAG_IDLE.

Even if you call it from other places, which is not required, it does
not arm. That function still return the current max clamp for a CPU
given its current state. If the CPU is idle we set the flag one more
time but that's not a problem too.

However, do you have any other proposal for a better name ?

> >  {
> >         struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> >         int bucket_id = UCLAMP_BUCKETS - 1;
> > @@ -771,7 +798,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> >         }
> >
> >         /* No tasks -- default clamp values */
> > -       return uclamp_none(clamp_id);
> > +       return uclamp_idle_value(rq, clamp_id, clamp_value);
> >  }

[...]

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply

* Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps
From: Patrick Bellasi @ 2019-05-07 10:38 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <CAJuCfpGcN-CWCoo16_xKzohUbBTCY3W5D1E8izhA8wtEjCtF+Q@mail.gmail.com>

On 17-Apr 17:51, Suren Baghdasaryan wrote:
> On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:

[...]

> > +/*
> > + * The effective clamp bucket index of a task depends on, by increasing
> > + * priority:
> > + * - the task specific clamp value, when explicitly requested from userspace
> > + * - the system default clamp value, defined by the sysadmin
> > + */
> > +static inline struct uclamp_se
> > +uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> > +{
> > +       struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> > +       struct uclamp_se uc_max = uclamp_default[clamp_id];
> > +
> > +       /* System default restrictions always apply */
> > +       if (unlikely(uc_req.value > uc_max.value))
> > +               return uc_max;
> > +
> > +       return uc_req;
> > +}
> > +
> > +static inline unsigned int
> > +uclamp_eff_bucket_id(struct task_struct *p, unsigned int clamp_id)
> 
> This function is not used anywhere AFAIKT.

Right, this is the dual of uclamp_eff_value() but, since we don't
actually use it in the corrent code, let's remove it and keep only the
latter.

> uclamp_eff_bucket_id() and
> uclamp_eff_value() look very similar, maybe they can be combined into
> one function returning struct uclamp_se?

I would prefer not since at the callsites of uclamp_eff_value() we
actually need just the value.

> > +{
> > +       struct uclamp_se uc_eff;
> > +
> > +       /* Task currently refcounted: use back-annotated (effective) bucket */
> > +       if (p->uclamp[clamp_id].active)
> > +               return p->uclamp[clamp_id].bucket_id;
> > +
> > +       uc_eff = uclamp_eff_get(p, clamp_id);
> > +
> > +       return uc_eff.bucket_id;
> > +}
> > +
> > +unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
> > +{
> > +       struct uclamp_se uc_eff;
> > +
> > +       /* Task currently refcounted: use back-annotated (effective) value */
> > +       if (p->uclamp[clamp_id].active)
> > +               return p->uclamp[clamp_id].value;
> > +
> > +       uc_eff = uclamp_eff_get(p, clamp_id);
> > +
> > +       return uc_eff.value;
> > +}
> > +

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply

* Re: [PATCH v8 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
From: Patrick Bellasi @ 2019-05-07 11:13 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <CAJuCfpH3htcr3xB_Y4nr7HXCdQd1hOdOAXbtZJB1SOt7Of_qbw@mail.gmail.com>

On 17-Apr 15:26, Suren Baghdasaryan wrote:
> On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:

[...]

> > Do not allow to change sched class specific params and non class
> > specific params (i.e. clamp values) at the same time.  This keeps things
> > simple and still works for the most common cases since we are usually
> > interested in just one of the two actions.
> 
> Sorry, I can't find where you are checking to eliminate the
> possibility of simultaneous changes to both sched class specific
> params and non class specific params... Am I too tired or they are
> indeed missing?

No, you right... that limitation has been removed in v8 :)

I'll remove the above paragraph in v9, thanks for spotting it.

[...]

> > +static int uclamp_validate(struct task_struct *p,
> > +                          const struct sched_attr *attr)
> > +{
> > +       unsigned int lower_bound = p->uclamp_req[UCLAMP_MIN].value;
> > +       unsigned int upper_bound = p->uclamp_req[UCLAMP_MAX].value;
> > +
> > +       if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> > +               lower_bound = attr->sched_util_min;
> > +       if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> > +               upper_bound = attr->sched_util_max;
> > +
> > +       if (lower_bound > upper_bound)
> > +               return -EINVAL;
> > +       if (upper_bound > SCHED_CAPACITY_SCALE)
> > +               return -EINVAL;
> > +
> > +       return 0;
> > +}

[...]

> >  static void uclamp_fork(struct task_struct *p)
> >  {
> >         unsigned int clamp_id;
> > @@ -1056,6 +1100,13 @@ static void __init init_uclamp(void)
> >  #else /* CONFIG_UCLAMP_TASK */
> >  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> >  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> > +static inline int uclamp_validate(struct task_struct *p,
> > +                                 const struct sched_attr *attr)
> > +{
> > +       return -ENODEV;
> 
> ENOSYS might be more appropriate?

Yep, agree, thanks!

> 
> > +}
> > +static void __setscheduler_uclamp(struct task_struct *p,
> > +                                 const struct sched_attr *attr) { }
> >  static inline void uclamp_fork(struct task_struct *p) { }
> >  static inline void init_uclamp(void) { }
> >  #endif /* CONFIG_UCLAMP_TASK */
> > @@ -4424,6 +4475,13 @@ static void __setscheduler_params(struct task_struct *p,
> >  static void __setscheduler(struct rq *rq, struct task_struct *p,
> >                            const struct sched_attr *attr, bool keep_boost)
> >  {
> > +       /*
> > +        * If params can't change scheduling class changes aren't allowed
> > +        * either.
> > +        */
> > +       if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
> > +               return;
> > +
> >         __setscheduler_params(p, attr);
> >
> >         /*
> > @@ -4561,6 +4619,13 @@ static int __sched_setscheduler(struct task_struct *p,
> >                         return retval;
> >         }
> >
> > +       /* Update task specific "requested" clamps */
> > +       if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
> > +               retval = uclamp_validate(p, attr);
> > +               if (retval)
> > +                       return retval;
> > +       }
> > +
> >         /*
> >          * Make sure no PI-waiters arrive (or leave) while we are
> >          * changing the priority of the task:

[...]

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply

* Re: [PATCH v8 08/16] sched/core: uclamp: Set default clamps for RT tasks
From: Patrick Bellasi @ 2019-05-07 11:25 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <CAJuCfpH9E2Khc1WP8MZgLK+yF7GgwW29ECES4hqPN3jaB34RqA@mail.gmail.com>

On 17-Apr 16:07, Suren Baghdasaryan wrote:
> On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
> > By default FAIR tasks start without clamps, i.e. neither boosted nor
> > capped, and they run at the best frequency matching their utilization
> > demand.  This default behavior does not fit RT tasks which instead are
> > expected to run at the maximum available frequency, if not otherwise
> > required by explicitly capping them.
> >
> > Enforce the correct behavior for RT tasks by setting util_min to max
> > whenever:
> >
> >  1. the task is switched to the RT class and it does not already have a
> >     user-defined clamp value assigned.
> >
> >  2. an RT task is forked from a parent with RESET_ON_FORK set.
> >
> > NOTE: utilization clamp values are cross scheduling class attributes and
> > thus they are never changed/reset once a value has been explicitly
> > defined from user-space.
> >
> > Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > ---
> >  kernel/sched/core.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bdebdabe9bc4..71c9dd6487b1 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1042,6 +1042,28 @@ static int uclamp_validate(struct task_struct *p,
> >  static void __setscheduler_uclamp(struct task_struct *p,
> >                                   const struct sched_attr *attr)
> >  {
> > +       unsigned int clamp_id;
> > +
> > +       /*
> > +        * On scheduling class change, reset to default clamps for tasks
> > +        * without a task-specific value.
> > +        */
> > +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > +               struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
> > +               unsigned int clamp_value = uclamp_none(clamp_id);
> > +
> > +               /* Keep using defined clamps across class changes */
> > +               if (uc_se->user_defined)
> > +                       continue;
> > +
> > +               /* By default, RT tasks always get 100% boost */
> > +               if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > +                       clamp_value = uclamp_none(UCLAMP_MAX);
> > +
> > +               uc_se->bucket_id = uclamp_bucket_id(clamp_value);
> > +               uc_se->value = clamp_value;
> 
> Is it possible for p->uclamp_req[UCLAMP_MAX].value to be less than
> uclamp_none(UCLAMP_MAX) for this RT task? If that's a possibility then
> I think we will end up with a case of
>   p->uclamp_req[UCLAMP_MIN].value > p->uclamp_req[UCLAMP_MAX].value
> after these assignments are done.
> 

The util_max for an RT task can be less then uclamp_none(UCLAMP_MAX),
however, requesting a task specific util_max which is smaller then the
current util_min will fail in:

   __sched_setscheduler()
      uclamp_validate()

since we only allow util_min <= util_max for all task specific values.

> > +       }
> > +
> >         if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> >                 return;
> >
> > @@ -1077,6 +1099,10 @@ static void uclamp_fork(struct task_struct *p)
> >         for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> >                 unsigned int clamp_value = uclamp_none(clamp_id);
> >
> > +               /* By default, RT tasks always get 100% boost */
> > +               if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > +                       clamp_value = uclamp_none(UCLAMP_MAX);
> > +
> >                 p->uclamp_req[clamp_id].user_defined = false;
> >                 p->uclamp_req[clamp_id].value = clamp_value;
> >                 p->uclamp_req[clamp_id].bucket_id = uclamp_bucket_id(clamp_value);
> > --
> > 2.20.1
> >

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply

* Re: [PATCH v8 12/16] sched/core: uclamp: Extend CPU's cgroup controller
From: Patrick Bellasi @ 2019-05-07 11:42 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <CAJuCfpFFSgRUFb9pyckpXWxr-z+mrrhcsLjZiVN5fZMvYC5XxQ@mail.gmail.com>

On 17-Apr 17:12, Suren Baghdasaryan wrote:
> On Tue, Apr 2, 2019 at 3:43 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
> > The cgroup CPU bandwidth controller allows to assign a specified
> > (maximum) bandwidth to the tasks of a group. However this bandwidth is
> > defined and enforced only on a temporal base, without considering the
> > actual frequency a CPU is running on. Thus, the amount of computation
> > completed by a task within an allocated bandwidth can be very different
> > depending on the actual frequency the CPU is running that task.
> > The amount of computation can be affected also by the specific CPU a
> > task is running on, especially when running on asymmetric capacity
> > systems like Arm's big.LITTLE.
> >
> > With the availability of schedutil, the scheduler is now able
> > to drive frequency selections based on actual task utilization.
> > Moreover, the utilization clamping support provides a mechanism to
> > bias the frequency selection operated by schedutil depending on
> > constraints assigned to the tasks currently RUNNABLE on a CPU.
> >
> > Giving the mechanisms described above, it is now possible to extend the
> > cpu controller to specify the minimum (or maximum) utilization which
> > should be considered for tasks RUNNABLE on a cpu.
> > This makes it possible to better defined the actual computational
> > power assigned to task groups, thus improving the cgroup CPU bandwidth
> > controller which is currently based just on time constraints.
> >
> > Extend the CPU controller with a couple of new attributes util.{min,max}
> > which allows to enforce utilization boosting and capping for all the
> > tasks in a group. Specifically:
> >
> > - util.min: defines the minimum utilization which should be considered
> >             i.e. the RUNNABLE tasks of this group will run at least at a
> >                  minimum frequency which corresponds to the util.min
> >                  utilization
> >
> > - util.max: defines the maximum utilization which should be considered
> >             i.e. the RUNNABLE tasks of this group will run up to a
> >                  maximum frequency which corresponds to the util.max
> >                  utilization
> >
> > These attributes:
> >
> > a) are available only for non-root nodes, both on default and legacy
> >    hierarchies, while system wide clamps are defined by a generic
> >    interface which does not depends on cgroups. This system wide
> >    interface enforces constraints on tasks in the root node.
> >
> > b) enforce effective constraints at each level of the hierarchy which
> >    are a restriction of the group requests considering its parent's
> >    effective constraints. Root group effective constraints are defined
> >    by the system wide interface.
> >    This mechanism allows each (non-root) level of the hierarchy to:
> >    - request whatever clamp values it would like to get
> >    - effectively get only up to the maximum amount allowed by its parent
> >
> > c) have higher priority than task-specific clamps, defined via
> >    sched_setattr(), thus allowing to control and restrict task requests
> >
> > Add two new attributes to the cpu controller to collect "requested"
> > clamp values. Allow that at each non-root level of the hierarchy.
> > Validate local consistency by enforcing util.min < util.max.
> > Keep it simple by do not caring now about "effective" values computation
> > and propagation along the hierarchy.
> >
> > Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Tejun Heo <tj@kernel.org>
> >
> > --
> > Changes in v8:
> >  Message-ID: <20190214154817.GN50184@devbig004.ftw2.facebook.com>
> >  - update changelog description for points b), c) and following paragraph
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  27 +++++
> >  init/Kconfig                            |  22 ++++
> >  kernel/sched/core.c                     | 142 +++++++++++++++++++++++-
> >  kernel/sched/sched.h                    |   6 +
> >  4 files changed, 196 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 7bf3f129c68b..47710a77f4fa 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -909,6 +909,12 @@ controller implements weight and absolute bandwidth limit models for
> >  normal scheduling policy and absolute bandwidth allocation model for
> >  realtime scheduling policy.
> >
> > +Cycles distribution is based, by default, on a temporal base and it
> > +does not account for the frequency at which tasks are executed.
> > +The (optional) utilization clamping support allows to enforce a minimum
> > +bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
> > +which should never be exceeded by a CPU.
> > +
> >  WARNING: cgroup2 doesn't yet support control of realtime processes and
> >  the cpu controller can only be enabled when all RT processes are in
> >  the root cgroup.  Be aware that system management software may already
> > @@ -974,6 +980,27 @@ All time durations are in microseconds.
> >         Shows pressure stall information for CPU. See
> >         Documentation/accounting/psi.txt for details.
> >
> > +  cpu.util.min
> > +        A read-write single value file which exists on non-root cgroups.
> > +        The default is "0", i.e. no utilization boosting.
> > +
> > +        The requested minimum utilization in the range [0, 1024].
> > +
> > +        This interface allows reading and setting minimum utilization clamp
> > +        values similar to the sched_setattr(2). This minimum utilization
> > +        value is used to clamp the task specific minimum utilization clamp.
> > +
> > +  cpu.util.max
> > +        A read-write single value file which exists on non-root cgroups.
> > +        The default is "1024". i.e. no utilization capping
> > +
> > +        The requested maximum utilization in the range [0, 1024].
> > +
> > +        This interface allows reading and setting maximum utilization clamp
> > +        values similar to the sched_setattr(2). This maximum utilization
> > +        value is used to clamp the task specific maximum utilization clamp.
> > +
> > +
> >
> >  Memory
> >  ------
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 7439cbf4d02e..33006e8de996 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -877,6 +877,28 @@ config RT_GROUP_SCHED
> >
> >  endif #CGROUP_SCHED
> >
> > +config UCLAMP_TASK_GROUP
> > +       bool "Utilization clamping per group of tasks"
> > +       depends on CGROUP_SCHED
> > +       depends on UCLAMP_TASK
> > +       default n
> > +       help
> > +         This feature enables the scheduler to track the clamped utilization
> > +         of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
> > +
> > +         When this option is enabled, the user can specify a min and max
> > +         CPU bandwidth which is allowed for each single task in a group.
> > +         The max bandwidth allows to clamp the maximum frequency a task
> > +         can use, while the min bandwidth allows to define a minimum
> > +         frequency a task will always use.
> > +
> > +         When task group based utilization clamping is enabled, an eventually
> > +         specified task-specific clamp value is constrained by the cgroup
> > +         specified clamp value. Both minimum and maximum task clamping cannot
> > +         be bigger than the corresponding clamping defined at task group level.
> > +
> > +         If in doubt, say N.
> > +
> >  config CGROUP_PIDS
> >         bool "PIDs controller"
> >         help
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 71c9dd6487b1..aeed2dd315cc 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1130,8 +1130,12 @@ static void __init init_uclamp(void)
> >         /* System defaults allow max clamp values for both indexes */
> >         uc_max.value = uclamp_none(UCLAMP_MAX);
> >         uc_max.bucket_id = uclamp_bucket_id(uc_max.value);
> > -       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> > +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> >                 uclamp_default[clamp_id] = uc_max;
> > +#ifdef CONFIG_UCLAMP_TASK_GROUP
> > +               root_task_group.uclamp_req[clamp_id] = uc_max;
> > +#endif
> > +       }
> >  }
> >
> >  #else /* CONFIG_UCLAMP_TASK */
> > @@ -6720,6 +6724,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
> >  /* task_group_lock serializes the addition/removal of task groups */
> >  static DEFINE_SPINLOCK(task_group_lock);
> >
> > +static inline int alloc_uclamp_sched_group(struct task_group *tg,
> > +                                          struct task_group *parent)
> > +{
> > +#ifdef CONFIG_UCLAMP_TASK_GROUP
> > +       int clamp_id;
> > +
> > +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> > +               tg->uclamp_req[clamp_id] = parent->uclamp_req[clamp_id];
> > +#endif
> > +
> > +       return 1;
> 
> Looks like you never return anything else neither here nor in the
> following patches I think...

That's right, I just preferred to keep the same structure in the
callsite below...

> > +}
> > +
> >  static void sched_free_group(struct task_group *tg)
> >  {
> >         free_fair_sched_group(tg);
> > @@ -6743,6 +6760,9 @@ struct task_group *sched_create_group(struct task_group *parent)
> >         if (!alloc_rt_sched_group(tg, parent))
> >                 goto err;
> >
> > +       if (!alloc_uclamp_sched_group(tg, parent))
> > +               goto err;
> > +

            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

... under the assumption the compiler is smart enough to optimized that.

But perhaps  it's less confusing to just use void, will update in v9.

> >         return tg;
> >
> >  err:
-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply

* [PATCH v7 0/5] namei: resolveat(2) path resolution restriction API
From: Aleksa Sarai @ 2019-05-07 16:43 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Christian Brauner,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linus Torvalds, containers, linux-fsdevel,
	linux-api, linux-kernel, linux-arch

Patch changelog:
  v7:
    * Remove execveat(2) support for these flags since it might
      result in some pretty hairy security issues with setuid binaries.
      There are other avenues we can go down to solve the issues with
      CVE-2019-5736. [Jann]
    * Reserve an additional bit in resolveat(2) for the eXecute access
      mode if we end up implementing it.
  v6:
    * Drop O_* flags API to the new LOOKUP_ path scoping bits and
      instead introduce resolveat(2) as an alternative method of
      obtaining an O_PATH. The justification for this is included in
      patch 6 (though switching back to O_* flags is trivial).
  v5:
    * In response to CVE-2019-5736 (one of the vectors showed that
      open(2)+fexec(3) cannot be used to scope binfmt_script's implicit
      open_exec()), AT_* flags have been re-added and are now piped
      through to binfmt_script (and other binfmt_* that use open_exec)
      but are only supported for execveat(2) for now.
  v4:
    * Remove AT_* flag reservations, as they require more discussion.
    * Switch to path_is_under() over __d_path() for breakout checking.
    * Make O_XDEV no longer block openat("/tmp", "/", O_XDEV) -- dirfd
      is now ignored for absolute paths to match other flags.
    * Improve the dirfd_path_init() refactor and move it to a separate
      commit.
    * Remove reference to Linux-capsicum.
    * Switch "proclink" name to "magic link".
  v3: [resend]
  v2:
    * Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with
      some semi-aggressive __d_path checking (see patch 3).
    * Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
      hopes they can be re-enabled once safe.
    * Removed the selftests as they will be reimplemented as xfstests.
    * Removed stat(2) support, since you can already get it through
      O_PATH and fstatat(2).

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall
resolveat(2) which provides an alternative way to get an O_PATH file
descriptor (the reasoning for doing this is included in patch 6). The
following new LOOKUP_ (and corresponding uapi) flags are added:

  * LOOKUP_XDEV blocks all mountpoint crossings (upwards, downwards, or
    through absolute links). Absolute pathnames alone in openat(2) do
    not trigger this.

  * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
    links. This is done by blocking the usage of nd_jump_link() during
    resolution in a filesystem. The term "magic links" is used to match
    with the only reference to these links in Documentation/, but I'm
    happy to change the name.

    It should be noted that this is different to the scope of
    ~LOOKUP_FOLLOW in that it applies to all path components. However,
    you can do resolveat(NOFOLLOW|NO_MAGICLINKS) on a "magic link" and
    it will *not* fail (assuming that no parent component was a "magic
    link"), and you will have an fd for the "magic link".

  * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
    tree, using techniques such as ".." or absolute links. Absolute
    paths in openat(2) are also disallowed. Conceptually this flag is to
    ensure you "stay below" a certain point in the filesystem tree --
    but this requires some additional to protect against various races
    that would allow escape using ".." (see patch 4 for more detail).

    Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
    can trivially beam you around the filesystem (breaking the
    protection). In future, there might be similar safety checks as in
    patch 4, but that requires more discussion.

In addition, two new flags were added that expand on the above ideas:

  * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
    resolution is allowed at all, including "magic links". Just as with
    LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
    fd for the symlink as long as no parent path had a symlink
    component.

  * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
    blocking attempts to move past the root, forces all such movements
    to be scoped to the starting point. This provides chroot(2)-like
    protection but without the cost of a chroot(2) for each filesystem
    operation, as well as being safe against race attacks that chroot(2)
    is not.

    If a race is detected (as with LOOKUP_BENEATH) then an error is
    generated, and similar to LOOKUP_BENEATH it is not permitted to cross
    "magic links" with LOOKUP_IN_ROOT.

    The primary need for this is from container runtimes, which
    currently need to do symlink scoping in userspace[6] when opening
    paths in a potentially malicious container. There is a long list of
    CVEs that could have bene mitigated by having O_THISROOT (such as
    CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
    CVE-2019-5736, just to name a few).

[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin

Aleksa Sarai (5):
  namei: split out nd->dfd handling to dirfd_path_init
  namei: O_BENEATH-style path resolution flags
  namei: LOOKUP_IN_ROOT: chroot-like path resolution
  namei: aggressively check for nd->root escape on ".." resolution
  namei: resolveat(2) syscall

 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 fs/namei.c                                  | 251 +++++++++++++++-----
 include/linux/namei.h                       |   8 +
 include/uapi/linux/fcntl.h                  |  13 +
 19 files changed, 229 insertions(+), 59 deletions(-)

-- 
2.21.0

^ permalink raw reply

* [PATCH v7 1/5] namei: split out nd->dfd handling to dirfd_path_init
From: Aleksa Sarai @ 2019-05-07 16:43 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Christian Brauner,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linus Torvalds, containers, linux-fsdevel,
	linux-api, linux-kernel, linux-arch
In-Reply-To: <20190507164317.13562-1-cyphar@cyphar.com>

Previously, path_init's handling of *at(dfd, ...) was only done once,
but with O_BENEATH (and O_THISROOT) we have to parse the initial
nd->path at different times (before or after absolute path handling)
depending on whether we have been asked to scope resolution within a
root.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 103 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 59 insertions(+), 44 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5ebd64b21970..2a91b72aa5e9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2166,9 +2166,59 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 	}
 }
 
+/*
+ * Configure nd->path based on the nd->dfd. This is only used as part of
+ * path_init().
+ */
+static inline int dirfd_path_init(struct nameidata *nd)
+{
+	if (nd->dfd == AT_FDCWD) {
+		if (nd->flags & LOOKUP_RCU) {
+			struct fs_struct *fs = current->fs;
+			unsigned seq;
+
+			do {
+				seq = read_seqcount_begin(&fs->seq);
+				nd->path = fs->pwd;
+				nd->inode = nd->path.dentry->d_inode;
+				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+			} while (read_seqcount_retry(&fs->seq, seq));
+		} else {
+			get_fs_pwd(current->fs, &nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+	} else {
+		/* Caller must check execute permissions on the starting path component */
+		struct fd f = fdget_raw(nd->dfd);
+		struct dentry *dentry;
+
+		if (!f.file)
+			return -EBADF;
+
+		dentry = f.file->f_path.dentry;
+
+		if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
+			fdput(f);
+			return -ENOTDIR;
+		}
+
+		nd->path = f.file->f_path;
+		if (nd->flags & LOOKUP_RCU) {
+			nd->inode = nd->path.dentry->d_inode;
+			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		} else {
+			path_get(&nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+		fdput(f);
+	}
+	return 0;
+}
+
 /* must be paired with terminate_walk() */
 static const char *path_init(struct nameidata *nd, unsigned flags)
 {
+	int error;
 	const char *s = nd->name->name;
 
 	if (!*s)
@@ -2202,52 +2252,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 
 	nd->m_seq = read_seqbegin(&mount_lock);
 	if (*s == '/') {
-		set_root(nd);
-		if (likely(!nd_jump_root(nd)))
-			return s;
-		return ERR_PTR(-ECHILD);
-	} else if (nd->dfd == AT_FDCWD) {
-		if (flags & LOOKUP_RCU) {
-			struct fs_struct *fs = current->fs;
-			unsigned seq;
-
-			do {
-				seq = read_seqcount_begin(&fs->seq);
-				nd->path = fs->pwd;
-				nd->inode = nd->path.dentry->d_inode;
-				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
-			} while (read_seqcount_retry(&fs->seq, seq));
-		} else {
-			get_fs_pwd(current->fs, &nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		return s;
-	} else {
-		/* Caller must check execute permissions on the starting path component */
-		struct fd f = fdget_raw(nd->dfd);
-		struct dentry *dentry;
-
-		if (!f.file)
-			return ERR_PTR(-EBADF);
-
-		dentry = f.file->f_path.dentry;
-
-		if (*s && unlikely(!d_can_lookup(dentry))) {
-			fdput(f);
-			return ERR_PTR(-ENOTDIR);
-		}
-
-		nd->path = f.file->f_path;
-		if (flags & LOOKUP_RCU) {
-			nd->inode = nd->path.dentry->d_inode;
-			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-		} else {
-			path_get(&nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		fdput(f);
+		if (likely(!nd->root.mnt))
+			set_root(nd);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			s = ERR_PTR(error);
 		return s;
 	}
+	error = dirfd_path_init(nd);
+	if (unlikely(error))
+		return ERR_PTR(error);
+	return s;
 }
 
 static const char *trailing_symlink(struct nameidata *nd)
-- 
2.21.0

^ permalink raw reply related

* [PATCH v7 2/5] namei: O_BENEATH-style path resolution flags
From: Aleksa Sarai @ 2019-05-07 16:43 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells
  Cc: Aleksa Sarai, Eric Biederman, Christian Brauner, Kees Cook,
	David Drysdale, Andy Lutomirski, Linus Torvalds, Andrew Morton,
	Alexei Starovoitov, Jann Horn, Tycho Andersen, Chanho Min,
	Oleg Nesterov, Aleksa Sarai, containers, linux-fsdevel, linux-api,
	linux-kernel, linux-arch
In-Reply-To: <20190507164317.13562-1-cyphar@cyphar.com>

Add the following flags to allow various restrictions on path
resolution (these affect the *entire* resolution, rather than just the
final path component -- as is the case with most other AT_* flags).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web
servers, archive extraction tools, network file servers, and so on.

These flags are exposed to userspace in a later patchset.

* LOOKUP_XDEV: Disallow mount-point crossing (both *down* into one, or
  *up* from one). The primary "scoping" use is to blocking resolution
  that crosses a bind-mount, which has a similar property to a symlink
  (in the way that it allows for escape from the starting-point). Since
  it is not possible to differentiate bind-mounts However since
  bind-mounting requires privileges (in ways symlinks don't) this has
  been split from LOOKUP_BENEATH. The naming is based on "find -xdev" as
  well as -EXDEV (though find(1) doesn't walk upwards, the semantics
  seem obvious).

* LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" jumping. This is
  a very specific restriction, and it exists because /proc/$pid/fd/...
  "symlinks" allow for access outside nd->root and pose risk to
  container runtimes that don't want to be tricked into accessing a host
  path (but do want to allow no-funny-business symlink resolution).

* LOOKUP_NO_SYMLINKS: Disallows symlink jumping *of any kind*. Implies
  LOOKUP_NO_MAGICLINKS (obviously).

* LOOKUP_BENEATH: Disallow "escapes" from the starting point of the
  filesystem tree during resolution (you must stay "beneath" the
  starting point at all times). Currently this is done by disallowing
  ".." and absolute paths (either in the given path or found during
  symlink resolution) entirely, as well as all "magic link" jumping.

  The wholesale banning of ".." is because it is currently not safe to
  allow ".." resolution (races can cause the path to be moved outside of
  the root -- this is conceptually similar to historical chroot(2)
  escape attacks). Future patches in this series will address this, and
  will re-enable ".." resolution once it is safe. With those patches,
  ".." resolution will only be allowed if it remains in the root
  throughout resolution (such as "a/../b" not "a/../../outside/b").

  The banning of "magic link" jumping is done because it is not clear
  whether semantically they should be allowed -- while some "magic
  links" are safe there are many that can cause escapes (and once a
  resolution is outside of the root, O_BENEATH will no longer detect
  it). Future patches may re-enable "magic link" jumping when such jumps
  would remain inside the root.

The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would
violates their requirement, while the others all return -EXDEV.

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kees Cook <keescook@chromium.org>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 76 ++++++++++++++++++++++++++++++++++++-------
 include/linux/namei.h |  7 ++++
 2 files changed, 72 insertions(+), 11 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 2a91b72aa5e9..e13a02720a9d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -843,6 +843,12 @@ static inline void path_to_nameidata(const struct path *path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+	if (unlikely(nd->flags & LOOKUP_BENEATH))
+		return -EXDEV;
+	if (unlikely(nd->flags & LOOKUP_XDEV)) {
+		if (nd->path.mnt != nd->root.mnt)
+			return -EXDEV;
+	}
 	if (nd->flags & LOOKUP_RCU) {
 		struct dentry *d;
 		nd->path = nd->root;
@@ -1051,6 +1057,9 @@ const char *get_link(struct nameidata *nd)
 	int error;
 	const char *res;
 
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+		return ERR_PTR(-ELOOP);
+
 	if (!(nd->flags & LOOKUP_RCU)) {
 		touch_atime(&last->link);
 		cond_resched();
@@ -1081,14 +1090,23 @@ const char *get_link(struct nameidata *nd)
 		} else {
 			res = get(dentry, inode, &last->done);
 		}
+		/* If we just jumped it was because of a procfs-style link. */
+		if (unlikely(nd->flags & LOOKUP_JUMPED)) {
+			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+				return ERR_PTR(-ELOOP);
+			/* Not currently safe. */
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return ERR_PTR(-EXDEV);
+		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
 	}
 	if (*res == '/') {
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (unlikely(nd_jump_root(nd)))
-			return ERR_PTR(-ECHILD);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
 		while (unlikely(*++res == '/'))
 			;
 	}
@@ -1269,12 +1287,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
 		break;
 	}
 
-	if (need_mntput && path->mnt == mnt)
-		mntput(path->mnt);
+	if (need_mntput) {
+		if (path->mnt == mnt)
+			mntput(path->mnt);
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			ret = -EXDEV;
+		else
+			nd->flags |= LOOKUP_JUMPED;
+	}
 	if (ret == -EISDIR || !ret)
 		ret = 1;
-	if (need_mntput)
-		nd->flags |= LOOKUP_JUMPED;
 	if (unlikely(ret < 0))
 		path_put_conditional(path, nd);
 	return ret;
@@ -1331,6 +1353,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		mounted = __lookup_mnt(path->mnt, path->dentry);
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return false;
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
@@ -1351,8 +1375,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 	struct inode *inode = nd->inode;
 
 	while (1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			struct dentry *old = nd->path.dentry;
 			struct dentry *parent = old->d_parent;
@@ -1377,6 +1404,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			if (&mparent->mnt == nd->path.mnt)
 				break;
+			if (unlikely(nd->flags & LOOKUP_XDEV))
+				return -EXDEV;
 			/* we know that mountpoint was pinned */
 			nd->path.dentry = mountpoint;
 			nd->path.mnt = &mparent->mnt;
@@ -1391,6 +1420,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			return -ECHILD;
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 		nd->path.mnt = &mounted->mnt;
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
@@ -1479,8 +1510,11 @@ static int path_parent_directory(struct path *path)
 static int follow_dotdot(struct nameidata *nd)
 {
 	while(1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			int ret = path_parent_directory(&nd->path);
 			if (ret)
@@ -1489,6 +1523,8 @@ static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
@@ -1703,6 +1739,13 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
+		/*
+		 * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
+		 * cause our parent to have moved outside of the root and us to skip
+		 * over it.
+		 */
+		if (unlikely(nd->flags & LOOKUP_BENEATH))
+			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
 		if (nd->flags & LOOKUP_RCU) {
@@ -2251,6 +2294,15 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.dentry = NULL;
 
 	nd->m_seq = read_seqbegin(&mount_lock);
+
+	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+		nd->root = nd->path;
+		if (!(nd->flags & LOOKUP_RCU))
+			path_get(&nd->root);
+	}
 	if (*s == '/') {
 		if (likely(!nd->root.mnt))
 			set_root(nd);
@@ -2259,9 +2311,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 			s = ERR_PTR(error);
 		return s;
 	}
-	error = dirfd_path_init(nd);
-	if (unlikely(error))
-		return ERR_PTR(error);
+	if (likely(!nd->path.mnt)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
 	return s;
 }
 
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 9138b4471dbf..7bc819ad0cd3 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -50,6 +50,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_EMPTY		0x4000
 #define LOOKUP_DOWN		0x8000
 
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH		0x010000 /* No escaping from starting point. */
+#define LOOKUP_XDEV		0x020000 /* No mountpoint crossing. */
+#define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
+					    Implies LOOKUP_NO_MAGICLINKS. */
+
 extern int path_pts(struct path *path);
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
-- 
2.21.0

^ permalink raw reply related

* [PATCH v7 3/5] namei: LOOKUP_IN_ROOT: chroot-like path resolution
From: Aleksa Sarai @ 2019-05-07 16:43 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells
  Cc: Aleksa Sarai, Eric Biederman, Christian Brauner, Kees Cook,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linus Torvalds, containers, linux-fsdevel,
	linux-api, linux-kernel, linux-arch
In-Reply-To: <20190507164317.13562-1-cyphar@cyphar.com>

The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing
LOOKUP_XDEV and LOOKUP_NO_MAGICLINKS help defend against other potential
attacks in a malicious rootfs scenario.

Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.

[*] At the moment, ".." and "magic link" jumping are disallowed for the
    same reason it is disabled for LOOKUP_BENEATH -- currently it is not
    safe to allow it. Future patches may enable it unconditionally once
    we have resolved the possible races (for "..") and semantics (for
    "magic link" jumping).

The most significant openat(2) semantic change with LOOKUP_IN_ROOT is
that absolute pathnames no longer cause dirfd to be ignored completely.
The rationale is that LOOKUP_IN_ROOT must necessarily chroot-scope
symlinks with absolute paths to dirfd, and so doing it for the base path
seems to be the most consistent behaviour (and also avoids foot-gunning
users who want to scope paths that are absolute).

[1]: https://github.com/cyphar/filepath-securejoin

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 6 +++---
 include/linux/namei.h | 1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e13a02720a9d..3a3cba593b85 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1095,7 +1095,7 @@ const char *get_link(struct nameidata *nd)
 			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
 				return ERR_PTR(-ELOOP);
 			/* Not currently safe. */
-			if (unlikely(nd->flags & LOOKUP_BENEATH))
+			if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
 				return ERR_PTR(-EXDEV);
 		}
 		if (IS_ERR_OR_NULL(res))
@@ -1744,7 +1744,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		 * cause our parent to have moved outside of the root and us to skip
 		 * over it.
 		 */
-		if (unlikely(nd->flags & LOOKUP_BENEATH))
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
 			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
@@ -2295,7 +2295,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 
 	nd->m_seq = read_seqbegin(&mount_lock);
 
-	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+	if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
 			return ERR_PTR(error);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 7bc819ad0cd3..4b1ee717cb14 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -56,6 +56,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
 #define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
 					    Implies LOOKUP_NO_MAGICLINKS. */
+#define LOOKUP_IN_ROOT		0x100000 /* Treat dirfd as %current->fs->root. */
 
 extern int path_pts(struct path *path);
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH v7 4/5] namei: aggressively check for nd->root escape on ".." resolution
From: Aleksa Sarai @ 2019-05-07 16:43 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells
  Cc: Aleksa Sarai, Jann Horn, Kees Cook, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov,
	Christian Brauner, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Aleksa Sarai, Linus Torvalds, containers,
	linux-fsdevel, linux-api, linux-kernel, linux-arch
In-Reply-To: <20190507164317.13562-1-cyphar@cyphar.com>

This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
".." resolution (in the case of LOOKUP_BENEATH the resolution will still
fail if ".." resolution would resolve a path outside of the root --
while LOOKUP_IN_ROOT will chroot(2)-style scope it). "magic link" jumps
are still disallowed entirely because now they could result in
inconsistent behaviour if resolution encounters a subsequent "..".

The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
LOOKUP_IN_ROOT and LOOKUP_BENEATH) where a rename(2) of a path can be
used to "skip over" nd->root and thus escape to the filesystem above
nd->root.

  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      resolveat(dirb, "b/c/../../etc/shadow", RESOLVE_IN_ROOT);

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or "magic link"). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of LOOKUP_IN_ROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).

In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
have occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.

The use of path_is_under() here might seem suspect, but on further
inspection of the most important race (a path was *inside* the root but
is now *outside*), there appears to be no attack potential. If
path_is_under() occurs before the rename, then the path will be resolved
but since the path was originally inside the root there is no escape.
Subsequent ".." jumps are guaranteed to check path_is_under() (by
construction, &rename_lock or &mount_lock must have been taken by the
attacker after path_is_under() returned in the victim), and thus will
not be able to escape from the previously-inside-root path. Walking down
is still safe since the entire subtree was moved (either by rename(2) or
MS_MOVE) and because (as discussed above) walking down is safe.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 3a3cba593b85..2b6a1bf4e745 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
 	struct path	root;
 	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
-	unsigned	seq, m_seq;
+	unsigned	seq, m_seq, r_seq;
 	int		last_type;
 	unsigned	depth;
 	int		total_link_count;
@@ -1739,19 +1739,35 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
-		/*
-		 * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
-		 * cause our parent to have moved outside of the root and us to skip
-		 * over it.
-		 */
-		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
-			return -EXDEV;
+		int error = 0;
+
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (nd->flags & LOOKUP_RCU) {
-			return follow_dotdot_rcu(nd);
-		} else
-			return follow_dotdot(nd);
+		if (nd->flags & LOOKUP_RCU)
+			error = follow_dotdot_rcu(nd);
+		else
+			error = follow_dotdot(nd);
+		if (error)
+			return error;
+
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
+			bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+			bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+			/*
+			 * Don't bother checking unless there's a racing
+			 * rename(2) or MS_MOVE.
+			 */
+			if (likely(!m_retry && !r_retry))
+				return 0;
+
+			if (m_retry && !(nd->flags & LOOKUP_RCU))
+				nd->m_seq = read_seqbegin(&mount_lock);
+			if (r_retry)
+				nd->r_seq = read_seqbegin(&rename_lock);
+			if (!path_is_under(&nd->path, &nd->root))
+				return -EXDEV;
+		}
 	}
 	return 0;
 }
@@ -2272,6 +2288,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
+
+	nd->m_seq = read_seqbegin(&mount_lock);
+	if (unlikely(flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
+		nd->r_seq = read_seqbegin(&rename_lock);
+
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -2282,7 +2303,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 		if (flags & LOOKUP_RCU) {
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 			nd->root_seq = nd->seq;
-			nd->m_seq = read_seqbegin(&mount_lock);
 		} else {
 			path_get(&nd->path);
 		}
@@ -2293,8 +2313,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 
-	nd->m_seq = read_seqbegin(&mount_lock);
-
 	if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
-- 
2.21.0

^ permalink raw reply related

* [PATCH v7 5/5] namei: resolveat(2) syscall
From: Aleksa Sarai @ 2019-05-07 16:43 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells
  Cc: Aleksa Sarai, Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linus Torvalds, containers, linux-fsdevel,
	linux-api, linux-kernel, linux-arch
In-Reply-To: <20190507164317.13562-1-cyphar@cyphar.com>

The most obvious syscall to add support for the new LOOKUP_* scoping
flags would be openat(2) (along with the required execveat(2) change
included in this series). However, there are a few reasons to not do
this:

 * The new LOOKUP_* flags are intended to be security features, and
   openat(2) will silently ignore all unknown flags. This means that
   users would need to avoid foot-gunning themselves constantly when
   using this interface if it were part of openat(2).

 * Resolution scoping feels like a different operation to the existing
   O_* flags. And since openat(2) has limited flag space, it seems to be
   quite wasteful to clutter it with 5 flags that are all
   resolution-related. Arguably O_NOFOLLOw is also a resolution flag but
   its entire purpose is to error out if you encounter a trailing
   symlink not to scope resolution.

 * Other systems would be able to reimplement this syscall allowing for
   cross-OS standardisation rather than being hidden amongst O_* flags
   which may result in it not being used by all the parties that might
   want to use it (file servers, web servers, container runtimes, etc).

 * It gives us the opportunity to iterate on the O_PATH interface in the
   future. There are some potential security improvements that can be
   made to O_PATH (handling /proc/self/fd re-opening of file descriptors
   much more sanely) which could be made even better with some other
   bits (such as ACC_MODE bits which work for O_PATH).

To this end, we introduce the resolveat(2) syscall. At the moment it's
effectively another way of getting a bog-standard O_PATH descriptor but
with the ability to use the new LOOKUP_* flags.

Because resolveat(2) only provides the ability to get O_PATH
descriptors, users will need to get creative with /proc/self/fd in order
to get a usable file descriptor for other uses. However, in future we
can add O_EMPTYPATH support to openat(2) which would allow for
re-opening without procfs (though as mentioned above there are some
security improvements that should be made to the interfaces).

NOTE: This patch adds the syscall to all architectures using the new
      unified syscall numbering, but several architectures are missing
      newer (nr > 423) syscalls -- hence the uneven gaps in the syscall
      tables.

Cc: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 fs/namei.c                                  | 46 +++++++++++++++++++++
 include/uapi/linux/fcntl.h                  | 13 ++++++
 18 files changed, 75 insertions(+)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 63ed39cbd3bd..72f431b1dc9c 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -461,5 +461,6 @@
 530	common	getegid				sys_getegid
 531	common	geteuid				sys_geteuid
 532	common	getppid				sys_getppid
+533	common	resolveat			sys_resolveat
 # all other architectures have common numbers for new syscall, alpha
 # is the exception.
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 9016f4081bb9..1bc0282a67f7 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -437,3 +437,4 @@
 421	common	rt_sigtimedwait_time64		sys_rt_sigtimedwait
 422	common	futex_time64			sys_futex
 423	common	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index ab9cda5f6136..d3ae73ffaf48 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -344,3 +344,4 @@
 332	common	pkey_free			sys_pkey_free
 333	common	rseq				sys_rseq
 # 334 through 423 are reserved to sync up with other architectures
+428	common	resolveat			sys_resolveat
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 125c14178979..81b7389e9e58 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -423,3 +423,4 @@
 421	common	rt_sigtimedwait_time64		sys_rt_sigtimedwait
 422	common	futex_time64			sys_futex
 423	common	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 8ee3a8c18498..626aed10e2b5 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -429,3 +429,4 @@
 421	common	rt_sigtimedwait_time64		sys_rt_sigtimedwait
 422	common	futex_time64			sys_futex
 423	common	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 15f4117900ee..8cbd6032d6bf 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -362,3 +362,4 @@
 421	n32	rt_sigtimedwait_time64		compat_sys_rt_sigtimedwait_time64
 422	n32	futex_time64			sys_futex
 423	n32	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	n32	resolveat			sys_resolveat
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index c85502e67b44..234923a1fc88 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -338,3 +338,4 @@
 327	n64	rseq				sys_rseq
 328	n64	io_pgetevents			sys_io_pgetevents
 # 329 through 423 are reserved to sync up with other architectures
+428	n64	resolveat			sys_resolveat
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 2e063d0f837e..7b4586acf35d 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -411,3 +411,4 @@
 421	o32	rt_sigtimedwait_time64		sys_rt_sigtimedwait		compat_sys_rt_sigtimedwait_time64
 422	o32	futex_time64			sys_futex			sys_futex
 423	o32	sched_rr_get_interval_time64	sys_sched_rr_get_interval	sys_sched_rr_get_interval
+428	o32	resolveat			sys_resolveat			sys_resolveat
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index b26766c6647d..19a9a92dc5f8 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -420,3 +420,4 @@
 421	32	rt_sigtimedwait_time64		sys_rt_sigtimedwait		compat_sys_rt_sigtimedwait_time64
 422	32	futex_time64			sys_futex			sys_futex
 423	32	sched_rr_get_interval_time64	sys_sched_rr_get_interval	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat			sys_resolveat
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index b18abb0c3dae..bfcd75b928de 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -505,3 +505,4 @@
 421	32	rt_sigtimedwait_time64		sys_rt_sigtimedwait		compat_sys_rt_sigtimedwait_time64
 422	32	futex_time64			sys_futex			sys_futex
 423	32	sched_rr_get_interval_time64	sys_sched_rr_get_interval	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat			sys_resolveat
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 02579f95f391..084e51f02e65 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -426,3 +426,4 @@
 421	32	rt_sigtimedwait_time64	-				compat_sys_rt_sigtimedwait_time64
 422	32	futex_time64		-				sys_futex
 423	32	sched_rr_get_interval_time64	-			sys_sched_rr_get_interval
+428	common	resolveat		sys_resolveat			-
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index bfda678576e4..e9115c5cec72 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -426,3 +426,4 @@
 421	common	rt_sigtimedwait_time64		sys_rt_sigtimedwait
 422	common	futex_time64			sys_futex
 423	common	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index b9a5a04b2d2c..2d3fdd913d89 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
 421	32	rt_sigtimedwait_time64		sys_rt_sigtimedwait		compat_sys_rt_sigtimedwait_time64
 422	32	futex_time64			sys_futex			sys_futex
 423	32	sched_rr_get_interval_time64	sys_sched_rr_get_interval	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat			sys_resolveat
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..101feef22473 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	resolveat		sys_resolveat			__ia32_sys_resolveat
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..c7c197bf07bc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	resolveat		__x64_sys_resolveat
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 6af49929de85..86c44f15dfa4 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -394,3 +394,4 @@
 421	common	rt_sigtimedwait_time64		sys_rt_sigtimedwait
 422	common	futex_time64			sys_futex
 423	common	sched_rr_get_interval_time64	sys_sched_rr_get_interval
+428	common	resolveat			sys_resolveat
diff --git a/fs/namei.c b/fs/namei.c
index 2b6a1bf4e745..2cc5b171f6ec 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3656,6 +3656,52 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
 	return filp;
 }
 
+SYSCALL_DEFINE3(resolveat, int, dfd, const char __user *, path,
+		unsigned long, flags)
+{
+	int fd;
+	struct filename *tmp;
+	struct open_flags op = {
+		.open_flag = O_PATH,
+	};
+
+	if (flags & ~VALID_RESOLVE_FLAGS)
+		return -EINVAL;
+
+	if (flags & RESOLVE_CLOEXEC)
+		op.open_flag |= O_CLOEXEC;
+	if (!(flags & RESOLVE_NOFOLLOW))
+		op.lookup_flags |= LOOKUP_FOLLOW;
+	if (flags & RESOLVE_BENEATH)
+		op.lookup_flags |= LOOKUP_BENEATH;
+	if (flags & RESOLVE_XDEV)
+		op.lookup_flags |= LOOKUP_XDEV;
+	if (flags & RESOLVE_NO_MAGICLINKS)
+		op.lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	if (flags & RESOLVE_NO_SYMLINKS)
+		op.lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (flags & RESOLVE_IN_ROOT)
+		op.lookup_flags |= LOOKUP_IN_ROOT;
+
+	tmp = getname(path);
+	if (IS_ERR(tmp))
+		return PTR_ERR(tmp);
+
+	fd = get_unused_fd_flags(op.open_flag);
+	if (fd >= 0) {
+		struct file *f = do_filp_open(dfd, tmp, &op);
+		if (IS_ERR(f)) {
+			put_unused_fd(fd);
+			fd = PTR_ERR(f);
+		} else {
+			fsnotify_open(f);
+			fd_install(fd, f);
+		}
+	}
+	putname(tmp);
+	return fd;
+}
+
 struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 		const char *name, const struct open_flags *op)
 {
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..c57245a21281 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -94,4 +94,17 @@
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
 
+/* Bottom 3 bits of RESOLVE_* are reserved for future ACC_MODE extensions. */
+#define RESOLVE_CLOEXEC		0x008 /* Set O_CLOEXEC on the returned fd. */
+#define RESOLVE_NOFOLLOW	0x010 /* Don't follow trailing symlinks. */
+
+#define RESOLVE_RESOLUTION_TYPE	0x3E0 /* Type of path-resolution scoping we are applying. */
+#define RESOLVE_BENEATH		0x020 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */
+#define RESOLVE_XDEV		0x040 /* - Block mount-point crossings (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS	0x080 /* - Block procfs-style "magic" symlinks. */
+#define RESOLVE_NO_SYMLINKS	0x100 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
+#define RESOLVE_IN_ROOT		0x200 /* - Scope ".." resolution to dirfd (like chroot(2)). */
+
+#define VALID_RESOLVE_FLAGS	(RESOLVE_CLOEXEC | RESOLVE_NOFOLLOW | RESOLVE_RESOLUTION_TYPE)
+
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH v2 02/18] fpga: dfl: fme: remove copy_to_user() in ioctl for PR
From: Moritz Fischer @ 2019-05-07 17:26 UTC (permalink / raw)
  To: Wu Hao; +Cc: atull, mdf, linux-fpga, linux-kernel, linux-api, Xu Yilun
In-Reply-To: <1556528151-17221-3-git-send-email-hao.wu@intel.com>

On Mon, Apr 29, 2019 at 04:55:35PM +0800, Wu Hao wrote:
> This patch removes copy_to_user() code in partial reconfiguration
> ioctl, as it's useless as user never needs to read the data
> structure after ioctl.
> 
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Moritz Fischer <mdf@kernel.org>
> ---
> v2: clean up code split from patch 2 in v1 patchset.
> ---
>  drivers/fpga/dfl-fme-pr.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/fpga/dfl-fme-pr.c b/drivers/fpga/dfl-fme-pr.c
> index d9ca955..6ec0f09 100644
> --- a/drivers/fpga/dfl-fme-pr.c
> +++ b/drivers/fpga/dfl-fme-pr.c
> @@ -159,9 +159,6 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
>  	mutex_unlock(&pdata->lock);
>  free_exit:
>  	vfree(buf);
> -	if (copy_to_user((void __user *)arg, &port_pr, minsz))
> -		return -EFAULT;
> -
>  	return ret;
>  }
>  
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH v2 03/18] fpga: dfl: fme: align PR buffer size per PR datawidth
From: Moritz Fischer @ 2019-05-07 17:27 UTC (permalink / raw)
  To: Wu Hao; +Cc: atull, mdf, linux-fpga, linux-kernel, linux-api, Xu Yilun
In-Reply-To: <1556528151-17221-4-git-send-email-hao.wu@intel.com>

On Mon, Apr 29, 2019 at 04:55:36PM +0800, Wu Hao wrote:
> Current driver checks if input bitstream file size is aligned or
> not per PR data width (default 32bits). It requires one additional
> step for end user when they generate the bitstream file, padding
> extra zeros to bitstream file to align its size per PR data width,
> but they don't have to as hardware will drop extra padding bytes
> automatically.
> 
> In order to simplify the user steps, this patch aligns PR buffer
> size per PR data width in driver, to allow user to pass unaligned
> size bitstream files to driver.
> 
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Alan Tull <atull@kernel.org>
Acked-by: Moritz Fischer <mdf@kernel.org>
> ---
>  drivers/fpga/dfl-fme-pr.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/fpga/dfl-fme-pr.c b/drivers/fpga/dfl-fme-pr.c
> index 6ec0f09..3c71dc3 100644
> --- a/drivers/fpga/dfl-fme-pr.c
> +++ b/drivers/fpga/dfl-fme-pr.c
> @@ -74,6 +74,7 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
>  	struct dfl_fme *fme;
>  	unsigned long minsz;
>  	void *buf = NULL;
> +	size_t length;
>  	int ret = 0;
>  	u64 v;
>  
> @@ -85,9 +86,6 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
>  	if (port_pr.argsz < minsz || port_pr.flags)
>  		return -EINVAL;
>  
> -	if (!IS_ALIGNED(port_pr.buffer_size, 4))
> -		return -EINVAL;
> -
>  	/* get fme header region */
>  	fme_hdr = dfl_get_feature_ioaddr_by_id(&pdev->dev,
>  					       FME_FEATURE_ID_HEADER);
> @@ -103,7 +101,13 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
>  		       port_pr.buffer_size))
>  		return -EFAULT;
>  
> -	buf = vmalloc(port_pr.buffer_size);
> +	/*
> +	 * align PR buffer per PR bandwidth, as HW ignores the extra padding
> +	 * data automatically.
> +	 */
> +	length = ALIGN(port_pr.buffer_size, 4);
> +
> +	buf = vmalloc(length);
>  	if (!buf)
>  		return -ENOMEM;
>  
> @@ -140,7 +144,7 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
>  	fpga_image_info_free(region->info);
>  
>  	info->buf = buf;
> -	info->count = port_pr.buffer_size;
> +	info->count = length;
>  	info->region_id = port_pr.port_id;
>  	region->info = info;
>  
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH v2 06/18] fpga: dfl: fme: add DFL_FPGA_FME_PORT_RELEASE/ASSIGN ioctl support.
From: Moritz Fischer @ 2019-05-07 17:33 UTC (permalink / raw)
  To: Wu Hao
  Cc: atull, mdf, linux-fpga, linux-kernel, linux-api, Zhang Yi Z,
	Xu Yilun
In-Reply-To: <1556528151-17221-7-git-send-email-hao.wu@intel.com>

On Mon, Apr 29, 2019 at 04:55:39PM +0800, Wu Hao wrote:
> In order to support virtualization usage via PCIe SRIOV, this patch
> adds two ioctls under FPGA Management Engine (FME) to release and
> assign back the port device. In order to safely turn Port from PF
> into VF and enable PCIe SRIOV, it requires user to invoke this
> PORT_RELEASE ioctl to release port firstly to remove userspace
> interfaces, and then configure the PF/VF access register in FME.
> After disable SRIOV, it requires user to invoke this PORT_ASSIGN
> ioctl to attach the port back to PF.
> 
>  Ioctl interfaces:
>  * DFL_FPGA_FME_PORT_RELEASE
>    Release platform device of given port, it deletes port platform
>    device to remove related userspace interfaces on PF, then
>    configures PF/VF access mode to VF.
> 
>  * DFL_FPGA_FME_PORT_ASSIGN
>    Assign platform device of given port back to PF, it configures
>    PF/VF access mode to PF, then adds port platform device back to
>    re-enable related userspace interfaces on PF.
> 
> Signed-off-by: Zhang Yi Z <yi.z.zhang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Alan Tull <atull@kernel.org>
Acked-by: Moritz Fischer <mdf@kernel.org>
> ---
>  drivers/fpga/dfl-fme-main.c   |  54 +++++++++++++++++++++
>  drivers/fpga/dfl.c            | 107 +++++++++++++++++++++++++++++++++++++-----
>  drivers/fpga/dfl.h            |  10 ++++
>  include/uapi/linux/fpga-dfl.h |  32 +++++++++++++
>  4 files changed, 191 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> index 076d74f..8b2a337 100644
> --- a/drivers/fpga/dfl-fme-main.c
> +++ b/drivers/fpga/dfl-fme-main.c
> @@ -16,6 +16,7 @@
>  
>  #include <linux/kernel.h>
>  #include <linux/module.h>
> +#include <linux/uaccess.h>
>  #include <linux/fpga-dfl.h>
>  
>  #include "dfl.h"
> @@ -105,9 +106,62 @@ static void fme_hdr_uinit(struct platform_device *pdev,
>  	sysfs_remove_files(&pdev->dev.kobj, fme_hdr_attrs);
>  }
>  
> +static long fme_hdr_ioctl_release_port(struct dfl_feature_platform_data *pdata,
> +				       void __user *arg)
> +{
> +	struct dfl_fpga_cdev *cdev = pdata->dfl_cdev;
> +	struct dfl_fpga_fme_port_release release;
> +	unsigned long minsz;
> +
> +	minsz = offsetofend(struct dfl_fpga_fme_port_release, port_id);
> +
> +	if (copy_from_user(&release, arg, minsz))
> +		return -EFAULT;
> +
> +	if (release.argsz < minsz || release.flags)
> +		return -EINVAL;
> +
> +	return dfl_fpga_cdev_config_port(cdev, release.port_id, true);
> +}
> +
> +static long fme_hdr_ioctl_assign_port(struct dfl_feature_platform_data *pdata,
> +				      void __user *arg)
> +{
> +	struct dfl_fpga_cdev *cdev = pdata->dfl_cdev;
> +	struct dfl_fpga_fme_port_assign assign;
> +	unsigned long minsz;
> +
> +	minsz = offsetofend(struct dfl_fpga_fme_port_assign, port_id);
> +
> +	if (copy_from_user(&assign, arg, minsz))
> +		return -EFAULT;
> +
> +	if (assign.argsz < minsz || assign.flags)
> +		return -EINVAL;
> +
> +	return dfl_fpga_cdev_config_port(cdev, assign.port_id, false);
> +}
> +
> +static long fme_hdr_ioctl(struct platform_device *pdev,
> +			  struct dfl_feature *feature,
> +			  unsigned int cmd, unsigned long arg)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(&pdev->dev);
> +
> +	switch (cmd) {
> +	case DFL_FPGA_FME_PORT_RELEASE:
> +		return fme_hdr_ioctl_release_port(pdata, (void __user *)arg);
> +	case DFL_FPGA_FME_PORT_ASSIGN:
> +		return fme_hdr_ioctl_assign_port(pdata, (void __user *)arg);
> +	}
> +
> +	return -ENODEV;
> +}
> +
>  static const struct dfl_feature_ops fme_hdr_ops = {
>  	.init = fme_hdr_init,
>  	.uinit = fme_hdr_uinit,
> +	.ioctl = fme_hdr_ioctl,
>  };
>  
>  static struct dfl_feature_driver fme_feature_drvs[] = {
> diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
> index 2c09e50..a6b6d38 100644
> --- a/drivers/fpga/dfl.c
> +++ b/drivers/fpga/dfl.c
> @@ -224,16 +224,20 @@ void dfl_fpga_port_ops_del(struct dfl_fpga_port_ops *ops)
>   */
>  int dfl_fpga_check_port_id(struct platform_device *pdev, void *pport_id)
>  {
> -	struct dfl_fpga_port_ops *port_ops = dfl_fpga_port_ops_get(pdev);
> -	int port_id;
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(&pdev->dev);
> +	struct dfl_fpga_port_ops *port_ops;
> +
> +	if (pdata->id != FEATURE_DEV_ID_UNUSED)
> +		return pdata->id == *(int *)pport_id;
>  
> +	port_ops = dfl_fpga_port_ops_get(pdev);
>  	if (!port_ops || !port_ops->get_id)
>  		return 0;
>  
> -	port_id = port_ops->get_id(pdev);
> +	pdata->id = port_ops->get_id(pdev);
>  	dfl_fpga_port_ops_put(port_ops);
>  
> -	return port_id == *(int *)pport_id;
> +	return pdata->id == *(int *)pport_id;
>  }
>  EXPORT_SYMBOL_GPL(dfl_fpga_check_port_id);
>  
> @@ -462,6 +466,7 @@ static int build_info_commit_dev(struct build_feature_devs_info *binfo)
>  	pdata->dev = fdev;
>  	pdata->num = binfo->feature_num;
>  	pdata->dfl_cdev = binfo->cdev;
> +	pdata->id = FEATURE_DEV_ID_UNUSED;
>  	mutex_init(&pdata->lock);
>  
>  	/*
> @@ -959,25 +964,27 @@ void dfl_fpga_feature_devs_remove(struct dfl_fpga_cdev *cdev)
>  {
>  	struct dfl_feature_platform_data *pdata, *ptmp;
>  
> -	remove_feature_devs(cdev);
> -
>  	mutex_lock(&cdev->lock);
> -	if (cdev->fme_dev) {
> -		/* the fme should be unregistered. */
> -		WARN_ON(device_is_registered(cdev->fme_dev));
> +	if (cdev->fme_dev)
>  		put_device(cdev->fme_dev);
> -	}
>  
>  	list_for_each_entry_safe(pdata, ptmp, &cdev->port_dev_list, node) {
>  		struct platform_device *port_dev = pdata->dev;
>  
> -		/* the port should be unregistered. */
> -		WARN_ON(device_is_registered(&port_dev->dev));
> +		/* remove released ports */
> +		if (!device_is_registered(&port_dev->dev)) {
> +			dfl_id_free(feature_dev_id_type(port_dev),
> +				    port_dev->id);
> +			platform_device_put(port_dev);
> +		}
> +
>  		list_del(&pdata->node);
>  		put_device(&port_dev->dev);
>  	}
>  	mutex_unlock(&cdev->lock);
>  
> +	remove_feature_devs(cdev);
> +
>  	fpga_region_unregister(cdev->region);
>  	devm_kfree(cdev->parent, cdev);
>  }
> @@ -1015,6 +1022,82 @@ struct platform_device *
>  }
>  EXPORT_SYMBOL_GPL(__dfl_fpga_cdev_find_port);
>  
> +static int attach_port_dev(struct dfl_fpga_cdev *cdev, u32 port_id)
> +{
> +	struct platform_device *port_pdev;
> +	int ret = -ENODEV;
> +
> +	mutex_lock(&cdev->lock);
> +	port_pdev = __dfl_fpga_cdev_find_port(cdev, &port_id,
> +					      dfl_fpga_check_port_id);
> +	if (!port_pdev)
> +		goto unlock_exit;
> +
> +	if (device_is_registered(&port_pdev->dev)) {
> +		ret = -EBUSY;
> +		goto put_dev_exit;
> +	}
> +
> +	ret = platform_device_add(port_pdev);
> +	if (ret)
> +		goto put_dev_exit;
> +
> +	dfl_feature_dev_use_end(dev_get_platdata(&port_pdev->dev));
> +	cdev->released_port_num--;
> +put_dev_exit:
> +	put_device(&port_pdev->dev);
> +unlock_exit:
> +	mutex_unlock(&cdev->lock);
> +	return ret;
> +}
> +
> +static int detach_port_dev(struct dfl_fpga_cdev *cdev, u32 port_id)
> +{
> +	struct platform_device *port_pdev;
> +	int ret = -ENODEV;
> +
> +	mutex_lock(&cdev->lock);
> +	port_pdev = __dfl_fpga_cdev_find_port(cdev, &port_id,
> +					      dfl_fpga_check_port_id);
> +	if (!port_pdev)
> +		goto unlock_exit;
> +
> +	if (!device_is_registered(&port_pdev->dev)) {
> +		ret = -EBUSY;
> +		goto put_dev_exit;
> +	}
> +
> +	ret = dfl_feature_dev_use_begin(dev_get_platdata(&port_pdev->dev));
> +	if (ret)
> +		goto put_dev_exit;
> +
> +	platform_device_del(port_pdev);
> +	cdev->released_port_num++;
> +put_dev_exit:
> +	put_device(&port_pdev->dev);
> +unlock_exit:
> +	mutex_unlock(&cdev->lock);
> +	return ret;
> +}
> +
> +/**
> + * dfl_fpga_cdev_config_port - configure a port feature dev
> + * @cdev: parent container device.
> + * @port_id: id of the port feature device.
> + * @release: release port or assign port back.
> + *
> + * This function allows user to release port platform device or assign it back.
> + * e.g. to safely turn one port from PF into VF for PCI device SRIOV support,
> + * release port platform device is one necessary step.
> + */
> +int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev,
> +			      u32 port_id, bool release)
> +{
> +	return release ? detach_port_dev(cdev, port_id) :
> +			 attach_port_dev(cdev, port_id);
> +}
> +EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);
> +
>  static int __init dfl_fpga_init(void)
>  {
>  	int ret;
> diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
> index 8851c6c..63f39ab 100644
> --- a/drivers/fpga/dfl.h
> +++ b/drivers/fpga/dfl.h
> @@ -183,6 +183,8 @@ struct dfl_feature {
>  
>  #define DEV_STATUS_IN_USE	0
>  
> +#define FEATURE_DEV_ID_UNUSED	(-1)
> +
>  /**
>   * struct dfl_feature_platform_data - platform data for feature devices
>   *
> @@ -191,6 +193,7 @@ struct dfl_feature {
>   * @cdev: cdev of feature dev.
>   * @dev: ptr to platform device linked with this platform data.
>   * @dfl_cdev: ptr to container device.
> + * @id: id used for this feature device.
>   * @disable_count: count for port disable.
>   * @num: number for sub features.
>   * @dev_status: dev status (e.g. DEV_STATUS_IN_USE).
> @@ -203,6 +206,7 @@ struct dfl_feature_platform_data {
>  	struct cdev cdev;
>  	struct platform_device *dev;
>  	struct dfl_fpga_cdev *dfl_cdev;
> +	int id;
>  	unsigned int disable_count;
>  	unsigned long dev_status;
>  	void *private;
> @@ -378,6 +382,7 @@ int dfl_fpga_enum_info_add_dfl(struct dfl_fpga_enum_info *info,
>   * @fme_dev: FME feature device under this container device.
>   * @lock: mutex lock to protect the port device list.
>   * @port_dev_list: list of all port feature devices under this container device.
> + * @released_port_num: released port number under this container device.
>   */
>  struct dfl_fpga_cdev {
>  	struct device *parent;
> @@ -385,6 +390,7 @@ struct dfl_fpga_cdev {
>  	struct device *fme_dev;
>  	struct mutex lock;
>  	struct list_head port_dev_list;
> +	int released_port_num;
>  };
>  
>  struct dfl_fpga_cdev *
> @@ -412,4 +418,8 @@ struct platform_device *
>  
>  	return pdev;
>  }
> +
> +int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev,
> +			      u32 port_id, bool release);
> +
>  #endif /* __FPGA_DFL_H */
> diff --git a/include/uapi/linux/fpga-dfl.h b/include/uapi/linux/fpga-dfl.h
> index 2e324e5..e9a00e0 100644
> --- a/include/uapi/linux/fpga-dfl.h
> +++ b/include/uapi/linux/fpga-dfl.h
> @@ -176,4 +176,36 @@ struct dfl_fpga_fme_port_pr {
>  
>  #define DFL_FPGA_FME_PORT_PR	_IO(DFL_FPGA_MAGIC, DFL_FME_BASE + 0)
>  
> +/**
> + * DFL_FPGA_FME_PORT_RELEASE - _IOW(DFL_FPGA_MAGIC, DFL_FME_BASE + 1,
> + *					struct dfl_fpga_fme_port_release)
> + *
> + * Driver releases the port per Port ID provided by caller.
> + * Return: 0 on success, -errno on failure.
> + */
> +struct dfl_fpga_fme_port_release {
> +	/* Input */
> +	__u32 argsz;		/* Structure length */
> +	__u32 flags;		/* Zero for now */
> +	__u32 port_id;
> +};
> +
> +#define DFL_FPGA_FME_PORT_RELEASE	_IO(DFL_FPGA_MAGIC, DFL_FME_BASE + 1)
> +
> +/**
> + * DFL_FPGA_FME_PORT_ASSIGN - _IOW(DFL_FPGA_MAGIC, DFL_FME_BASE + 2,
> + *					struct dfl_fpga_fme_port_assign)
> + *
> + * Driver assigns the port back per Port ID provided by caller.
> + * Return: 0 on success, -errno on failure.
> + */
> +struct dfl_fpga_fme_port_assign {
> +	/* Input */
> +	__u32 argsz;		/* Structure length */
> +	__u32 flags;		/* Zero for now */
> +	__u32 port_id;
> +};
> +
> +#define DFL_FPGA_FME_PORT_ASSIGN	_IO(DFL_FPGA_MAGIC, DFL_FME_BASE + 2)
> +
>  #endif /* _UAPI_LINUX_FPGA_DFL_H */
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH v2 07/18] fpga: dfl: pci: enable SRIOV support.
From: Moritz Fischer @ 2019-05-07 17:35 UTC (permalink / raw)
  To: Wu Hao
  Cc: atull, mdf, linux-fpga, linux-kernel, linux-api, Zhang Yi Z,
	Xu Yilun
In-Reply-To: <1556528151-17221-8-git-send-email-hao.wu@intel.com>

On Mon, Apr 29, 2019 at 04:55:40PM +0800, Wu Hao wrote:
> This patch enables the standard sriov support. It allows user to
> enable SRIOV (and VFs), then user could pass through accelerators
> (VFs) into virtual machine or use VFs directly in host.
> 
> Signed-off-by: Zhang Yi Z <yi.z.zhang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Alan Tull <atull@kernel.org>
Acked-by: Moritz Fischer <mdf@kernel.org>
> ---
>  drivers/fpga/dfl-pci.c | 40 ++++++++++++++++++++++++++++++++++++++++
>  drivers/fpga/dfl.c     | 41 +++++++++++++++++++++++++++++++++++++++++
>  drivers/fpga/dfl.h     |  1 +
>  3 files changed, 82 insertions(+)
> 
> diff --git a/drivers/fpga/dfl-pci.c b/drivers/fpga/dfl-pci.c
> index 66b5720..2fa571b 100644
> --- a/drivers/fpga/dfl-pci.c
> +++ b/drivers/fpga/dfl-pci.c
> @@ -223,8 +223,46 @@ int cci_pci_probe(struct pci_dev *pcidev, const struct pci_device_id *pcidevid)
>  	return ret;
>  }
>  
> +static int cci_pci_sriov_configure(struct pci_dev *pcidev, int num_vfs)
> +{
> +	struct cci_drvdata *drvdata = pci_get_drvdata(pcidev);
> +	struct dfl_fpga_cdev *cdev = drvdata->cdev;
> +	int ret = 0;
> +
> +	mutex_lock(&cdev->lock);
> +
> +	if (!num_vfs) {
> +		/*
> +		 * disable SRIOV and then put released ports back to default
> +		 * PF access mode.
> +		 */
> +		pci_disable_sriov(pcidev);
> +
> +		__dfl_fpga_cdev_config_port_vf(cdev, false);
> +
> +	} else if (cdev->released_port_num == num_vfs) {
> +		/*
> +		 * only enable SRIOV if cdev has matched released ports, put
> +		 * released ports into VF access mode firstly.
> +		 */
> +		__dfl_fpga_cdev_config_port_vf(cdev, true);
> +
> +		ret = pci_enable_sriov(pcidev, num_vfs);
> +		if (ret)
> +			__dfl_fpga_cdev_config_port_vf(cdev, false);
> +	} else {
> +		ret = -EINVAL;
> +	}
> +
> +	mutex_unlock(&cdev->lock);
> +	return ret;
> +}
> +
>  static void cci_pci_remove(struct pci_dev *pcidev)
>  {
> +	if (dev_is_pf(&pcidev->dev))
> +		cci_pci_sriov_configure(pcidev, 0);
> +
>  	cci_remove_feature_devs(pcidev);
>  	pci_disable_pcie_error_reporting(pcidev);
>  }
> @@ -234,6 +272,7 @@ static void cci_pci_remove(struct pci_dev *pcidev)
>  	.id_table = cci_pcie_id_tbl,
>  	.probe = cci_pci_probe,
>  	.remove = cci_pci_remove,
> +	.sriov_configure = cci_pci_sriov_configure,
>  };
>  
>  module_pci_driver(cci_pci_driver);
> @@ -241,3 +280,4 @@ static void cci_pci_remove(struct pci_dev *pcidev)
>  MODULE_DESCRIPTION("FPGA DFL PCIe Device Driver");
>  MODULE_AUTHOR("Intel Corporation");
>  MODULE_LICENSE("GPL v2");
> +MODULE_VERSION(DRV_VERSION);
> diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
> index a6b6d38..c5aa287 100644
> --- a/drivers/fpga/dfl.c
> +++ b/drivers/fpga/dfl.c
> @@ -1098,6 +1098,47 @@ int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev,
>  }
>  EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);
>  
> +static void config_port_vf(struct device *fme_dev, int port_id, bool is_vf)
> +{
> +	void __iomem *base;
> +	u64 v;
> +
> +	base = dfl_get_feature_ioaddr_by_id(fme_dev, FME_FEATURE_ID_HEADER);
> +
> +	v = readq(base + FME_HDR_PORT_OFST(port_id));
> +
> +	v &= ~FME_PORT_OFST_ACC_CTRL;
> +	v |= FIELD_PREP(FME_PORT_OFST_ACC_CTRL,
> +			is_vf ? FME_PORT_OFST_ACC_VF : FME_PORT_OFST_ACC_PF);
> +
> +	writeq(v, base + FME_HDR_PORT_OFST(port_id));
> +}
> +
> +/**
> + * __dfl_fpga_cdev_config_port_vf - configure port to VF access mode
> + *
> + * @cdev: parent container device.
> + * @if_vf: true for VF access mode, and false for PF access mode
> + *
> + * Return: 0 on success, negative error code otherwise.
> + *
> + * This function is needed in sriov configuration routine. It could be used to
> + * configures the released ports access mode to VF or PF.
> + * The caller needs to hold lock for protection.
> + */
> +void __dfl_fpga_cdev_config_port_vf(struct dfl_fpga_cdev *cdev, bool is_vf)
> +{
> +	struct dfl_feature_platform_data *pdata;
> +
> +	list_for_each_entry(pdata, &cdev->port_dev_list, node) {
> +		if (device_is_registered(&pdata->dev->dev))
> +			continue;
> +
> +		config_port_vf(cdev->fme_dev, pdata->id, is_vf);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(__dfl_fpga_cdev_config_port_vf);
> +
>  static int __init dfl_fpga_init(void)
>  {
>  	int ret;
> diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
> index 63f39ab..1350e8e 100644
> --- a/drivers/fpga/dfl.h
> +++ b/drivers/fpga/dfl.h
> @@ -421,5 +421,6 @@ struct platform_device *
>  
>  int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev,
>  			      u32 port_id, bool release);
> +void __dfl_fpga_cdev_config_port_vf(struct dfl_fpga_cdev *cdev, bool is_vf);
>  
>  #endif /* __FPGA_DFL_H */
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH v2 15/18] fpga: dfl: fme: add thermal management support
From: Alan Tull @ 2019-05-07 18:20 UTC (permalink / raw)
  To: Wu Hao
  Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Luwei Kang,
	Russ Weight, Xu Yilun, Jean Delvare, Guenter Roeck,
	Linux HWMON List
In-Reply-To: <1556528151-17221-16-git-send-email-hao.wu@intel.com>

On Mon, Apr 29, 2019 at 4:13 AM Wu Hao <hao.wu@intel.com> wrote:

+ The hwmon people

>
> This patch adds support to thermal management private feature for DFL
> FPGA Management Engine (FME). This private feature driver registers
> a hwmon for thermal/temperature monitoring (hwmon temp1_input).
> If hardware automatic throttling is supported by this hardware, then
> driver also exposes sysfs interfaces under hwmon for thresholds
> (temp1_alarm/ crit/ emergency), threshold status (temp1_alarm_status/
> temp1_crit_status) and throttling policy (temp1_alarm_policy).
>
> Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> Signed-off-by: Russ Weight <russell.h.weight@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> ---
> v2: create a dfl_fme_thermal hwmon to expose thermal information.
>     move all sysfs interfaces under hwmon
>         tempareture       --> hwmon temp1_input
>         threshold1        --> hwmon temp1_alarm
>         threshold2        --> hwmon temp1_crit
>         trip_threshold    --> hwmon temp1_emergency
>         threshold1_status --> hwmon temp1_alarm_status
>         threshold2_status --> hwmon temp1_crit_status
>         threshold1_policy --> hwmon temp1_alarm_policy
> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-fme |  64 +++++++
>  drivers/fpga/Kconfig                             |   2 +-
>  drivers/fpga/dfl-fme-main.c                      | 212 +++++++++++++++++++++++
>  3 files changed, 277 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> index d1aa375..dfbd315 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> @@ -44,3 +44,67 @@ Description: Read-only. It returns socket_id to indicate which socket
>                 this FPGA belongs to, only valid for integrated solution.
>                 User only needs this information, in case standard numa node
>                 can't provide correct information.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/name
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Only. Read this file to get the name of hwmon device, it
> +               supports values:
> +                   'dfl_fme_thermal' - thermal hwmon device name
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Only. It returns FPGA device temperature in millidegrees
> +               Celsius.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_alarm
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Only. It returns hardware threshold1 temperature in
> +               millidegrees Celsius. If temperature rises at or above this
> +               threshold, hardware starts 50% or 90% throttling (see
> +               'temp1_alarm_policy').
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Only. It returns hardware threshold2 temperature in
> +               millidegrees Celsius. If temperature rises at or above this
> +               threshold, hardware starts 100% throttling.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_emergency
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Only. It returns hardware trip threshold temperature in
> +               millidegrees Celsius. If temperature rises at or above this
> +               threshold, a fatal event will be triggered to board management
> +               controller (BMC) to shutdown FPGA.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_alarm_status
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns 1 if temperature is currently at or above
> +               hardware threshold1 (see 'temp1_alarm'), otherwise 0.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit_status
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns 1 if temperature is currently at or above
> +               hardware threshold2 (see 'temp1_crit'), otherwise 0.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_alarm_policy
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Only. Read this file to get the policy of hardware threshold1
> +               (see 'temp1_alarm'). It only supports two values (policies):
> +                   0 - AP2 state (90% throttling)
> +                   1 - AP1 state (50% throttling)
> diff --git a/drivers/fpga/Kconfig b/drivers/fpga/Kconfig
> index c20445b..a6d7588 100644
> --- a/drivers/fpga/Kconfig
> +++ b/drivers/fpga/Kconfig
> @@ -154,7 +154,7 @@ config FPGA_DFL
>
>  config FPGA_DFL_FME
>         tristate "FPGA DFL FME Driver"
> -       depends on FPGA_DFL
> +       depends on FPGA_DFL && HWMON
>         help
>           The FPGA Management Engine (FME) is a feature device implemented
>           under Device Feature List (DFL) framework. Select this option to
> diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> index 8339ee8..b9a68b8 100644
> --- a/drivers/fpga/dfl-fme-main.c
> +++ b/drivers/fpga/dfl-fme-main.c
> @@ -14,6 +14,8 @@
>   *   Henry Mitchel <henry.mitchel@intel.com>
>   */
>
> +#include <linux/hwmon.h>
> +#include <linux/hwmon-sysfs.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>  #include <linux/uaccess.h>
> @@ -217,6 +219,212 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
>         .ioctl = fme_hdr_ioctl,
>  };
>
> +#define FME_THERM_THRESHOLD    0x8
> +#define TEMP_THRESHOLD1                GENMASK_ULL(6, 0)
> +#define TEMP_THRESHOLD1_EN     BIT_ULL(7)
> +#define TEMP_THRESHOLD2                GENMASK_ULL(14, 8)
> +#define TEMP_THRESHOLD2_EN     BIT_ULL(15)
> +#define TRIP_THRESHOLD         GENMASK_ULL(30, 24)
> +#define TEMP_THRESHOLD1_STATUS BIT_ULL(32)             /* threshold1 reached */
> +#define TEMP_THRESHOLD2_STATUS BIT_ULL(33)             /* threshold2 reached */
> +/* threshold1 policy: 0 - AP2 (90% throttle) / 1 - AP1 (50% throttle) */
> +#define TEMP_THRESHOLD1_POLICY BIT_ULL(44)
> +
> +#define FME_THERM_RDSENSOR_FMT1        0x10
> +#define FPGA_TEMPERATURE       GENMASK_ULL(6, 0)
> +
> +#define FME_THERM_CAP          0x20
> +#define THERM_NO_THROTTLE      BIT_ULL(0)
> +
> +#define MD_PRE_DEG
> +
> +static bool fme_thermal_throttle_support(void __iomem *base)
> +{
> +       u64 v = readq(base + FME_THERM_CAP);
> +
> +       return FIELD_GET(THERM_NO_THROTTLE, v) ? false : true;
> +}
> +
> +static umode_t thermal_hwmon_attrs_visible(const void *drvdata,
> +                                          enum hwmon_sensor_types type,
> +                                          u32 attr, int channel)
> +{
> +       const struct dfl_feature *feature = drvdata;
> +
> +       /* temperature is always supported, and check hardware cap for others */
> +       if (attr == hwmon_temp_input)
> +               return 0444;
> +
> +       return fme_thermal_throttle_support(feature->ioaddr) ? 0444 : 0;
> +}
> +
> +static int thermal_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
> +                             u32 attr, int channel, long *val)
> +{
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       u64 v;
> +
> +       switch (attr) {
> +       case hwmon_temp_input:
> +               v = readq(feature->ioaddr + FME_THERM_RDSENSOR_FMT1);
> +               *val = (long)(FIELD_GET(FPGA_TEMPERATURE, v) * 1000);
> +               break;
> +       case hwmon_temp_alarm:
> +               v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +               *val = (long)(FIELD_GET(TEMP_THRESHOLD1, v) * 1000);
> +               break;
> +       case hwmon_temp_crit:
> +               v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +               *val = (long)(FIELD_GET(TEMP_THRESHOLD2, v) * 1000);
> +               break;
> +       case hwmon_temp_emergency:
> +               v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +               *val = (long)(FIELD_GET(TRIP_THRESHOLD, v) * 1000);
> +               break;
> +       default:
> +               return -EOPNOTSUPP;
> +       }
> +
> +       return 0;
> +}
> +
> +static const struct hwmon_ops thermal_hwmon_ops = {
> +       .is_visible = thermal_hwmon_attrs_visible,
> +       .read = thermal_hwmon_read,
> +};
> +
> +static const u32 thermal_hwmon_temp_config[] = {
> +       HWMON_T_INPUT | HWMON_T_ALARM | HWMON_T_CRIT | HWMON_T_EMERGENCY,
> +       0
> +};
> +
> +static const struct hwmon_channel_info hwmon_temp_info = {
> +       .type = hwmon_temp,
> +       .config = thermal_hwmon_temp_config,
> +};
> +
> +static const struct hwmon_channel_info *thermal_hwmon_info[] = {
> +       &hwmon_temp_info,
> +       NULL
> +};
> +
> +static const struct hwmon_chip_info thermal_hwmon_chip_info = {
> +       .ops = &thermal_hwmon_ops,
> +       .info = thermal_hwmon_info,
> +};
> +
> +static ssize_t temp1_alarm_status_show(struct device *dev,
> +                                      struct device_attribute *attr, char *buf)
> +{
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       u64 v;
> +
> +       v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(TEMP_THRESHOLD1_STATUS, v));
> +}
> +
> +static ssize_t temp1_crit_status_show(struct device *dev,
> +                                     struct device_attribute *attr, char *buf)
> +{
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       u64 v;
> +
> +       v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(TEMP_THRESHOLD2_STATUS, v));
> +}
> +
> +static ssize_t temp1_alarm_policy_show(struct device *dev,
> +                                      struct device_attribute *attr, char *buf)
> +{
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       u64 v;
> +
> +       v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(TEMP_THRESHOLD1_POLICY, v));
> +}
> +
> +static DEVICE_ATTR_RO(temp1_alarm_status);
> +static DEVICE_ATTR_RO(temp1_crit_status);
> +static DEVICE_ATTR_RO(temp1_alarm_policy);
> +
> +static struct attribute *thermal_extra_attrs[] = {
> +       &dev_attr_temp1_alarm_status.attr,
> +       &dev_attr_temp1_crit_status.attr,
> +       &dev_attr_temp1_alarm_policy.attr,
> +       NULL,
> +};
> +
> +static umode_t thermal_extra_attrs_visible(struct kobject *kobj,
> +                                          struct attribute *attr, int index)
> +{
> +       struct device *dev = kobj_to_dev(kobj);
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +
> +       return fme_thermal_throttle_support(feature->ioaddr) ? attr->mode : 0;
> +}
> +
> +static const struct attribute_group thermal_extra_group = {
> +       .attrs          = thermal_extra_attrs,
> +       .is_visible     = thermal_extra_attrs_visible,
> +};
> +__ATTRIBUTE_GROUPS(thermal_extra);
> +
> +static int fme_thermal_mgmt_init(struct platform_device *pdev,
> +                                struct dfl_feature *feature)
> +{
> +       struct device *hwmon;
> +
> +       dev_dbg(&pdev->dev, "FME Thermal Management Init.\n");
> +
> +       /*
> +        * create hwmon to allow userspace monitoring temperature and other
> +        * threshold information.
> +        *
> +        * temp1_alarm     -> hardware threshold 1 -> 50% or 90% throttling
> +        * temp1_crit      -> hardware threshold 2 -> 100% throttling
> +        * temp1_emergency -> hardware trip_threshold to shutdown FPGA
> +        *
> +        * create device specific sysfs interfaces, e.g. read temp1_alarm_policy
> +        * to understand the actual hardware throttling action (50% vs 90%).
> +        *
> +        * If hardware doesn't support automatic throttling per thresholds,
> +        * then all above sysfs interfaces are not visible except temp1_input
> +        * for temperature.
> +        */
> +       hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
> +                                                    "dfl_fme_thermal", feature,
> +                                                    &thermal_hwmon_chip_info,
> +                                                    thermal_extra_groups);
> +       if (IS_ERR(hwmon)) {
> +               dev_err(&pdev->dev, "Fail to register thermal hwmon\n");
> +               return PTR_ERR(hwmon);
> +       }
> +
> +       return 0;
> +}
> +
> +static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
> +                                  struct dfl_feature *feature)
> +{
> +       dev_dbg(&pdev->dev, "FME Thermal Management UInit.\n");
> +}
> +
> +static const struct dfl_feature_id fme_thermal_mgmt_id_table[] = {
> +       {.id = FME_FEATURE_ID_THERMAL_MGMT,},
> +       {0,}
> +};
> +
> +static const struct dfl_feature_ops fme_thermal_mgmt_ops = {
> +       .init = fme_thermal_mgmt_init,
> +       .uinit = fme_thermal_mgmt_uinit,
> +};
> +
>  static struct dfl_feature_driver fme_feature_drvs[] = {
>         {
>                 .id_table = fme_hdr_id_table,
> @@ -227,6 +435,10 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
>                 .ops = &fme_pr_mgmt_ops,
>         },
>         {
> +               .id_table = fme_thermal_mgmt_id_table,
> +               .ops = &fme_thermal_mgmt_ops,
> +       },
> +       {
>                 .ops = NULL,
>         },
>  };
> --
> 1.8.3.1
>

^ permalink raw reply

* Re: [PATCH v2 16/18] fpga: dfl: fme: add power management support
From: Alan Tull @ 2019-05-07 18:23 UTC (permalink / raw)
  To: Wu Hao
  Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Luwei Kang,
	Xu Yilun, Jean Delvare, Guenter Roeck, Linux HWMON List
In-Reply-To: <1556528151-17221-17-git-send-email-hao.wu@intel.com>

On Mon, Apr 29, 2019 at 4:13 AM Wu Hao <hao.wu@intel.com> wrote:

+ hwmon folks

>
> This patch adds support for power management private feature under
> FPGA Management Engine (FME). This private feature driver registers
> a hwmon for power (power1_input), thresholds information, e.g.
> (power1_cap / crit) and also read-only sysfs interfaces for other
> power management information. For configuration, user could write
> threshold values via above power1_cap / crit sysfs interface
> under hwmon too.
>
> Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> ---
> v2: create a dfl_fme_power hwmon to expose power sysfs interfaces.
>     move all sysfs interfaces under hwmon
>         consumed          --> hwmon power1_input
>         threshold1        --> hwmon power1_cap
>         threshold2        --> hwmon power1_crit
>         threshold1_status --> hwmon power1_cap_status
>         threshold2_status --> hwmon power1_crit_status
>         xeon_limit        --> hwmon power1_xeon_limit
>         fpga_limit        --> hwmon power1_fpga_limit
>         ltr               --> hwmon power1_ltr
> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-fme |  67 ++++++
>  drivers/fpga/dfl-fme-main.c                      | 247 +++++++++++++++++++++++
>  2 files changed, 314 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> index dfbd315..e2ba92d 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> @@ -52,6 +52,7 @@ Contact:      Wu Hao <hao.wu@intel.com>
>  Description:   Read-Only. Read this file to get the name of hwmon device, it
>                 supports values:
>                     'dfl_fme_thermal' - thermal hwmon device name
> +                   'dfl_fme_power'   - power hwmon device name
>
>  What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
>  Date:          April 2019
> @@ -108,3 +109,69 @@ Description:       Read-Only. Read this file to get the policy of hardware threshold1
>                 (see 'temp1_alarm'). It only supports two values (policies):
>                     0 - AP2 state (90% throttling)
>                     1 - AP1 state (50% throttling)
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_input
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Only. It returns current FPGA power consumption in uW.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_cap
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Write. Read this file to get current hardware power
> +               threshold1 in uW. If power consumption rises at or above
> +               this threshold, hardware starts 50% throttling.
> +               Write this file to set current hardware power threshold1 in uW.
> +               As hardware only accepts values in Watts, so input value will
> +               be round down per Watts (< 1 watts part will be discarded).
> +               Write fails with -EINVAL if input parsing fails or input isn't
> +               in the valid range (0 - 127000000 uW).
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Write. Read this file to get current hardware power
> +               threshold2 in uW. If power consumption rises at or above
> +               this threshold, hardware starts 90% throttling.
> +               Write this file to set current hardware power threshold2 in uW.
> +               As hardware only accepts values in Watts, so input value will
> +               be round down per Watts (< 1 watts part will be discarded).
> +               Write fails with -EINVAL if input parsing fails or input isn't
> +               in the valid range (0 - 127000000 uW).
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_cap_status
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns 1 if power consumption is currently at or
> +               above hardware threshold1 (see 'power1_cap'), otherwise 0.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit_status
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns 1 if power consumption is currently at or
> +               above hardware threshold2 (see 'power1_crit'), otherwise 0.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_xeon_limit
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Only. It returns power limit for XEON in uW.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_fpga_limit
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Only. It returns power limit for FPGA in uW.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_ltr
> +Date:          April 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. Read this file to get current Latency Tolerance
> +               Reporting (ltr) value. This ltr impacts the CPU low power
> +               state in integrated solution.
> diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> index b9a68b8..7005316 100644
> --- a/drivers/fpga/dfl-fme-main.c
> +++ b/drivers/fpga/dfl-fme-main.c
> @@ -425,6 +425,249 @@ static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
>         .uinit = fme_thermal_mgmt_uinit,
>  };
>
> +#define FME_PWR_STATUS         0x8
> +#define FME_LATENCY_TOLERANCE  BIT_ULL(18)
> +#define PWR_CONSUMED           GENMASK_ULL(17, 0)
> +
> +#define FME_PWR_THRESHOLD      0x10
> +#define PWR_THRESHOLD1         GENMASK_ULL(6, 0)       /* in Watts */
> +#define PWR_THRESHOLD2         GENMASK_ULL(14, 8)      /* in Watts */
> +#define PWR_THRESHOLD_MAX      0x7f                    /* in Watts */
> +#define PWR_THRESHOLD1_STATUS  BIT_ULL(16)
> +#define PWR_THRESHOLD2_STATUS  BIT_ULL(17)
> +
> +#define FME_PWR_XEON_LIMIT     0x18
> +#define XEON_PWR_LIMIT         GENMASK_ULL(14, 0)      /* in 0.1 Watts */
> +#define XEON_PWR_EN            BIT_ULL(15)
> +#define FME_PWR_FPGA_LIMIT     0x20
> +#define FPGA_PWR_LIMIT         GENMASK_ULL(14, 0)      /* in 0.1 Watts */
> +#define FPGA_PWR_EN            BIT_ULL(15)
> +
> +#define PWR_THRESHOLD_MAX_IN_UW (PWR_THRESHOLD_MAX * 1000000)
> +
> +static int power_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
> +                           u32 attr, int channel, long *val)
> +{
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       u64 v;
> +
> +       switch (attr) {
> +       case hwmon_power_input:
> +               v = readq(feature->ioaddr + FME_PWR_STATUS);
> +               *val = (long)(FIELD_GET(PWR_CONSUMED, v) * 1000000);
> +               break;
> +       case hwmon_power_cap:
> +               v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +               *val = (long)(FIELD_GET(PWR_THRESHOLD1, v) * 1000000);
> +               break;
> +       case hwmon_power_crit:
> +               v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +               *val = (long)(FIELD_GET(PWR_THRESHOLD2, v) * 1000000);
> +               break;
> +       default:
> +               return -EOPNOTSUPP;
> +       }
> +
> +       return 0;
> +}
> +
> +static int power_hwmon_write(struct device *dev, enum hwmon_sensor_types type,
> +                            u32 attr, int channel, long val)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       int ret = 0;
> +       u64 v;
> +
> +       if (val < 0 || val > PWR_THRESHOLD_MAX_IN_UW)
> +               return -EINVAL;
> +
> +       val = val / 1000000;
> +
> +       mutex_lock(&pdata->lock);
> +
> +       switch (attr) {
> +       case hwmon_power_cap:
> +               v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +               v &= ~PWR_THRESHOLD1;
> +               v |= FIELD_PREP(PWR_THRESHOLD1, val);
> +               writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
> +               break;
> +       case hwmon_power_crit:
> +               v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +               v &= ~PWR_THRESHOLD2;
> +               v |= FIELD_PREP(PWR_THRESHOLD2, val);
> +               writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
> +               break;
> +       default:
> +               ret = -EOPNOTSUPP;
> +               break;
> +       }
> +
> +       mutex_unlock(&pdata->lock);
> +
> +       return ret;
> +}
> +
> +static umode_t power_hwmon_attrs_visible(const void *drvdata,
> +                                        enum hwmon_sensor_types type,
> +                                        u32 attr, int channel)
> +{
> +       switch (attr) {
> +       case hwmon_power_input:
> +               return 0444;
> +       case hwmon_power_cap:
> +       case hwmon_power_crit:
> +               return 0644;
> +       }
> +
> +       return 0;
> +}
> +
> +static const u32 power_hwmon_config[] = {
> +       HWMON_P_INPUT | HWMON_P_CAP | HWMON_P_CRIT,
> +       0
> +};
> +
> +static const struct hwmon_channel_info hwmon_pwr_info = {
> +       .type = hwmon_power,
> +       .config = power_hwmon_config,
> +};
> +
> +static const struct hwmon_channel_info *power_hwmon_info[] = {
> +       &hwmon_pwr_info,
> +       NULL
> +};
> +
> +static const struct hwmon_ops power_hwmon_ops = {
> +       .is_visible = power_hwmon_attrs_visible,
> +       .read = power_hwmon_read,
> +       .write = power_hwmon_write,
> +};
> +
> +static const struct hwmon_chip_info power_hwmon_chip_info = {
> +       .ops = &power_hwmon_ops,
> +       .info = power_hwmon_info,
> +};
> +
> +static ssize_t power1_cap_status_show(struct device *dev,
> +                                     struct device_attribute *attr, char *buf)
> +{
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       u64 v;
> +
> +       v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(PWR_THRESHOLD1_STATUS, v));
> +}
> +
> +static ssize_t power1_crit_status_show(struct device *dev,
> +                                      struct device_attribute *attr, char *buf)
> +{
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       u64 v;
> +
> +       v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(PWR_THRESHOLD2_STATUS, v));
> +}
> +
> +static ssize_t power1_xeon_limit_show(struct device *dev,
> +                                     struct device_attribute *attr, char *buf)
> +{
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       u16 xeon_limit = 0;
> +       u64 v;
> +
> +       v = readq(feature->ioaddr + FME_PWR_XEON_LIMIT);
> +
> +       if (FIELD_GET(XEON_PWR_EN, v))
> +               xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n", xeon_limit * 100000);
> +}
> +
> +static ssize_t power1_fpga_limit_show(struct device *dev,
> +                                     struct device_attribute *attr, char *buf)
> +{
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       u16 fpga_limit = 0;
> +       u64 v;
> +
> +       v = readq(feature->ioaddr + FME_PWR_FPGA_LIMIT);
> +
> +       if (FIELD_GET(FPGA_PWR_EN, v))
> +               fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n", fpga_limit * 100000);
> +}
> +
> +static ssize_t power1_ltr_show(struct device *dev,
> +                              struct device_attribute *attr, char *buf)
> +{
> +       struct dfl_feature *feature = dev_get_drvdata(dev);
> +       u64 v;
> +
> +       v = readq(feature->ioaddr + FME_PWR_STATUS);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
> +}
> +
> +static DEVICE_ATTR_RO(power1_cap_status);
> +static DEVICE_ATTR_RO(power1_crit_status);
> +static DEVICE_ATTR_RO(power1_xeon_limit);
> +static DEVICE_ATTR_RO(power1_fpga_limit);
> +static DEVICE_ATTR_RO(power1_ltr);
> +
> +static struct attribute *power_extra_attrs[] = {
> +       &dev_attr_power1_cap_status.attr,
> +       &dev_attr_power1_crit_status.attr,
> +       &dev_attr_power1_xeon_limit.attr,
> +       &dev_attr_power1_fpga_limit.attr,
> +       &dev_attr_power1_ltr.attr,
> +       NULL
> +};
> +
> +ATTRIBUTE_GROUPS(power_extra);
> +
> +static int fme_power_mgmt_init(struct platform_device *pdev,
> +                              struct dfl_feature *feature)
> +{
> +       struct device *hwmon;
> +
> +       dev_dbg(&pdev->dev, "FME Power Management Init.\n");
> +
> +       hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
> +                                                    "dfl_fme_power", feature,
> +                                                    &power_hwmon_chip_info,
> +                                                    power_extra_groups);
> +       if (IS_ERR(hwmon)) {
> +               dev_err(&pdev->dev, "Fail to register power hwmon\n");
> +               return PTR_ERR(hwmon);
> +       }
> +
> +       return 0;
> +}
> +
> +static void fme_power_mgmt_uinit(struct platform_device *pdev,
> +                                struct dfl_feature *feature)
> +{
> +       dev_dbg(&pdev->dev, "FME Power Management UInit.\n");
> +}
> +
> +static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
> +       {.id = FME_FEATURE_ID_POWER_MGMT,},
> +       {0,}
> +};
> +
> +static const struct dfl_feature_ops fme_power_mgmt_ops = {
> +       .init = fme_power_mgmt_init,
> +       .uinit = fme_power_mgmt_uinit,
> +};
> +
>  static struct dfl_feature_driver fme_feature_drvs[] = {
>         {
>                 .id_table = fme_hdr_id_table,
> @@ -439,6 +682,10 @@ static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
>                 .ops = &fme_thermal_mgmt_ops,
>         },
>         {
> +               .id_table = fme_power_mgmt_id_table,
> +               .ops = &fme_power_mgmt_ops,
> +       },
> +       {
>                 .ops = NULL,
>         },
>  };
> --
> 1.8.3.1
>

^ permalink raw reply

* Re: [PATCH v2 15/18] fpga: dfl: fme: add thermal management support
From: Moritz Fischer @ 2019-05-07 18:30 UTC (permalink / raw)
  To: Wu Hao
  Cc: atull, mdf, linux-fpga, linux-kernel, linux-api, Luwei Kang,
	Russ Weight, Xu Yilun, linux-hwmon, linux
In-Reply-To: <1556528151-17221-16-git-send-email-hao.wu@intel.com>

Please for next round:

+CC linux-hwmon, Guenter etc ...

On Mon, Apr 29, 2019 at 04:55:48PM +0800, Wu Hao wrote:
> This patch adds support to thermal management private feature for DFL
> FPGA Management Engine (FME). This private feature driver registers
> a hwmon for thermal/temperature monitoring (hwmon temp1_input).
> If hardware automatic throttling is supported by this hardware, then
> driver also exposes sysfs interfaces under hwmon for thresholds
> (temp1_alarm/ crit/ emergency), threshold status (temp1_alarm_status/
> temp1_crit_status) and throttling policy (temp1_alarm_policy).
> 
> Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> Signed-off-by: Russ Weight <russell.h.weight@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> ---
> v2: create a dfl_fme_thermal hwmon to expose thermal information.
>     move all sysfs interfaces under hwmon
> 	tempareture       --> hwmon temp1_input
> 	threshold1        --> hwmon temp1_alarm
> 	threshold2        --> hwmon temp1_crit
> 	trip_threshold    --> hwmon temp1_emergency
> 	threshold1_status --> hwmon temp1_alarm_status
> 	threshold2_status --> hwmon temp1_crit_status
> 	threshold1_policy --> hwmon temp1_alarm_policy
> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-fme |  64 +++++++
>  drivers/fpga/Kconfig                             |   2 +-
>  drivers/fpga/dfl-fme-main.c                      | 212 +++++++++++++++++++++++
>  3 files changed, 277 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> index d1aa375..dfbd315 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> @@ -44,3 +44,67 @@ Description:	Read-only. It returns socket_id to indicate which socket
>  		this FPGA belongs to, only valid for integrated solution.
>  		User only needs this information, in case standard numa node
>  		can't provide correct information.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/name
> +Date:		April 2019
> +KernelVersion:	5.2
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Only. Read this file to get the name of hwmon device, it
> +		supports values:
> +		    'dfl_fme_thermal' - thermal hwmon device name
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
> +Date:		April 2019
> +KernelVersion:	5.2
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Only. It returns FPGA device temperature in millidegrees
> +		Celsius.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_alarm
> +Date:		April 2019
> +KernelVersion:	5.2
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Only. It returns hardware threshold1 temperature in
> +		millidegrees Celsius. If temperature rises at or above this
> +		threshold, hardware starts 50% or 90% throttling (see
> +		'temp1_alarm_policy').
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit
> +Date:		April 2019
> +KernelVersion:	5.2
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Only. It returns hardware threshold2 temperature in
> +		millidegrees Celsius. If temperature rises at or above this
> +		threshold, hardware starts 100% throttling.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_emergency
> +Date:		April 2019
> +KernelVersion:	5.2
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Only. It returns hardware trip threshold temperature in
> +		millidegrees Celsius. If temperature rises at or above this
> +		threshold, a fatal event will be triggered to board management
> +		controller (BMC) to shutdown FPGA.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_alarm_status
> +Date:		April 2019
> +KernelVersion:	5.2
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. It returns 1 if temperature is currently at or above
> +		hardware threshold1 (see 'temp1_alarm'), otherwise 0.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit_status
> +Date:		April 2019
> +KernelVersion:	5.2
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. It returns 1 if temperature is currently at or above
> +		hardware threshold2 (see 'temp1_crit'), otherwise 0.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_alarm_policy
> +Date:		April 2019
> +KernelVersion:	5.2
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Only. Read this file to get the policy of hardware threshold1
> +		(see 'temp1_alarm'). It only supports two values (policies):
> +		    0 - AP2 state (90% throttling)
> +		    1 - AP1 state (50% throttling)
> diff --git a/drivers/fpga/Kconfig b/drivers/fpga/Kconfig
> index c20445b..a6d7588 100644
> --- a/drivers/fpga/Kconfig
> +++ b/drivers/fpga/Kconfig
> @@ -154,7 +154,7 @@ config FPGA_DFL
>  
>  config FPGA_DFL_FME
>  	tristate "FPGA DFL FME Driver"
> -	depends on FPGA_DFL
> +	depends on FPGA_DFL && HWMON
>  	help
>  	  The FPGA Management Engine (FME) is a feature device implemented
>  	  under Device Feature List (DFL) framework. Select this option to
> diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> index 8339ee8..b9a68b8 100644
> --- a/drivers/fpga/dfl-fme-main.c
> +++ b/drivers/fpga/dfl-fme-main.c
> @@ -14,6 +14,8 @@
>   *   Henry Mitchel <henry.mitchel@intel.com>
>   */
>  
> +#include <linux/hwmon.h>
> +#include <linux/hwmon-sysfs.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>  #include <linux/uaccess.h>
> @@ -217,6 +219,212 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
>  	.ioctl = fme_hdr_ioctl,
>  };
>  
> +#define FME_THERM_THRESHOLD	0x8
> +#define TEMP_THRESHOLD1		GENMASK_ULL(6, 0)
> +#define TEMP_THRESHOLD1_EN	BIT_ULL(7)
> +#define TEMP_THRESHOLD2		GENMASK_ULL(14, 8)
> +#define TEMP_THRESHOLD2_EN	BIT_ULL(15)
> +#define TRIP_THRESHOLD		GENMASK_ULL(30, 24)
> +#define TEMP_THRESHOLD1_STATUS	BIT_ULL(32)		/* threshold1 reached */
> +#define TEMP_THRESHOLD2_STATUS	BIT_ULL(33)		/* threshold2 reached */
> +/* threshold1 policy: 0 - AP2 (90% throttle) / 1 - AP1 (50% throttle) */
> +#define TEMP_THRESHOLD1_POLICY	BIT_ULL(44)
> +
> +#define FME_THERM_RDSENSOR_FMT1	0x10
> +#define FPGA_TEMPERATURE	GENMASK_ULL(6, 0)
> +
> +#define FME_THERM_CAP		0x20
> +#define THERM_NO_THROTTLE	BIT_ULL(0)
> +
> +#define MD_PRE_DEG
> +
> +static bool fme_thermal_throttle_support(void __iomem *base)
> +{
> +	u64 v = readq(base + FME_THERM_CAP);
> +
> +	return FIELD_GET(THERM_NO_THROTTLE, v) ? false : true;
> +}
> +
> +static umode_t thermal_hwmon_attrs_visible(const void *drvdata,
> +					   enum hwmon_sensor_types type,
> +					   u32 attr, int channel)
> +{
> +	const struct dfl_feature *feature = drvdata;
> +
> +	/* temperature is always supported, and check hardware cap for others */
> +	if (attr == hwmon_temp_input)
> +		return 0444;
> +
> +	return fme_thermal_throttle_support(feature->ioaddr) ? 0444 : 0;
> +}
> +
> +static int thermal_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
> +			      u32 attr, int channel, long *val)
> +{
> +	struct dfl_feature *feature = dev_get_drvdata(dev);
> +	u64 v;
> +
> +	switch (attr) {
> +	case hwmon_temp_input:
> +		v = readq(feature->ioaddr + FME_THERM_RDSENSOR_FMT1);
> +		*val = (long)(FIELD_GET(FPGA_TEMPERATURE, v) * 1000);
> +		break;
> +	case hwmon_temp_alarm:
> +		v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +		*val = (long)(FIELD_GET(TEMP_THRESHOLD1, v) * 1000);
> +		break;
> +	case hwmon_temp_crit:
> +		v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +		*val = (long)(FIELD_GET(TEMP_THRESHOLD2, v) * 1000);
> +		break;
> +	case hwmon_temp_emergency:
> +		v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +		*val = (long)(FIELD_GET(TRIP_THRESHOLD, v) * 1000);
> +		break;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct hwmon_ops thermal_hwmon_ops = {
> +	.is_visible = thermal_hwmon_attrs_visible,
> +	.read = thermal_hwmon_read,
> +};
> +
> +static const u32 thermal_hwmon_temp_config[] = {
> +	HWMON_T_INPUT | HWMON_T_ALARM | HWMON_T_CRIT | HWMON_T_EMERGENCY,
> +	0
> +};
> +
> +static const struct hwmon_channel_info hwmon_temp_info = {
> +	.type = hwmon_temp,
> +	.config = thermal_hwmon_temp_config,
> +};
> +
> +static const struct hwmon_channel_info *thermal_hwmon_info[] = {
> +	&hwmon_temp_info,
> +	NULL
> +};
> +
> +static const struct hwmon_chip_info thermal_hwmon_chip_info = {
> +	.ops = &thermal_hwmon_ops,
> +	.info = thermal_hwmon_info,
> +};
> +
> +static ssize_t temp1_alarm_status_show(struct device *dev,
> +				       struct device_attribute *attr, char *buf)
> +{
> +	struct dfl_feature *feature = dev_get_drvdata(dev);
> +	u64 v;
> +
> +	v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +
> +	return scnprintf(buf, PAGE_SIZE, "%u\n",
> +			 (unsigned int)FIELD_GET(TEMP_THRESHOLD1_STATUS, v));
> +}
> +
> +static ssize_t temp1_crit_status_show(struct device *dev,
> +				      struct device_attribute *attr, char *buf)
> +{
> +	struct dfl_feature *feature = dev_get_drvdata(dev);
> +	u64 v;
> +
> +	v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +
> +	return scnprintf(buf, PAGE_SIZE, "%u\n",
> +			 (unsigned int)FIELD_GET(TEMP_THRESHOLD2_STATUS, v));
> +}
> +
> +static ssize_t temp1_alarm_policy_show(struct device *dev,
> +				       struct device_attribute *attr, char *buf)
> +{
> +	struct dfl_feature *feature = dev_get_drvdata(dev);
> +	u64 v;
> +
> +	v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> +
> +	return scnprintf(buf, PAGE_SIZE, "%u\n",
> +			 (unsigned int)FIELD_GET(TEMP_THRESHOLD1_POLICY, v));
> +}
> +
> +static DEVICE_ATTR_RO(temp1_alarm_status);
> +static DEVICE_ATTR_RO(temp1_crit_status);
> +static DEVICE_ATTR_RO(temp1_alarm_policy);
> +
> +static struct attribute *thermal_extra_attrs[] = {
> +	&dev_attr_temp1_alarm_status.attr,
> +	&dev_attr_temp1_crit_status.attr,
> +	&dev_attr_temp1_alarm_policy.attr,
> +	NULL,
> +};
> +
> +static umode_t thermal_extra_attrs_visible(struct kobject *kobj,
> +					   struct attribute *attr, int index)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct dfl_feature *feature = dev_get_drvdata(dev);
> +
> +	return fme_thermal_throttle_support(feature->ioaddr) ? attr->mode : 0;
> +}
> +
> +static const struct attribute_group thermal_extra_group = {
> +	.attrs		= thermal_extra_attrs,
> +	.is_visible	= thermal_extra_attrs_visible,
> +};
> +__ATTRIBUTE_GROUPS(thermal_extra);
> +
> +static int fme_thermal_mgmt_init(struct platform_device *pdev,
> +				 struct dfl_feature *feature)
> +{
> +	struct device *hwmon;
> +
> +	dev_dbg(&pdev->dev, "FME Thermal Management Init.\n");
> +
> +	/*
> +	 * create hwmon to allow userspace monitoring temperature and other
> +	 * threshold information.
> +	 *
> +	 * temp1_alarm     -> hardware threshold 1 -> 50% or 90% throttling
> +	 * temp1_crit      -> hardware threshold 2 -> 100% throttling
> +	 * temp1_emergency -> hardware trip_threshold to shutdown FPGA
> +	 *
> +	 * create device specific sysfs interfaces, e.g. read temp1_alarm_policy
> +	 * to understand the actual hardware throttling action (50% vs 90%).
> +	 *
> +	 * If hardware doesn't support automatic throttling per thresholds,
> +	 * then all above sysfs interfaces are not visible except temp1_input
> +	 * for temperature.
> +	 */
> +	hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
> +						     "dfl_fme_thermal", feature,
> +						     &thermal_hwmon_chip_info,
> +						     thermal_extra_groups);
> +	if (IS_ERR(hwmon)) {
> +		dev_err(&pdev->dev, "Fail to register thermal hwmon\n");
> +		return PTR_ERR(hwmon);
> +	}
> +
> +	return 0;
> +}
> +
> +static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
> +				   struct dfl_feature *feature)
> +{
> +	dev_dbg(&pdev->dev, "FME Thermal Management UInit.\n");
> +}
> +
> +static const struct dfl_feature_id fme_thermal_mgmt_id_table[] = {
> +	{.id = FME_FEATURE_ID_THERMAL_MGMT,},
> +	{0,}
> +};
> +
> +static const struct dfl_feature_ops fme_thermal_mgmt_ops = {
> +	.init = fme_thermal_mgmt_init,
> +	.uinit = fme_thermal_mgmt_uinit,
> +};
> +
>  static struct dfl_feature_driver fme_feature_drvs[] = {
>  	{
>  		.id_table = fme_hdr_id_table,
> @@ -227,6 +435,10 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
>  		.ops = &fme_pr_mgmt_ops,
>  	},
>  	{
> +		.id_table = fme_thermal_mgmt_id_table,
> +		.ops = &fme_thermal_mgmt_ops,
> +	},
> +	{
>  		.ops = NULL,
>  	},
>  };
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH v2 15/18] fpga: dfl: fme: add thermal management support
From: Guenter Roeck @ 2019-05-07 18:35 UTC (permalink / raw)
  To: Alan Tull
  Cc: Wu Hao, Moritz Fischer, linux-fpga, linux-kernel, linux-api,
	Luwei Kang, Russ Weight, Xu Yilun, Jean Delvare, Linux HWMON List
In-Reply-To: <CANk1AXR+wRoQB40ZFTAPPp4mmkzZY+03bTwdB24fL0mYNck2Ug@mail.gmail.com>

On Tue, May 07, 2019 at 01:20:52PM -0500, Alan Tull wrote:
> On Mon, Apr 29, 2019 at 4:13 AM Wu Hao <hao.wu@intel.com> wrote:
> 
> + The hwmon people
> 
> >
> > This patch adds support to thermal management private feature for DFL
> > FPGA Management Engine (FME). This private feature driver registers
> > a hwmon for thermal/temperature monitoring (hwmon temp1_input).
> > If hardware automatic throttling is supported by this hardware, then
> > driver also exposes sysfs interfaces under hwmon for thresholds
> > (temp1_alarm/ crit/ emergency), threshold status (temp1_alarm_status/
> > temp1_crit_status) and throttling policy (temp1_alarm_policy).
> >
> > Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> > Signed-off-by: Russ Weight <russell.h.weight@intel.com>
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > ---
> > v2: create a dfl_fme_thermal hwmon to expose thermal information.
> >     move all sysfs interfaces under hwmon
> >         tempareture       --> hwmon temp1_input
> >         threshold1        --> hwmon temp1_alarm
> >         threshold2        --> hwmon temp1_crit
> >         trip_threshold    --> hwmon temp1_emergency
> >         threshold1_status --> hwmon temp1_alarm_status
> >         threshold2_status --> hwmon temp1_crit_status
> >         threshold1_policy --> hwmon temp1_alarm_policy

You should not write a hwmon driver if you don't want to follow the ABI.
The implementation will only confuse the sensors command, so what exactly
is the point ?

More on that below.

Guenter

> > ---
> >  Documentation/ABI/testing/sysfs-platform-dfl-fme |  64 +++++++
> >  drivers/fpga/Kconfig                             |   2 +-
> >  drivers/fpga/dfl-fme-main.c                      | 212 +++++++++++++++++++++++
> >  3 files changed, 277 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > index d1aa375..dfbd315 100644
> > --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > @@ -44,3 +44,67 @@ Description: Read-only. It returns socket_id to indicate which socket
> >                 this FPGA belongs to, only valid for integrated solution.
> >                 User only needs this information, in case standard numa node
> >                 can't provide correct information.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/name
> > +Date:          April 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Only. Read this file to get the name of hwmon device, it
> > +               supports values:
> > +                   'dfl_fme_thermal' - thermal hwmon device name
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
> > +Date:          April 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Only. It returns FPGA device temperature in millidegrees
> > +               Celsius.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_alarm
> > +Date:          April 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Only. It returns hardware threshold1 temperature in
> > +               millidegrees Celsius. If temperature rises at or above this
> > +               threshold, hardware starts 50% or 90% throttling (see
> > +               'temp1_alarm_policy').
> > +

This does not follow the ABI. temp1_alarm is the alarm status, not the alarm
temperature. The ABI attribute name would be temp1_max.

> > +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit
> > +Date:          April 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Only. It returns hardware threshold2 temperature in
> > +               millidegrees Celsius. If temperature rises at or above this
> > +               threshold, hardware starts 100% throttling.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_emergency
> > +Date:          April 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Only. It returns hardware trip threshold temperature in
> > +               millidegrees Celsius. If temperature rises at or above this
> > +               threshold, a fatal event will be triggered to board management
> > +               controller (BMC) to shutdown FPGA.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_alarm_status
> > +Date:          April 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns 1 if temperature is currently at or above
> > +               hardware threshold1 (see 'temp1_alarm'), otherwise 0.
> > +

Why not follow the ABI and use temp1_alarm ?

> > +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit_status
> > +Date:          April 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns 1 if temperature is currently at or above
> > +               hardware threshold2 (see 'temp1_crit'), otherwise 0.
> > +

Why not follow the ABI and use temp1_crit_alarm ?

> > +What:          /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_alarm_policy
> > +Date:          April 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Only. Read this file to get the policy of hardware threshold1
> > +               (see 'temp1_alarm'). It only supports two values (policies):
> > +                   0 - AP2 state (90% throttling)
> > +                   1 - AP1 state (50% throttling)
> > diff --git a/drivers/fpga/Kconfig b/drivers/fpga/Kconfig
> > index c20445b..a6d7588 100644
> > --- a/drivers/fpga/Kconfig
> > +++ b/drivers/fpga/Kconfig
> > @@ -154,7 +154,7 @@ config FPGA_DFL
> >
> >  config FPGA_DFL_FME
> >         tristate "FPGA DFL FME Driver"
> > -       depends on FPGA_DFL
> > +       depends on FPGA_DFL && HWMON
> >         help
> >           The FPGA Management Engine (FME) is a feature device implemented
> >           under Device Feature List (DFL) framework. Select this option to
> > diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> > index 8339ee8..b9a68b8 100644
> > --- a/drivers/fpga/dfl-fme-main.c
> > +++ b/drivers/fpga/dfl-fme-main.c
> > @@ -14,6 +14,8 @@
> >   *   Henry Mitchel <henry.mitchel@intel.com>
> >   */
> >
> > +#include <linux/hwmon.h>
> > +#include <linux/hwmon-sysfs.h>
> >  #include <linux/kernel.h>
> >  #include <linux/module.h>
> >  #include <linux/uaccess.h>
> > @@ -217,6 +219,212 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
> >         .ioctl = fme_hdr_ioctl,
> >  };
> >
> > +#define FME_THERM_THRESHOLD    0x8
> > +#define TEMP_THRESHOLD1                GENMASK_ULL(6, 0)
> > +#define TEMP_THRESHOLD1_EN     BIT_ULL(7)
> > +#define TEMP_THRESHOLD2                GENMASK_ULL(14, 8)
> > +#define TEMP_THRESHOLD2_EN     BIT_ULL(15)
> > +#define TRIP_THRESHOLD         GENMASK_ULL(30, 24)
> > +#define TEMP_THRESHOLD1_STATUS BIT_ULL(32)             /* threshold1 reached */
> > +#define TEMP_THRESHOLD2_STATUS BIT_ULL(33)             /* threshold2 reached */
> > +/* threshold1 policy: 0 - AP2 (90% throttle) / 1 - AP1 (50% throttle) */
> > +#define TEMP_THRESHOLD1_POLICY BIT_ULL(44)
> > +
> > +#define FME_THERM_RDSENSOR_FMT1        0x10
> > +#define FPGA_TEMPERATURE       GENMASK_ULL(6, 0)
> > +
> > +#define FME_THERM_CAP          0x20
> > +#define THERM_NO_THROTTLE      BIT_ULL(0)
> > +
> > +#define MD_PRE_DEG
> > +
> > +static bool fme_thermal_throttle_support(void __iomem *base)
> > +{
> > +       u64 v = readq(base + FME_THERM_CAP);
> > +
> > +       return FIELD_GET(THERM_NO_THROTTLE, v) ? false : true;
> > +}
> > +
> > +static umode_t thermal_hwmon_attrs_visible(const void *drvdata,
> > +                                          enum hwmon_sensor_types type,
> > +                                          u32 attr, int channel)
> > +{
> > +       const struct dfl_feature *feature = drvdata;
> > +
> > +       /* temperature is always supported, and check hardware cap for others */
> > +       if (attr == hwmon_temp_input)
> > +               return 0444;
> > +
> > +       return fme_thermal_throttle_support(feature->ioaddr) ? 0444 : 0;
> > +}
> > +
> > +static int thermal_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
> > +                             u32 attr, int channel, long *val)
> > +{
> > +       struct dfl_feature *feature = dev_get_drvdata(dev);
> > +       u64 v;
> > +
> > +       switch (attr) {
> > +       case hwmon_temp_input:
> > +               v = readq(feature->ioaddr + FME_THERM_RDSENSOR_FMT1);
> > +               *val = (long)(FIELD_GET(FPGA_TEMPERATURE, v) * 1000);
> > +               break;
> > +       case hwmon_temp_alarm:
> > +               v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> > +               *val = (long)(FIELD_GET(TEMP_THRESHOLD1, v) * 1000);

This is supposed to return 0 or 1.

> > +               break;
> > +       case hwmon_temp_crit:
> > +               v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> > +               *val = (long)(FIELD_GET(TEMP_THRESHOLD2, v) * 1000);
> > +               break;
> > +       case hwmon_temp_emergency:
> > +               v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> > +               *val = (long)(FIELD_GET(TRIP_THRESHOLD, v) * 1000);
> > +               break;
> > +       default:
> > +               return -EOPNOTSUPP;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static const struct hwmon_ops thermal_hwmon_ops = {
> > +       .is_visible = thermal_hwmon_attrs_visible,
> > +       .read = thermal_hwmon_read,
> > +};
> > +
> > +static const u32 thermal_hwmon_temp_config[] = {
> > +       HWMON_T_INPUT | HWMON_T_ALARM | HWMON_T_CRIT | HWMON_T_EMERGENCY,
> > +       0
> > +};
> > +
> > +static const struct hwmon_channel_info hwmon_temp_info = {
> > +       .type = hwmon_temp,
> > +       .config = thermal_hwmon_temp_config,
> > +};
> > +
> > +static const struct hwmon_channel_info *thermal_hwmon_info[] = {
> > +       &hwmon_temp_info,
> > +       NULL
> > +};
> > +
> > +static const struct hwmon_chip_info thermal_hwmon_chip_info = {
> > +       .ops = &thermal_hwmon_ops,
> > +       .info = thermal_hwmon_info,
> > +};
> > +
> > +static ssize_t temp1_alarm_status_show(struct device *dev,
> > +                                      struct device_attribute *attr, char *buf)
> > +{
> > +       struct dfl_feature *feature = dev_get_drvdata(dev);
> > +       u64 v;
> > +
> > +       v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(TEMP_THRESHOLD1_STATUS, v));
> > +}
> > +
> > +static ssize_t temp1_crit_status_show(struct device *dev,
> > +                                     struct device_attribute *attr, char *buf)
> > +{
> > +       struct dfl_feature *feature = dev_get_drvdata(dev);
> > +       u64 v;
> > +
> > +       v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(TEMP_THRESHOLD2_STATUS, v));
> > +}
> > +
> > +static ssize_t temp1_alarm_policy_show(struct device *dev,
> > +                                      struct device_attribute *attr, char *buf)
> > +{
> > +       struct dfl_feature *feature = dev_get_drvdata(dev);
> > +       u64 v;
> > +
> > +       v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > +                        (unsigned int)FIELD_GET(TEMP_THRESHOLD1_POLICY, v));
> > +}
> > +
> > +static DEVICE_ATTR_RO(temp1_alarm_status);
> > +static DEVICE_ATTR_RO(temp1_crit_status);
> > +static DEVICE_ATTR_RO(temp1_alarm_policy);
> > +
> > +static struct attribute *thermal_extra_attrs[] = {
> > +       &dev_attr_temp1_alarm_status.attr,
> > +       &dev_attr_temp1_crit_status.attr,

Why not use standard attributes for the above ?

> > +       &dev_attr_temp1_alarm_policy.attr,
> > +       NULL,
> > +};
> > +
> > +static umode_t thermal_extra_attrs_visible(struct kobject *kobj,
> > +                                          struct attribute *attr, int index)
> > +{
> > +       struct device *dev = kobj_to_dev(kobj);
> > +       struct dfl_feature *feature = dev_get_drvdata(dev);
> > +
> > +       return fme_thermal_throttle_support(feature->ioaddr) ? attr->mode : 0;
> > +}
> > +
> > +static const struct attribute_group thermal_extra_group = {
> > +       .attrs          = thermal_extra_attrs,
> > +       .is_visible     = thermal_extra_attrs_visible,
> > +};
> > +__ATTRIBUTE_GROUPS(thermal_extra);
> > +
> > +static int fme_thermal_mgmt_init(struct platform_device *pdev,
> > +                                struct dfl_feature *feature)
> > +{
> > +       struct device *hwmon;
> > +
> > +       dev_dbg(&pdev->dev, "FME Thermal Management Init.\n");
> > +
> > +       /*
> > +        * create hwmon to allow userspace monitoring temperature and other
> > +        * threshold information.
> > +        *
> > +        * temp1_alarm     -> hardware threshold 1 -> 50% or 90% throttling
> > +        * temp1_crit      -> hardware threshold 2 -> 100% throttling
> > +        * temp1_emergency -> hardware trip_threshold to shutdown FPGA
> > +        *
> > +        * create device specific sysfs interfaces, e.g. read temp1_alarm_policy
> > +        * to understand the actual hardware throttling action (50% vs 90%).
> > +        *
> > +        * If hardware doesn't support automatic throttling per thresholds,
> > +        * then all above sysfs interfaces are not visible except temp1_input
> > +        * for temperature.
> > +        */
> > +       hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
> > +                                                    "dfl_fme_thermal", feature,
> > +                                                    &thermal_hwmon_chip_info,
> > +                                                    thermal_extra_groups);
> > +       if (IS_ERR(hwmon)) {
> > +               dev_err(&pdev->dev, "Fail to register thermal hwmon\n");
> > +               return PTR_ERR(hwmon);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
> > +                                  struct dfl_feature *feature)
> > +{
> > +       dev_dbg(&pdev->dev, "FME Thermal Management UInit.\n");
> > +}
> > +
> > +static const struct dfl_feature_id fme_thermal_mgmt_id_table[] = {
> > +       {.id = FME_FEATURE_ID_THERMAL_MGMT,},
> > +       {0,}
> > +};
> > +
> > +static const struct dfl_feature_ops fme_thermal_mgmt_ops = {
> > +       .init = fme_thermal_mgmt_init,
> > +       .uinit = fme_thermal_mgmt_uinit,
> > +};
> > +
> >  static struct dfl_feature_driver fme_feature_drvs[] = {
> >         {
> >                 .id_table = fme_hdr_id_table,
> > @@ -227,6 +435,10 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
> >                 .ops = &fme_pr_mgmt_ops,
> >         },
> >         {
> > +               .id_table = fme_thermal_mgmt_id_table,
> > +               .ops = &fme_thermal_mgmt_ops,
> > +       },
> > +       {
> >                 .ops = NULL,
> >         },
> >  };
> > --
> > 1.8.3.1
> >

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox