Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH 3/7 v2] tracing: Get trace_array reference for available_tracers files
From: Steven Rostedt @ 2019-10-12  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Matthew Garrett,
	James Morris James Morris, LSM List, Linux API, Ben Hutchings,
	Al Viro, stable
In-Reply-To: <20191012005747.210722465@goodmis.org>

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

As instances may have different tracers available, we need to look at the
trace_array descriptor that shows lists the available tracers for the
instance. But there's a race between opening the file and the admin from
deleting the instance. The trace_array_get() needs to be called before
accessing the trace_array.

Cc: stable@vger.kernel.org
Fixes: 607e2ea167e56 ("tracing: Set up infrastructure to allow tracers for instances")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 252f79c435f8..fa7d813b04c6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4355,9 +4355,14 @@ static int show_traces_open(struct inode *inode, struct file *file)
 	if (tracing_disabled)
 		return -ENODEV;
 
+	if (trace_array_get(tr) < 0)
+		return -ENODEV;
+
 	ret = seq_open(file, &show_traces_seq_ops);
-	if (ret)
+	if (ret) {
+		trace_array_put(tr);
 		return ret;
+	}
 
 	m = file->private_data;
 	m->private = tr;
@@ -4365,6 +4370,14 @@ static int show_traces_open(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static int show_traces_release(struct inode *inode, struct file *file)
+{
+	struct trace_array *tr = inode->i_private;
+
+	trace_array_put(tr);
+	return seq_release(inode, file);
+}
+
 static ssize_t
 tracing_write_stub(struct file *filp, const char __user *ubuf,
 		   size_t count, loff_t *ppos)
@@ -4395,8 +4408,8 @@ static const struct file_operations tracing_fops = {
 static const struct file_operations show_traces_fops = {
 	.open		= show_traces_open,
 	.read		= seq_read,
-	.release	= seq_release,
 	.llseek		= seq_lseek,
+	.release	= show_traces_release,
 };
 
 static ssize_t
-- 
2.23.0

^ permalink raw reply related

* [PATCH 4/7 v2] tracing: Have trace events system open call tracing_open_generic_tr()
From: Steven Rostedt @ 2019-10-12  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Matthew Garrett,
	James Morris James Morris, LSM List, Linux API, Ben Hutchings,
	Al Viro
In-Reply-To: <20191012005747.210722465@goodmis.org>

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Instead of having the trace events system open calls open code the taking of
the trace_array descriptor (with trace_array_get()) and then calling
trace_open_generic(), have it use the tracing_open_generic_tr() that does
the combination of the two. This requires making tracing_open_generic_tr()
global.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/trace.c        |  2 +-
 kernel/trace/trace.h        |  1 +
 kernel/trace/trace_events.c | 31 +++++--------------------------
 3 files changed, 7 insertions(+), 27 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index fa7d813b04c6..94f1b9124939 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4156,7 +4156,7 @@ bool tracing_is_disabled(void)
  * Open and update trace_array ref count.
  * Must have the current trace_array passed to it.
  */
-static int tracing_open_generic_tr(struct inode *inode, struct file *filp)
+int tracing_open_generic_tr(struct inode *inode, struct file *filp)
 {
 	struct trace_array *tr = inode->i_private;
 
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index f801d154ff6a..854dbf4050f8 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -681,6 +681,7 @@ void tracing_reset_online_cpus(struct trace_buffer *buf);
 void tracing_reset_current(int cpu);
 void tracing_reset_all_online_cpus(void);
 int tracing_open_generic(struct inode *inode, struct file *filp);
+int tracing_open_generic_tr(struct inode *inode, struct file *filp);
 bool tracing_is_disabled(void);
 bool tracer_tracing_is_on(struct trace_array *tr);
 void tracer_tracing_on(struct trace_array *tr);
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index b89cdfe20bc1..0d29f2f13477 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1391,9 +1391,6 @@ static int subsystem_open(struct inode *inode, struct file *filp)
 	struct trace_array *tr;
 	int ret;
 
-	if (tracing_is_disabled())
-		return -ENODEV;
-
 	/* Make sure the system still exists */
 	mutex_lock(&event_mutex);
 	mutex_lock(&trace_types_lock);
@@ -1420,16 +1417,9 @@ static int subsystem_open(struct inode *inode, struct file *filp)
 	WARN_ON(!dir);
 
 	/* Still need to increment the ref count of the system */
-	if (trace_array_get(tr) < 0) {
-		put_system(dir);
-		return -ENODEV;
-	}
-
-	ret = tracing_open_generic(inode, filp);
-	if (ret < 0) {
-		trace_array_put(tr);
+	ret = tracing_open_generic_tr(inode, filp);
+	if (ret < 0)
 		put_system(dir);
-	}
 
 	return ret;
 }
@@ -1440,28 +1430,17 @@ static int system_tr_open(struct inode *inode, struct file *filp)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (tracing_is_disabled())
-		return -ENODEV;
-
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
-
 	/* Make a temporary dir that has no system but points to tr */
 	dir = kzalloc(sizeof(*dir), GFP_KERNEL);
-	if (!dir) {
-		trace_array_put(tr);
+	if (!dir)
 		return -ENOMEM;
-	}
 
-	dir->tr = tr;
-
-	ret = tracing_open_generic(inode, filp);
+	ret = tracing_open_generic_tr(inode, filp);
 	if (ret < 0) {
-		trace_array_put(tr);
 		kfree(dir);
 		return ret;
 	}
-
+	dir->tr = tr;
 	filp->private_data = dir;
 
 	return 0;
-- 
2.23.0

^ permalink raw reply related

* [PATCH 5/7 v2] tracing: Add tracing_check_open_get_tr()
From: Steven Rostedt @ 2019-10-12  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Matthew Garrett,
	James Morris James Morris, LSM List, Linux API, Ben Hutchings,
	Al Viro
In-Reply-To: <20191012005747.210722465@goodmis.org>

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Currently, most files in the tracefs directory that gets opened needs to
test if tracing_disabled is set. If so, it should return -ENODEV. The
tracing_disabled is called when tracing is found to be broken. Originally it
was done in case the ring buffer was found to be corrupted, and we wanted to
prevent reading it from crashing the kernel. But it's also called if a
tracing selftest fails on boot. It's a one way switch. That is, once it is
triggered, tracing is disabled until reboot.

As most tracefs files can also be used by instances in the tracefs
directory, they need to be carefully done. Each instance has a trace_array
associated to it, and when the instance is removed, the trace_array is
freed. But if an instance is opened with a reference to the trace_array, the
getting the trace_array has to be done by a lookup (as there could be a race
with it being deleted and the open itself), and then once it is found, a
reference is added to prevent the instance from being removed (and the
trace_array associated with it freed).

Combine the two checks (tracing_disabled and trace_array_get()) into a
single helper function. This will also make it easier to add lockdown to
tracefs later.

Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ftrace.c            |   7 +-
 kernel/trace/trace.c             | 117 +++++++++++++++++--------------
 kernel/trace/trace.h             |   1 +
 kernel/trace/trace_dynevent.c    |   4 ++
 kernel/trace/trace_events.c      |  10 +--
 kernel/trace/trace_events_hist.c |   2 +-
 6 files changed, 81 insertions(+), 60 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 32c2eb167de0..8b765a55e01c 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -3547,7 +3547,7 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
 	if (unlikely(ftrace_disabled))
 		return -ENODEV;
 
-	if (tr && trace_array_get(tr) < 0)
+	if (tracing_check_open_get_tr(tr))
 		return -ENODEV;
 
 	iter = kzalloc(sizeof(*iter), GFP_KERNEL);
@@ -6546,8 +6546,9 @@ ftrace_pid_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	int ret = 0;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	if ((file->f_mode & FMODE_WRITE) &&
 	    (file->f_flags & O_TRUNC))
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 94f1b9124939..26ee280f852b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -304,6 +304,17 @@ void trace_array_put(struct trace_array *this_tr)
 	mutex_unlock(&trace_types_lock);
 }
 
+int tracing_check_open_get_tr(struct trace_array *tr)
+{
+	if (tracing_disabled)
+		return -ENODEV;
+
+	if (tr && trace_array_get(tr) < 0)
+		return -ENODEV;
+
+	return 0;
+}
+
 int call_filter_check_discard(struct trace_event_call *call, void *rec,
 			      struct ring_buffer *buffer,
 			      struct ring_buffer_event *event)
@@ -4140,8 +4151,11 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
 
 int tracing_open_generic(struct inode *inode, struct file *filp)
 {
-	if (tracing_disabled)
-		return -ENODEV;
+	int ret;
+
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
 
 	filp->private_data = inode->i_private;
 	return 0;
@@ -4159,12 +4173,11 @@ bool tracing_is_disabled(void)
 int tracing_open_generic_tr(struct inode *inode, struct file *filp)
 {
 	struct trace_array *tr = inode->i_private;
+	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	filp->private_data = inode->i_private;
 
@@ -4233,10 +4246,11 @@ static int tracing_open(struct inode *inode, struct file *file)
 {
 	struct trace_array *tr = inode->i_private;
 	struct trace_iterator *iter;
-	int ret = 0;
+	int ret;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	/* If this file was open for write, then erase contents */
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
@@ -4352,11 +4366,9 @@ static int show_traces_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	ret = seq_open(file, &show_traces_seq_ops);
 	if (ret) {
@@ -4710,11 +4722,9 @@ static int tracing_trace_options_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	ret = single_open(file, tracing_trace_options_show, inode->i_private);
 	if (ret < 0)
@@ -5051,8 +5061,11 @@ static const struct seq_operations tracing_saved_tgids_seq_ops = {
 
 static int tracing_saved_tgids_open(struct inode *inode, struct file *filp)
 {
-	if (tracing_disabled)
-		return -ENODEV;
+	int ret;
+
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
 
 	return seq_open(filp, &tracing_saved_tgids_seq_ops);
 }
@@ -5128,8 +5141,11 @@ static const struct seq_operations tracing_saved_cmdlines_seq_ops = {
 
 static int tracing_saved_cmdlines_open(struct inode *inode, struct file *filp)
 {
-	if (tracing_disabled)
-		return -ENODEV;
+	int ret;
+
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
 
 	return seq_open(filp, &tracing_saved_cmdlines_seq_ops);
 }
@@ -5293,8 +5309,11 @@ static const struct seq_operations tracing_eval_map_seq_ops = {
 
 static int tracing_eval_map_open(struct inode *inode, struct file *filp)
 {
-	if (tracing_disabled)
-		return -ENODEV;
+	int ret;
+
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
 
 	return seq_open(filp, &tracing_eval_map_seq_ops);
 }
@@ -5817,13 +5836,11 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp)
 {
 	struct trace_array *tr = inode->i_private;
 	struct trace_iterator *iter;
-	int ret = 0;
-
-	if (tracing_disabled)
-		return -ENODEV;
+	int ret;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	mutex_lock(&trace_types_lock);
 
@@ -6560,11 +6577,9 @@ static int tracing_clock_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr))
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	ret = single_open(file, tracing_clock_show, inode->i_private);
 	if (ret < 0)
@@ -6594,11 +6609,9 @@ static int tracing_time_stamp_mode_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr))
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	ret = single_open(file, tracing_time_stamp_mode_show, inode->i_private);
 	if (ret < 0)
@@ -6651,10 +6664,11 @@ static int tracing_snapshot_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	struct trace_iterator *iter;
 	struct seq_file *m;
-	int ret = 0;
+	int ret;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	if (file->f_mode & FMODE_READ) {
 		iter = __tracing_open(inode, file, true);
@@ -7118,8 +7132,9 @@ static int tracing_err_log_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret = 0;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	/* If this file was opened for write, then erase contents */
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC))
@@ -7170,11 +7185,9 @@ static int tracing_buffers_open(struct inode *inode, struct file *filp)
 	struct ftrace_buffer_info *info;
 	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	info = kzalloc(sizeof(*info), GFP_KERNEL);
 	if (!info) {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 854dbf4050f8..d685c61085c0 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -338,6 +338,7 @@ extern struct mutex trace_types_lock;
 
 extern int trace_array_get(struct trace_array *tr);
 extern void trace_array_put(struct trace_array *tr);
+extern int tracing_check_open_get_tr(struct trace_array *tr);
 
 extern int tracing_set_time_stamp_abs(struct trace_array *tr, bool abs);
 extern int tracing_set_clock(struct trace_array *tr, const char *clockstr);
diff --git a/kernel/trace/trace_dynevent.c b/kernel/trace/trace_dynevent.c
index a41fed46c285..89779eb84a07 100644
--- a/kernel/trace/trace_dynevent.c
+++ b/kernel/trace/trace_dynevent.c
@@ -174,6 +174,10 @@ static int dyn_event_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(NULL);
 		if (ret < 0)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 0d29f2f13477..ae35d6cb802a 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1784,8 +1784,9 @@ ftrace_event_set_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	if ((file->f_mode & FMODE_WRITE) &&
 	    (file->f_flags & O_TRUNC))
@@ -1804,8 +1805,9 @@ ftrace_event_set_pid_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	if ((file->f_mode & FMODE_WRITE) &&
 	    (file->f_flags & O_TRUNC))
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 9468bd8d44a2..dd18d76bf1bd 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1680,7 +1680,7 @@ static int save_hist_vars(struct hist_trigger_data *hist_data)
 	if (var_data)
 		return 0;
 
-	if (trace_array_get(tr) < 0)
+	if (tracing_check_open_get_tr(tr))
 		return -ENODEV;
 
 	var_data = kzalloc(sizeof(*var_data), GFP_KERNEL);
-- 
2.23.0

^ permalink raw reply related

* [PATCH 6/7 v2] tracing: Add some more locked_down checks
From: Steven Rostedt @ 2019-10-12  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Matthew Garrett,
	James Morris James Morris, LSM List, Linux API, Ben Hutchings,
	Al Viro
In-Reply-To: <20191012005747.210722465@goodmis.org>

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Added various checks on open tracefs calls to see if tracefs is in lockdown
mode, and if so, to return -EPERM.

Note, the event format files (which are basically standard on all machines)
as well as the enabled_functions file (which shows what is currently being
traced) are not lockde down. Perhaps they should be, but it seems counter
intuitive to lockdown information to help you know if the system has been
modified.

Link: http://lkml.kernel.org/r/CAHk-=wj7fGPKUspr579Cii-w_y60PtRaiDgKuxVtBAMK0VNNkA@mail.gmail.com

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ftrace.c               | 23 ++++++++++++++++++++++-
 kernel/trace/trace.c                |  8 ++++++++
 kernel/trace/trace_events.c         |  8 ++++++++
 kernel/trace/trace_events_hist.c    | 11 +++++++++++
 kernel/trace/trace_events_trigger.c |  8 +++++++-
 kernel/trace/trace_kprobe.c         | 12 +++++++++++-
 kernel/trace/trace_printk.c         |  7 +++++++
 kernel/trace/trace_stack.c          |  8 ++++++++
 kernel/trace/trace_stat.c           |  6 +++++-
 kernel/trace/trace_uprobe.c         | 11 +++++++++++
 10 files changed, 98 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8b765a55e01c..f296d89be757 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -18,6 +18,7 @@
 #include <linux/clocksource.h>
 #include <linux/sched/task.h>
 #include <linux/kallsyms.h>
+#include <linux/security.h>
 #include <linux/seq_file.h>
 #include <linux/tracefs.h>
 #include <linux/hardirq.h>
@@ -3486,6 +3487,11 @@ static int
 ftrace_avail_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_iterator *iter;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	if (unlikely(ftrace_disabled))
 		return -ENODEV;
@@ -3505,6 +3511,15 @@ ftrace_enabled_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_iterator *iter;
 
+	/*
+	 * This shows us what functions are currently being
+	 * traced and by what. Not sure if we want lockdown
+	 * to hide such critical information for an admin.
+	 * Although, perhaps it can show information we don't
+	 * want people to see, but if something is tracing
+	 * something, we probably want to know about it.
+	 */
+
 	iter = __seq_open_private(file, &show_ftrace_seq_ops, sizeof(*iter));
 	if (!iter)
 		return -ENOMEM;
@@ -3625,6 +3640,7 @@ ftrace_filter_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_ops *ops = inode->i_private;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_regex_open(ops,
 			FTRACE_ITER_FILTER | FTRACE_ITER_DO_PROBES,
 			inode, file);
@@ -3635,6 +3651,7 @@ ftrace_notrace_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_ops *ops = inode->i_private;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_regex_open(ops, FTRACE_ITER_NOTRACE,
 				 inode, file);
 }
@@ -5203,9 +5220,13 @@ static int
 __ftrace_graph_open(struct inode *inode, struct file *file,
 		    struct ftrace_graph_data *fgd)
 {
-	int ret = 0;
+	int ret;
 	struct ftrace_hash *new_hash = NULL;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (file->f_mode & FMODE_WRITE) {
 		const int size_bits = FTRACE_HASH_DEFAULT_BITS;
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 26ee280f852b..2b4eff383505 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -17,6 +17,7 @@
 #include <linux/stacktrace.h>
 #include <linux/writeback.h>
 #include <linux/kallsyms.h>
+#include <linux/security.h>
 #include <linux/seq_file.h>
 #include <linux/notifier.h>
 #include <linux/irqflags.h>
@@ -306,6 +307,12 @@ void trace_array_put(struct trace_array *this_tr)
 
 int tracing_check_open_get_tr(struct trace_array *tr)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (tracing_disabled)
 		return -ENODEV;
 
@@ -6813,6 +6820,7 @@ static int snapshot_raw_open(struct inode *inode, struct file *filp)
 	struct ftrace_buffer_info *info;
 	int ret;
 
+	/* The following checks for tracefs lockdown */
 	ret = tracing_buffers_open(inode, filp);
 	if (ret < 0)
 		return ret;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index ae35d6cb802a..994b7a408c5f 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -12,6 +12,7 @@
 #define pr_fmt(fmt) fmt
 
 #include <linux/workqueue.h>
+#include <linux/security.h>
 #include <linux/spinlock.h>
 #include <linux/kthread.h>
 #include <linux/tracefs.h>
@@ -1294,6 +1295,8 @@ static int trace_format_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	int ret;
 
+	/* Do we want to hide event format files on tracefs lockdown? */
+
 	ret = seq_open(file, &trace_format_seq_ops);
 	if (ret < 0)
 		return ret;
@@ -1750,6 +1753,10 @@ ftrace_event_open(struct inode *inode, struct file *file,
 	struct seq_file *m;
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	ret = seq_open(file, seq_ops);
 	if (ret < 0)
 		return ret;
@@ -1774,6 +1781,7 @@ ftrace_event_avail_open(struct inode *inode, struct file *file)
 {
 	const struct seq_operations *seq_ops = &show_event_seq_ops;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_event_open(inode, file, seq_ops);
 }
 
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index dd18d76bf1bd..57648c5aa679 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -7,6 +7,7 @@
 
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/security.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include <linux/stacktrace.h>
@@ -1448,6 +1449,10 @@ static int synth_events_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(&synth_event_ops);
 		if (ret < 0)
@@ -5515,6 +5520,12 @@ static int hist_show(struct seq_file *m, void *v)
 
 static int event_hist_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return single_open(file, hist_show, file);
 }
 
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index 2a2912cb4533..2cd53ca21b51 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2013 Tom Zanussi <tom.zanussi@linux.intel.com>
  */
 
+#include <linux/security.h>
 #include <linux/module.h>
 #include <linux/ctype.h>
 #include <linux/mutex.h>
@@ -173,7 +174,11 @@ static const struct seq_operations event_triggers_seq_ops = {
 
 static int event_trigger_regex_open(struct inode *inode, struct file *file)
 {
-	int ret = 0;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	mutex_lock(&event_mutex);
 
@@ -292,6 +297,7 @@ event_trigger_write(struct file *filp, const char __user *ubuf,
 static int
 event_trigger_open(struct inode *inode, struct file *filp)
 {
+	/* Checks for tracefs lockdown */
 	return event_trigger_regex_open(inode, filp);
 }
 
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 324ffbea3556..1552a95c743b 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -7,11 +7,11 @@
  */
 #define pr_fmt(fmt)	"trace_kprobe: " fmt
 
+#include <linux/security.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
 #include <linux/rculist.h>
 #include <linux/error-injection.h>
-#include <linux/security.h>
 
 #include <asm/setup.h>  /* for COMMAND_LINE_SIZE */
 
@@ -936,6 +936,10 @@ static int probes_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(&trace_kprobe_ops);
 		if (ret < 0)
@@ -988,6 +992,12 @@ static const struct seq_operations profile_seq_op = {
 
 static int profile_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &profile_seq_op);
 }
 
diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c
index c3fd849d4a8f..d4e31e969206 100644
--- a/kernel/trace/trace_printk.c
+++ b/kernel/trace/trace_printk.c
@@ -6,6 +6,7 @@
  *
  */
 #include <linux/seq_file.h>
+#include <linux/security.h>
 #include <linux/uaccess.h>
 #include <linux/kernel.h>
 #include <linux/ftrace.h>
@@ -348,6 +349,12 @@ static const struct seq_operations show_format_seq_ops = {
 static int
 ftrace_formats_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &show_format_seq_ops);
 }
 
diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index ec9a34a97129..4df9a209f7ca 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -5,6 +5,7 @@
  */
 #include <linux/sched/task_stack.h>
 #include <linux/stacktrace.h>
+#include <linux/security.h>
 #include <linux/kallsyms.h>
 #include <linux/seq_file.h>
 #include <linux/spinlock.h>
@@ -470,6 +471,12 @@ static const struct seq_operations stack_trace_seq_ops = {
 
 static int stack_trace_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &stack_trace_seq_ops);
 }
 
@@ -487,6 +494,7 @@ stack_trace_filter_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_ops *ops = inode->i_private;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_regex_open(ops, FTRACE_ITER_FILTER,
 				 inode, file);
 }
diff --git a/kernel/trace/trace_stat.c b/kernel/trace/trace_stat.c
index 75bf1bcb4a8a..9ab0a1a7ad5e 100644
--- a/kernel/trace/trace_stat.c
+++ b/kernel/trace/trace_stat.c
@@ -9,7 +9,7 @@
  *
  */
 
-
+#include <linux/security.h>
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/rbtree.h>
@@ -238,6 +238,10 @@ static int tracing_stat_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	struct stat_session *session = inode->i_private;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	ret = stat_seq_init(session);
 	if (ret)
 		return ret;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index dd884341f5c5..352073d36585 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -7,6 +7,7 @@
  */
 #define pr_fmt(fmt)	"trace_uprobe: " fmt
 
+#include <linux/security.h>
 #include <linux/ctype.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
@@ -769,6 +770,10 @@ static int probes_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(&trace_uprobe_ops);
 		if (ret)
@@ -818,6 +823,12 @@ static const struct seq_operations profile_seq_op = {
 
 static int profile_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &profile_seq_op);
 }
 
-- 
2.23.0

^ permalink raw reply related

* [PATCH 7/7 v2] tracing: Do not create tracefs files if tracefs lockdown is in effect
From: Steven Rostedt @ 2019-10-12  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Matthew Garrett,
	James Morris James Morris, LSM List, Linux API, Ben Hutchings,
	Al Viro
In-Reply-To: <20191012005747.210722465@goodmis.org>

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

If on boot up, lockdown is activated for tracefs, don't even bother creating
the files. This can also prevent instances from being created if lockdown is
in effect.

Link: http://lkml.kernel.org/r/CAHk-=whC6Ji=fWnjh2+eS4b15TnbsS4VPVtvBOwCy1jjEG_JHQ@mail.gmail.com

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 fs/tracefs/inode.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index eeeae0475da9..0caa151cae4e 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -16,6 +16,7 @@
 #include <linux/namei.h>
 #include <linux/tracefs.h>
 #include <linux/fsnotify.h>
+#include <linux/security.h>
 #include <linux/seq_file.h>
 #include <linux/parser.h>
 #include <linux/magic.h>
@@ -390,6 +391,9 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	struct dentry *dentry;
 	struct inode *inode;
 
+	if (security_locked_down(LOCKDOWN_TRACEFS))
+		return NULL;
+
 	if (!(mode & S_IFMT))
 		mode |= S_IFREG;
 	BUG_ON(!S_ISREG(mode));
-- 
2.23.0

^ permalink raw reply related

* Re: [PATCH 4/7 v2] tracing: Have trace events system open call tracing_open_generic_tr()
From: Steven Rostedt @ 2019-10-12  2:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Matthew Garrett,
	James Morris James Morris, LSM List, Linux API, Ben Hutchings,
	Al Viro
In-Reply-To: <20191012005921.091872328@goodmis.org>

On Fri, 11 Oct 2019 20:57:51 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -1391,9 +1391,6 @@ static int subsystem_open(struct inode *inode, struct file *filp)
>  	struct trace_array *tr;
>  	int ret;
>  
> -	if (tracing_is_disabled())
> -		return -ENODEV;
> -
>  	/* Make sure the system still exists */
>  	mutex_lock(&event_mutex);
>  	mutex_lock(&trace_types_lock);
> @@ -1420,16 +1417,9 @@ static int subsystem_open(struct inode *inode, struct file *filp)
>  	WARN_ON(!dir);
>  
>  	/* Still need to increment the ref count of the system */
> -	if (trace_array_get(tr) < 0) {
> -		put_system(dir);
> -		return -ENODEV;
> -	}
> -
> -	ret = tracing_open_generic(inode, filp);
> -	if (ret < 0) {
> -		trace_array_put(tr);
> +	ret = tracing_open_generic_tr(inode, filp);
> +	if (ret < 0)
>  		put_system(dir);
> -	}
>  
>  	return ret;
>  }

I got a bit too aggressive on this patch. The subsystem_open() gets the
tr from a search, not from the inode. Thus it can't use the
tracing_open_generic_tr() call. It needs to do the trace_array_get()
directly.

V2 of this patch:

-- Steve

>From b69083301044f587afefd5a4ac754a8c43ba0f0d Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Date: Fri, 11 Oct 2019 19:12:21 -0400
Subject: [PATCH] tracing: Have trace events system open call
 tracing_open_generic_tr()

Instead of having the trace events system open calls open code the taking of
the trace_array descriptor (with trace_array_get()) and then calling
trace_open_generic(), have it use the tracing_open_generic_tr() that does
the combination of the two. This requires making tracing_open_generic_tr()
global.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/trace.c        |  2 +-
 kernel/trace/trace.h        |  1 +
 kernel/trace/trace_events.c | 17 +++--------------
 3 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index fa7d813b04c6..94f1b9124939 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4156,7 +4156,7 @@ bool tracing_is_disabled(void)
  * Open and update trace_array ref count.
  * Must have the current trace_array passed to it.
  */
-static int tracing_open_generic_tr(struct inode *inode, struct file *filp)
+int tracing_open_generic_tr(struct inode *inode, struct file *filp)
 {
 	struct trace_array *tr = inode->i_private;
 
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index f801d154ff6a..854dbf4050f8 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -681,6 +681,7 @@ void tracing_reset_online_cpus(struct trace_buffer *buf);
 void tracing_reset_current(int cpu);
 void tracing_reset_all_online_cpus(void);
 int tracing_open_generic(struct inode *inode, struct file *filp);
+int tracing_open_generic_tr(struct inode *inode, struct file *filp);
 bool tracing_is_disabled(void);
 bool tracer_tracing_is_on(struct trace_array *tr);
 void tracer_tracing_on(struct trace_array *tr);
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index b89cdfe20bc1..9613a757c902 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1440,28 +1440,17 @@ static int system_tr_open(struct inode *inode, struct file *filp)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (tracing_is_disabled())
-		return -ENODEV;
-
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
-
 	/* Make a temporary dir that has no system but points to tr */
 	dir = kzalloc(sizeof(*dir), GFP_KERNEL);
-	if (!dir) {
-		trace_array_put(tr);
+	if (!dir)
 		return -ENOMEM;
-	}
-
-	dir->tr = tr;
 
-	ret = tracing_open_generic(inode, filp);
+	ret = tracing_open_generic_tr(inode, filp);
 	if (ret < 0) {
-		trace_array_put(tr);
 		kfree(dir);
 		return ret;
 	}
-
+	dir->tr = tr;
 	filp->private_data = dir;
 
 	return 0;
-- 
2.20.1

^ permalink raw reply related

* Re: [PATCH v14 2/6] namei: LOOKUP_IN_ROOT: chroot-like path resolution
From: Aleksa Sarai @ 2019-10-12  4:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner
In-Reply-To: <CAHk-=wh8L50f31vW8BwRUXhLiq3eoCQ3tg8ER4Yp2dzuU1w5rQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3738 bytes --]

On 2019-10-10, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, Oct 9, 2019 at 10:42 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2277,6 +2277,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> >
> >         nd->m_seq = read_seqbegin(&mount_lock);
> >
> > +       /* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
> > +       if (flags & LOOKUP_IN_ROOT)
> > +               while (*s == '/')
> > +                       s++;
> > +
> >         /* Figure out the starting path and root (if needed). */
> >         if (*s == '/') {
> >                 error = nd_jump_root(nd);
> 
> Hmm. Wouldn't this make more sense all inside the if (*s =- '/') test?
> That way if would be where we check for "should we start at the root",
> which seems to make more sense conceptually.

I don't really agree (though I do think that both options are pretty
ugly). Doing it before the block makes it clear that absolute paths are
just treated relative-to-dirfd -- doing it inside the block makes it
look more like "/" is a special-case for nd_jump_root(). And while that
is somewhat true, this is just a side-effect of making the code more
clean -- my earlier versions reworked the dirfd handling to always grab
nd->root first if LOOKUP_IS_SCOPED. I switched to this method based on
Al's review.

In fairness, I do agree that the lonely while loop looks ugly.

> That test for '/' currently has a "} else if (..)", but that's
> pointless since it ends with a "return" anyway. So the "else" logic is
> just noise.

This depends on the fact that LOOKUP_BENEATH always triggers -EXDEV for
nd_jump_root() -- if we ever add another "scoped lookup" flag then the
logic will have to be further reworked.

(It should be noted that the new version doesn't always end with a
"return", but you could change it to act that way given the above
assumption.)

> And if you get rid of the unnecessary else, moving the LOOKUP_IN_ROOT
> inside the if-statement works fine.
> 
> So this could be something like
> 
>     --- a/fs/namei.c
>     +++ b/fs/namei.c
>     @@ -2194,11 +2196,19 @@ static const char *path_init(struct
> nameidata *nd, unsigned flags)
> 
>         nd->m_seq = read_seqbegin(&mount_lock);
>         if (*s == '/') {
>     -           set_root(nd);
>     -           if (likely(!nd_jump_root(nd)))
>     -                   return s;
>     -           return ERR_PTR(-ECHILD);
>     -   } else if (nd->dfd == AT_FDCWD) {
>     +           /* LOOKUP_IN_ROOT treats absolute paths as being
> relative-to-dirfd. */
>     +           if (!(flags & LOOKUP_IN_ROOT)) {
>     +                   set_root(nd);
>     +                   if (likely(!nd_jump_root(nd)))
>     +                           return s;
>     +                   return ERR_PTR(-ECHILD);
>     +           }
>     +
>     +           /* Skip initial '/' for LOOKUP_IN_ROOT */
>     +           do { s++; } while (*s == '/');
>     +   }
>     +
>     +   if (nd->dfd == AT_FDCWD) {
>                 if (flags & LOOKUP_RCU) {
>                         struct fs_struct *fs = current->fs;
>                         unsigned seq;
> 
> instead. The patch ends up slightly bigger (due to the re-indentation)
> but now it handles all the "start at root" in the same place. Doesn't
> that make sense?

It is correct (though I'd need to clean it up a bit to handle
nd_jump_root() correctly), and if you really would like me to change it
I will -- but I just don't agree that it's cleaner.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v14 2/6] namei: LOOKUP_IN_ROOT: chroot-like path resolution
From: Aleksa Sarai @ 2019-10-12  4:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner
In-Reply-To: <20191012040815.gnc43cfmo5mnv67u@yavin.dot.cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 4304 bytes --]

On 2019-10-12, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-10-10, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Wed, Oct 9, 2019 at 10:42 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > >
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -2277,6 +2277,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >
> > >         nd->m_seq = read_seqbegin(&mount_lock);
> > >
> > > +       /* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
> > > +       if (flags & LOOKUP_IN_ROOT)
> > > +               while (*s == '/')
> > > +                       s++;
> > > +
> > >         /* Figure out the starting path and root (if needed). */
> > >         if (*s == '/') {
> > >                 error = nd_jump_root(nd);
> > 
> > Hmm. Wouldn't this make more sense all inside the if (*s =- '/') test?
> > That way if would be where we check for "should we start at the root",
> > which seems to make more sense conceptually.
> 
> I don't really agree (though I do think that both options are pretty
> ugly). Doing it before the block makes it clear that absolute paths are
> just treated relative-to-dirfd -- doing it inside the block makes it
> look more like "/" is a special-case for nd_jump_root(). And while that

Sorry, I meant "special-case for LOOKUP_IN_ROOT".

> is somewhat true, this is just a side-effect of making the code more
> clean -- my earlier versions reworked the dirfd handling to always grab
> nd->root first if LOOKUP_IS_SCOPED. I switched to this method based on
> Al's review.
> 
> In fairness, I do agree that the lonely while loop looks ugly.

And with the old way I did it (where we grabbed nd->root first) the
semantics were slightly more clear -- stripping leading "/"s doesn't
really look as "clearly obvious" as grabbing nd->root beforehand and
treating "/"s normally. But the code was also needlessly more complex.

> > That test for '/' currently has a "} else if (..)", but that's
> > pointless since it ends with a "return" anyway. So the "else" logic is
> > just noise.
> 
> This depends on the fact that LOOKUP_BENEATH always triggers -EXDEV for
> nd_jump_root() -- if we ever add another "scoped lookup" flag then the
> logic will have to be further reworked.
> 
> (It should be noted that the new version doesn't always end with a
> "return", but you could change it to act that way given the above
> assumption.)
> 
> > And if you get rid of the unnecessary else, moving the LOOKUP_IN_ROOT
> > inside the if-statement works fine.
> > 
> > So this could be something like
> > 
> >     --- a/fs/namei.c
> >     +++ b/fs/namei.c
> >     @@ -2194,11 +2196,19 @@ static const char *path_init(struct
> > nameidata *nd, unsigned flags)
> > 
> >         nd->m_seq = read_seqbegin(&mount_lock);
> >         if (*s == '/') {
> >     -           set_root(nd);
> >     -           if (likely(!nd_jump_root(nd)))
> >     -                   return s;
> >     -           return ERR_PTR(-ECHILD);
> >     -   } else if (nd->dfd == AT_FDCWD) {
> >     +           /* LOOKUP_IN_ROOT treats absolute paths as being
> > relative-to-dirfd. */
> >     +           if (!(flags & LOOKUP_IN_ROOT)) {
> >     +                   set_root(nd);
> >     +                   if (likely(!nd_jump_root(nd)))
> >     +                           return s;
> >     +                   return ERR_PTR(-ECHILD);
> >     +           }
> >     +
> >     +           /* Skip initial '/' for LOOKUP_IN_ROOT */
> >     +           do { s++; } while (*s == '/');
> >     +   }
> >     +
> >     +   if (nd->dfd == AT_FDCWD) {
> >                 if (flags & LOOKUP_RCU) {
> >                         struct fs_struct *fs = current->fs;
> >                         unsigned seq;
> > 
> > instead. The patch ends up slightly bigger (due to the re-indentation)
> > but now it handles all the "start at root" in the same place. Doesn't
> > that make sense?
> 
> It is correct (though I'd need to clean it up a bit to handle
> nd_jump_root() correctly), and if you really would like me to change it
> I will -- but I just don't agree that it's cleaner.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH] usercopy: Avoid soft lockups in test_check_nonzero_user()
From: Michael Ellerman @ 2019-10-12  9:54 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: mingo, peterz, alexander.shishkin, jolsa, namhyung, christian,
	keescook, linux, viro, torvalds, linux-api, linux-kernel
In-Reply-To: <20191011034810.xkmz3e4l5ezxvq57@yavin.dot.cyphar.com>

Aleksa Sarai <cyphar@cyphar.com> writes:
> On 2019-10-11, Michael Ellerman <mpe@ellerman.id.au> wrote:
>> On a machine with a 64K PAGE_SIZE, the nested for loops in
>> test_check_nonzero_user() can lead to soft lockups, eg:
...
>> diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
>> index 950ee88cd6ac..9fb6bc609d4c 100644
>> --- a/lib/test_user_copy.c
>> +++ b/lib/test_user_copy.c
>> @@ -47,9 +47,26 @@ static bool is_zeroed(void *from, size_t size)
>>  static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size)
>>  {
>>  	int ret = 0;
>> -	size_t start, end, i;
>> -	size_t zero_start = size / 4;
>> -	size_t zero_end = size - zero_start;
>> +	size_t start, end, i, zero_start, zero_end;
>> +
>> +	if (test(size < 1024, "buffer too small"))
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * We want to cross a page boundary to exercise the code more
>> +	 * effectively. We assume the buffer we're passed has a page boundary at
>> +	 * size / 2. We also don't want to make the size we scan too large,
>> +	 * otherwise the test can take a long time and cause soft lockups. So
>> +	 * scan a 1024 byte region across the page boundary.
>> +	 */
>> +	start = size / 2 - 512;
>> +	size = 1024;
>
> I don't think it's necessary to do "size / 2" here -- you can just use
> PAGE_SIZE directly and check above that "size == 2*PAGE_SIZE" (not that
> this check is exceptionally necessary -- since there's only one caller
> of this function and it's in the same file).

OK, like this?

diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
index 950ee88cd6ac..48bc669b2549 100644
--- a/lib/test_user_copy.c
+++ b/lib/test_user_copy.c
@@ -47,9 +47,25 @@ static bool is_zeroed(void *from, size_t size)
 static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size)
 {
 	int ret = 0;
-	size_t start, end, i;
-	size_t zero_start = size / 4;
-	size_t zero_end = size - zero_start;
+	size_t start, end, i, zero_start, zero_end;
+
+	if (test(size < 2 * PAGE_SIZE, "buffer too small"))
+		return -EINVAL;
+
+	/*
+	 * We want to cross a page boundary to exercise the code more
+	 * effectively. We also don't want to make the size we scan too large,
+	 * otherwise the test can take a long time and cause soft lockups. So
+	 * scan a 1024 byte region across the page boundary.
+	 */
+	size = 1024;
+	start = PAGE_SIZE - (size / 2);
+
+	kmem += start;
+	umem += start;
+
+	zero_start = size / 4;
+	zero_end = size - zero_start;
 
 	/*
 	 * We conduct a series of check_nonzero_user() tests on a block of memory

cheers

^ permalink raw reply related

* Re: [PATCH] usercopy: Avoid soft lockups in test_check_nonzero_user()
From: Aleksa Sarai @ 2019-10-12 10:12 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: mingo, peterz, alexander.shishkin, jolsa, namhyung, christian,
	keescook, linux, viro, torvalds, linux-api, linux-kernel
In-Reply-To: <87tv8euw44.fsf@mpe.ellerman.id.au>

[-- Attachment #1: Type: text/plain, Size: 2988 bytes --]

On 2019-10-12, Michael Ellerman <mpe@ellerman.id.au> wrote:
> Aleksa Sarai <cyphar@cyphar.com> writes:
> > On 2019-10-11, Michael Ellerman <mpe@ellerman.id.au> wrote:
> >> On a machine with a 64K PAGE_SIZE, the nested for loops in
> >> test_check_nonzero_user() can lead to soft lockups, eg:
> ...
> >> diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
> >> index 950ee88cd6ac..9fb6bc609d4c 100644
> >> --- a/lib/test_user_copy.c
> >> +++ b/lib/test_user_copy.c
> >> @@ -47,9 +47,26 @@ static bool is_zeroed(void *from, size_t size)
> >>  static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size)
> >>  {
> >>  	int ret = 0;
> >> -	size_t start, end, i;
> >> -	size_t zero_start = size / 4;
> >> -	size_t zero_end = size - zero_start;
> >> +	size_t start, end, i, zero_start, zero_end;
> >> +
> >> +	if (test(size < 1024, "buffer too small"))
> >> +		return -EINVAL;
> >> +
> >> +	/*
> >> +	 * We want to cross a page boundary to exercise the code more
> >> +	 * effectively. We assume the buffer we're passed has a page boundary at
> >> +	 * size / 2. We also don't want to make the size we scan too large,
> >> +	 * otherwise the test can take a long time and cause soft lockups. So
> >> +	 * scan a 1024 byte region across the page boundary.
> >> +	 */
> >> +	start = size / 2 - 512;
> >> +	size = 1024;
> >
> > I don't think it's necessary to do "size / 2" here -- you can just use
> > PAGE_SIZE directly and check above that "size == 2*PAGE_SIZE" (not that
> > this check is exceptionally necessary -- since there's only one caller
> > of this function and it's in the same file).
> 
> OK, like this?

Yup -- that looks good. I'll give it a Reviewed-by once you resend it.

> diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
> index 950ee88cd6ac..48bc669b2549 100644
> --- a/lib/test_user_copy.c
> +++ b/lib/test_user_copy.c
> @@ -47,9 +47,25 @@ static bool is_zeroed(void *from, size_t size)
>  static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size)
>  {
>  	int ret = 0;
> -	size_t start, end, i;
> -	size_t zero_start = size / 4;
> -	size_t zero_end = size - zero_start;
> +	size_t start, end, i, zero_start, zero_end;
> +
> +	if (test(size < 2 * PAGE_SIZE, "buffer too small"))
> +		return -EINVAL;
> +
> +	/*
> +	 * We want to cross a page boundary to exercise the code more
> +	 * effectively. We also don't want to make the size we scan too large,
> +	 * otherwise the test can take a long time and cause soft lockups. So
> +	 * scan a 1024 byte region across the page boundary.
> +	 */
> +	size = 1024;
> +	start = PAGE_SIZE - (size / 2);
> +
> +	kmem += start;
> +	umem += start;
> +
> +	zero_start = size / 4;
> +	zero_end = size - zero_start;
>  
>  	/*
>  	 * We conduct a series of check_nonzero_user() tests on a block of memory


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* [PATCH] pidfd: add NSpid entries to fdinfo
From: Christian Brauner @ 2019-10-12 10:19 UTC (permalink / raw)
  To: jannh
  Cc: aarcange, akpm, christian.brauner, christian, ckellner, cyphar,
	elena.reshetova, guro, ldv, linux-api, linux-kernel, mhocko,
	mingo, peterz, tglx, viro
In-Reply-To: <CAG48ez1hk9d-qAPcRy9QOgNuO8u3Y_hu_3=GZoFYLY+oMdo8xg@mail.gmail.com>

Currently, the fdinfo file of contains the field Pid:
It contains the pid a given pidfd refers to in the pid namespace of the
opener's procfs instance.
If the pid namespace of the process is not a descendant of the pid
namespace of the procfs instance 0 will be shown as its pid. This is
similar to calling getppid() on a process who's parent is out of it's
pid namespace (e.g. when moving a process into a sibling pid namespace
via setns()).

Add an NSpid field for easy retrieval of the pid in all descendant pid
namespaces:
If pid namespaces are supported this field will contain the pid a given
pidfd refers to for all descendant pid namespaces starting from the
current pid namespace of the opener's procfs instance, i.e. the first
pid entry for Pid and NSpid will be identical.
If the pid namespace of the process is not a descendant of the pid
namespace of the procfs instance 0 will be shown as its first NSpid and
no other NSpid entries will be shown.
Note that this differs from the Pid and NSpid fields in
/proc/<pid>/status where Pid and NSpid are always shown relative to the
pid namespace of the opener's procfs instace. The difference becomes
obvious when sending around a pidfd between pid namespaces from
different trees, i.e. where no ancestoral relation is present between
the pid namespaces:
1. sending around pidfd:
- create two new pid namespaces ns1 and ns2 in the initial pid namespace
  (Also take care to create new mount namespaces in the new pid
  namespace and mount procfs.)
- create a process with a pidfd in ns1
- send pidfd from ns1 to ns2
- read /proc/self/fdinfo/<pidfd> and observe that Pid and NSpid entry
  are 0
- create a process with a pidfd in
- open a pidfd for a process in the initial pid namespace
2. sending around /proc/<pid>/status fd:
- create two new pid namespaces ns1 and ns2 in the initial pid namespace
  (Also take care to create new mount namespaces in the new pid
  namespace and mount procfs.)
- create a process in ns1
- open /proc/<pid>/status in the initial pid namespace for the process
  you created in ns1
- send statusfd from initial pid namespace to ns2
- read statusfd and observe:
  - that Pid will contain the pid of the process as seen from the init
  - that NSpid will contain the pids of the process for all descendant
    pid namespaces starting from the initial pid namespace

Cc: Jann Horn <jannh@google.com>
Cc: linux-api@vger.kernel.org
Co-Developed-by: Christian Kellner <christian@kellner.me>
Signed-off-by: Christian Kellner <christian@kellner.me>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/fork.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 1f6c45f6a734..b155bad92d9c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1695,12 +1695,83 @@ static int pidfd_release(struct inode *inode, struct file *file)
 }
 
 #ifdef CONFIG_PROC_FS
+/**
+ * pidfd_show_fdinfo - print information about a pidfd
+ * @m: proc fdinfo file
+ * @f: file referencing a pidfd
+ *
+ * Pid:
+ * This function will print the pid a given pidfd refers to in the pid
+ * namespace of the opener's procfs instance.
+ * If the pid namespace of the process is not a descendant of the pid
+ * namespace of the procfs instance 0 will be shown as its pid. This is
+ * similar to calling getppid() on a process who's parent is out of it's
+ * pid namespace (e.g. when moving a process into a sibling pid namespace
+ * via setns()).
+ *
+ * NSpid:
+ * If pid namespaces are supported then this function will also print the
+ * pid a given pidfd refers to for all descendant pid namespaces starting
+ * from the current pid namespace of the opener's procfs instance, i.e. the
+ * first pid entry for Pid and NSpid will be identical.
+ * If the pid namespace of the process is not a descendant of the pid
+ * namespace of the procfs instance 0 will be shown as its first NSpid and
+ * no other NSpid entries will be shown.
+ * Note that this differs from the Pid and NSpid fields in
+ * /proc/<pid>/status where Pid and NSpid are always shown relative to the
+ * pid namespace of the opener's procfs instace. The difference becomes
+ * obvious when sending around a pidfd between pid namespaces from
+ * different trees, i.e. where no ancestoral relation is present between
+ * the pid namespaces:
+ * 1. sending around pidfd:
+ * - create two new pid namespaces ns1 and ns2 in the initial pid namespace
+ *   (Also take care to create new mount namespaces in the new pid
+ *   namespace and mount procfs.)
+ * - create a process with a pidfd in ns1
+ * - send pidfd from ns1 to ns2
+ * - read /proc/self/fdinfo/<pidfd> and observe that Pid and NSpid entry
+ *   are 0
+ * - create a process with a pidfd in
+ * - open a pidfd for a process in the initial pid namespace
+ * 2. sending around /proc/<pid>/status fd:
+ * - create two new pid namespaces ns1 and ns2 in the initial pid namespace
+ *   (Also take care to create new mount namespaces in the new pid
+ *   namespace and mount procfs.)
+ * - create a process in ns1
+ * - open /proc/<pid>/status in the initial pid namespace for the process
+ *   you created in ns1
+ * - send statusfd from initial pid namespace to ns2
+ * - read statusfd and observe:
+ *   - that Pid will contain the pid of the process as seen from the init
+ *   - that NSpid will contain the pids of the process for all descendant
+ *     pid namespaces starting from the initial pid namespace
+ */
 static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
 	struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
 	struct pid *pid = f->private_data;
+	pid_t nr = pid_nr_ns(pid, ns);
+
+	seq_put_decimal_ull(m, "Pid:\t", nr);
+
+#ifdef CONFIG_PID_NS
+	seq_puts(m, "\nNSpid:");
+	if (nr == 0) {
+		/*
+		 * If nr is zero the pid namespace of the procfs and the
+		 * pid namespace of the pidfd are neither the same pid
+		 * namespace nor are they ancestors. Since NSpid and Pid
+		 * are always identical in their first entry shortcut it
+		 * and simply print 0.
+		 */
+		seq_put_decimal_ull(m, "\t", nr);
+	} else {
+		int i;
+		for (i = ns->level; i <= pid->level; i++)
+			seq_put_decimal_ull(m, "\t", pid_nr_ns(pid, pid->numbers[i].ns));
+	}
+#endif
 
-	seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
 	seq_putc(m, '\n');
 }
 #endif
-- 
2.23.0

^ permalink raw reply related

* Re: [PATCH] pidfd: add NSpid entries to fdinfo
From: Christian Brauner @ 2019-10-12 10:21 UTC (permalink / raw)
  To: jannh
  Cc: aarcange, akpm, christian, ckellner, cyphar, elena.reshetova,
	guro, ldv, linux-api, linux-kernel, mhocko, mingo, peterz, tglx,
	viro
In-Reply-To: <20191012101922.24168-1-christian.brauner@ubuntu.com>

On Sat, Oct 12, 2019 at 12:19:22PM +0200, Christian Brauner wrote:
> Currently, the fdinfo file of contains the field Pid:
> It contains the pid a given pidfd refers to in the pid namespace of the
> opener's procfs instance.
> If the pid namespace of the process is not a descendant of the pid
> namespace of the procfs instance 0 will be shown as its pid. This is
> similar to calling getppid() on a process who's parent is out of it's
> pid namespace (e.g. when moving a process into a sibling pid namespace
> via setns()).
> 
> Add an NSpid field for easy retrieval of the pid in all descendant pid
> namespaces:
> If pid namespaces are supported this field will contain the pid a given
> pidfd refers to for all descendant pid namespaces starting from the
> current pid namespace of the opener's procfs instance, i.e. the first
> pid entry for Pid and NSpid will be identical.
> If the pid namespace of the process is not a descendant of the pid
> namespace of the procfs instance 0 will be shown as its first NSpid and
> no other NSpid entries will be shown.
> Note that this differs from the Pid and NSpid fields in
> /proc/<pid>/status where Pid and NSpid are always shown relative to the
> pid namespace of the opener's procfs instace. The difference becomes
> obvious when sending around a pidfd between pid namespaces from
> different trees, i.e. where no ancestoral relation is present between
> the pid namespaces:
> 1. sending around pidfd:
> - create two new pid namespaces ns1 and ns2 in the initial pid namespace
>   (Also take care to create new mount namespaces in the new pid
>   namespace and mount procfs.)
> - create a process with a pidfd in ns1
> - send pidfd from ns1 to ns2
> - read /proc/self/fdinfo/<pidfd> and observe that Pid and NSpid entry
>   are 0
> - create a process with a pidfd in
> - open a pidfd for a process in the initial pid namespace
> 2. sending around /proc/<pid>/status fd:
> - create two new pid namespaces ns1 and ns2 in the initial pid namespace
>   (Also take care to create new mount namespaces in the new pid
>   namespace and mount procfs.)
> - create a process in ns1
> - open /proc/<pid>/status in the initial pid namespace for the process
>   you created in ns1
> - send statusfd from initial pid namespace to ns2
> - read statusfd and observe:
>   - that Pid will contain the pid of the process as seen from the init
>   - that NSpid will contain the pids of the process for all descendant
>     pid namespaces starting from the initial pid namespace
> 
> Cc: Jann Horn <jannh@google.com>
> Cc: linux-api@vger.kernel.org
> Co-Developed-by: Christian Kellner <christian@kellner.me>
> Signed-off-by: Christian Kellner <christian@kellner.me>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

I think this might be more what we want.
I tried to think of cases where the first entry of Pid is not identical
to the first entry of NSpid but I came up with none. Maybe you do, Jann?

Christian, this is just a quick stab I took. Feel free to pick this up
as a template.

Thanks!
Christian

> ---
>  kernel/fork.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 72 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 1f6c45f6a734..b155bad92d9c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1695,12 +1695,83 @@ static int pidfd_release(struct inode *inode, struct file *file)
>  }
>  
>  #ifdef CONFIG_PROC_FS
> +/**
> + * pidfd_show_fdinfo - print information about a pidfd
> + * @m: proc fdinfo file
> + * @f: file referencing a pidfd
> + *
> + * Pid:
> + * This function will print the pid a given pidfd refers to in the pid
> + * namespace of the opener's procfs instance.
> + * If the pid namespace of the process is not a descendant of the pid
> + * namespace of the procfs instance 0 will be shown as its pid. This is
> + * similar to calling getppid() on a process who's parent is out of it's
> + * pid namespace (e.g. when moving a process into a sibling pid namespace
> + * via setns()).
> + *
> + * NSpid:
> + * If pid namespaces are supported then this function will also print the
> + * pid a given pidfd refers to for all descendant pid namespaces starting
> + * from the current pid namespace of the opener's procfs instance, i.e. the
> + * first pid entry for Pid and NSpid will be identical.
> + * If the pid namespace of the process is not a descendant of the pid
> + * namespace of the procfs instance 0 will be shown as its first NSpid and
> + * no other NSpid entries will be shown.
> + * Note that this differs from the Pid and NSpid fields in
> + * /proc/<pid>/status where Pid and NSpid are always shown relative to the
> + * pid namespace of the opener's procfs instace. The difference becomes
> + * obvious when sending around a pidfd between pid namespaces from
> + * different trees, i.e. where no ancestoral relation is present between
> + * the pid namespaces:
> + * 1. sending around pidfd:
> + * - create two new pid namespaces ns1 and ns2 in the initial pid namespace
> + *   (Also take care to create new mount namespaces in the new pid
> + *   namespace and mount procfs.)
> + * - create a process with a pidfd in ns1
> + * - send pidfd from ns1 to ns2
> + * - read /proc/self/fdinfo/<pidfd> and observe that Pid and NSpid entry
> + *   are 0
> + * - create a process with a pidfd in
> + * - open a pidfd for a process in the initial pid namespace
> + * 2. sending around /proc/<pid>/status fd:
> + * - create two new pid namespaces ns1 and ns2 in the initial pid namespace
> + *   (Also take care to create new mount namespaces in the new pid
> + *   namespace and mount procfs.)
> + * - create a process in ns1
> + * - open /proc/<pid>/status in the initial pid namespace for the process
> + *   you created in ns1
> + * - send statusfd from initial pid namespace to ns2
> + * - read statusfd and observe:
> + *   - that Pid will contain the pid of the process as seen from the init
> + *   - that NSpid will contain the pids of the process for all descendant
> + *     pid namespaces starting from the initial pid namespace
> + */
>  static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
>  {
>  	struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
>  	struct pid *pid = f->private_data;
> +	pid_t nr = pid_nr_ns(pid, ns);
> +
> +	seq_put_decimal_ull(m, "Pid:\t", nr);
> +
> +#ifdef CONFIG_PID_NS
> +	seq_puts(m, "\nNSpid:");
> +	if (nr == 0) {
> +		/*
> +		 * If nr is zero the pid namespace of the procfs and the
> +		 * pid namespace of the pidfd are neither the same pid
> +		 * namespace nor are they ancestors. Since NSpid and Pid
> +		 * are always identical in their first entry shortcut it
> +		 * and simply print 0.
> +		 */
> +		seq_put_decimal_ull(m, "\t", nr);
> +	} else {
> +		int i;
> +		for (i = ns->level; i <= pid->level; i++)
> +			seq_put_decimal_ull(m, "\t", pid_nr_ns(pid, pid->numbers[i].ns));
> +	}
> +#endif
>  
> -	seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
>  	seq_putc(m, '\n');
>  }
>  #endif
> -- 
> 2.23.0
> 

^ permalink raw reply

* [PATCH 0/7] Harden userfaultfd
From: Daniel Colascione @ 2019-10-12 19:15 UTC (permalink / raw)
  To: linux-api, linux-kernel, lokeshgidra, dancol, nnk; +Cc: nosh, timmurray

Userfaultfd in unprivileged contexts could be potentially very
useful. We'd like to harden userfaultfd to make such unprivileged use
less risky. This patch series allows SELinux to manage userfaultfd
file descriptors (via a new flag, for compatibility with existing
code) and allows administrators to limit userfaultfd to servicing
user-mode faults, increasing the difficulty of using userfaultfd in
exploit chains invoking delaying kernel faults.

A new anon_inodes interface allows callers to opt into SELinux
management of anonymous file objects. In this mode, anon_inodes
creates new ephemeral inodes for anonymous file objects instead of
reusing a singleton dummy inode. A new LSM hook gives security modules
an opportunity to configure and veto these ephemeral inodes.

Existing anon_inodes users must opt into the new functionality.

Daniel Colascione (7):
  Add a new flags-accepting interface for anonymous inodes
  Add a concept of a "secure" anonymous file
  Add a UFFD_SECURE flag to the userfaultfd API.
  Teach SELinux about a new userfaultfd class
  Let userfaultfd opt out of handling kernel-mode faults
  Allow users to require UFFD_SECURE
  Add a new sysctl for limiting userfaultfd to user mode faults

 Documentation/admin-guide/sysctl/vm.rst | 19 +++++-
 fs/anon_inodes.c                        | 89 +++++++++++++++++--------
 fs/userfaultfd.c                        | 47 +++++++++++--
 include/linux/anon_inodes.h             | 27 ++++++--
 include/linux/lsm_hooks.h               |  8 +++
 include/linux/security.h                |  2 +
 include/linux/userfaultfd_k.h           |  3 +
 include/uapi/linux/userfaultfd.h        | 14 ++++
 kernel/sysctl.c                         |  9 +++
 security/security.c                     |  8 +++
 security/selinux/hooks.c                | 68 +++++++++++++++++++
 security/selinux/include/classmap.h     |  2 +
 12 files changed, 256 insertions(+), 40 deletions(-)

-- 
2.23.0.700.g56cf767bdb-goog

^ permalink raw reply

* [PATCH 1/7] Add a new flags-accepting interface for anonymous inodes
From: Daniel Colascione @ 2019-10-12 19:15 UTC (permalink / raw)
  To: linux-api, linux-kernel, lokeshgidra, dancol, nnk; +Cc: nosh, timmurray
In-Reply-To: <20191012191602.45649-1-dancol@google.com>

Add functions forwarding from the old names to the new ones so we
don't need to change any callers.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 fs/anon_inodes.c            | 62 ++++++++++++++++++++++---------------
 include/linux/anon_inodes.h | 27 +++++++++++++---
 2 files changed, 59 insertions(+), 30 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 89714308c25b..caa36019afca 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -56,60 +56,71 @@ static struct file_system_type anon_inode_fs_type = {
 };
 
 /**
- * anon_inode_getfile - creates a new file instance by hooking it up to an
- *                      anonymous inode, and a dentry that describe the "class"
- *                      of the file
+ * anon_inode_getfile2 - creates a new file instance by hooking it up to
+ *                       an anonymous inode, and a dentry that describe
+ *                       the "class" of the file
  *
  * @name:    [in]    name of the "class" of the new file
  * @fops:    [in]    file operations for the new file
  * @priv:    [in]    private data for the new file (will be file's private_data)
- * @flags:   [in]    flags
+ * @flags:   [in]    flags for the file
+ * @anon_inode_flags: [in] flags for anon_inode*
  *
- * Creates a new file by hooking it on a single inode. This is useful for files
+ * Creates a new file by hooking it on an unspecified inode. This is useful for files
  * that do not need to have a full-fledged inode in order to operate correctly.
  * All the files created with anon_inode_getfile() will share a single inode,
  * hence saving memory and avoiding code duplication for the file/inode/dentry
  * setup.  Returns the newly created file* or an error pointer.
+ *
+ * anon_inode_flags must be zero.
  */
-struct file *anon_inode_getfile(const char *name,
-				const struct file_operations *fops,
-				void *priv, int flags)
+struct file *anon_inode_getfile2(const char *name,
+				 const struct file_operations *fops,
+				 void *priv, int flags, int anon_inode_flags)
 {
+	struct inode *inode;
 	struct file *file;
 
-	if (IS_ERR(anon_inode_inode))
-		return ERR_PTR(-ENODEV);
-
-	if (fops->owner && !try_module_get(fops->owner))
-		return ERR_PTR(-ENOENT);
+	if (anon_inode_flags)
+		return ERR_PTR(-EINVAL);
 
+	inode =	anon_inode_inode;
+	if (IS_ERR(inode))
+		return ERR_PTR(-ENODEV);
 	/*
-	 * We know the anon_inode inode count is always greater than zero,
-	 * so ihold() is safe.
+	 * We know the anon_inode inode count is always
+	 * greater than zero, so ihold() is safe.
 	 */
-	ihold(anon_inode_inode);
-	file = alloc_file_pseudo(anon_inode_inode, anon_inode_mnt, name,
+	ihold(inode);
+
+	if (fops->owner && !try_module_get(fops->owner)) {
+		file = ERR_PTR(-ENOENT);
+		goto err;
+	}
+
+	file = alloc_file_pseudo(inode, anon_inode_mnt, name,
 				 flags & (O_ACCMODE | O_NONBLOCK), fops);
 	if (IS_ERR(file))
 		goto err;
 
-	file->f_mapping = anon_inode_inode->i_mapping;
+	file->f_mapping = inode->i_mapping;
 
 	file->private_data = priv;
 
 	return file;
 
 err:
-	iput(anon_inode_inode);
+	iput(inode);
 	module_put(fops->owner);
 	return file;
 }
 EXPORT_SYMBOL_GPL(anon_inode_getfile);
+EXPORT_SYMBOL_GPL(anon_inode_getfile2);
 
 /**
- * anon_inode_getfd - creates a new file instance by hooking it up to an
- *                    anonymous inode, and a dentry that describe the "class"
- *                    of the file
+ * anon_inode_getfd2 - creates a new file instance by hooking it up to an
+ *                     anonymous inode, and a dentry that describe the "class"
+ *                     of the file
  *
  * @name:    [in]    name of the "class" of the new file
  * @fops:    [in]    file operations for the new file
@@ -122,8 +133,8 @@ EXPORT_SYMBOL_GPL(anon_inode_getfile);
  * hence saving memory and avoiding code duplication for the file/inode/dentry
  * setup.  Returns new descriptor or an error code.
  */
-int anon_inode_getfd(const char *name, const struct file_operations *fops,
-		     void *priv, int flags)
+int anon_inode_getfd2(const char *name, const struct file_operations *fops,
+		      void *priv, int flags, int anon_inode_flags)
 {
 	int error, fd;
 	struct file *file;
@@ -133,7 +144,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 		return error;
 	fd = error;
 
-	file = anon_inode_getfile(name, fops, priv, flags);
+	file = anon_inode_getfile2(name, fops, priv, flags, anon_inode_flags);
 	if (IS_ERR(file)) {
 		error = PTR_ERR(file);
 		goto err_put_unused_fd;
@@ -147,6 +158,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	return error;
 }
 EXPORT_SYMBOL_GPL(anon_inode_getfd);
+EXPORT_SYMBOL_GPL(anon_inode_getfd2);
 
 static int __init anon_inode_init(void)
 {
diff --git a/include/linux/anon_inodes.h b/include/linux/anon_inodes.h
index d0d7d96261ad..10699462dcb1 100644
--- a/include/linux/anon_inodes.h
+++ b/include/linux/anon_inodes.h
@@ -11,11 +11,28 @@
 
 struct file_operations;
 
-struct file *anon_inode_getfile(const char *name,
-				const struct file_operations *fops,
-				void *priv, int flags);
-int anon_inode_getfd(const char *name, const struct file_operations *fops,
-		     void *priv, int flags);
+#define ANON_INODE_SECURE 1
+
+struct file *anon_inode_getfile2(const char *name,
+				 const struct file_operations *fops,
+				 void *priv, int flags, int anon_inode_flags);
+int anon_inode_getfd2(const char *name, const struct file_operations *fops,
+		      void *priv, int flags, int anon_inode_flags);
+
+static inline int anon_inode_getfd(const char *name,
+				   const struct file_operations *fops,
+				   void *priv, int flags)
+{
+	return anon_inode_getfd2(name, fops, priv, flags, 0);
+}
+
+static inline struct file *anon_inode_getfile(const char *name,
+					      const struct file_operations *fops,
+					      void *priv, int flags)
+{
+	return anon_inode_getfile2(name, fops, priv, flags, 0);
+}
+
 
 #endif /* _LINUX_ANON_INODES_H */
 
-- 
2.23.0.700.g56cf767bdb-goog

^ permalink raw reply related

* [PATCH 2/7] Add a concept of a "secure" anonymous file
From: Daniel Colascione @ 2019-10-12 19:15 UTC (permalink / raw)
  To: linux-api, linux-kernel, lokeshgidra, dancol, nnk; +Cc: nosh, timmurray
In-Reply-To: <20191012191602.45649-1-dancol@google.com>

A secure anonymous file is one we hooked up to its own inode (as
opposed to the shared inode we use for non-secure anonymous files). A
new selinux hook gives security modules a chance to initialize, label,
and veto the creation of these secure anonymous files. Security
modules had limit ability to interact with non-secure anonymous files
due to all of these files sharing a single inode.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 fs/anon_inodes.c          | 45 ++++++++++++++++++++++++++++++---------
 include/linux/lsm_hooks.h |  8 +++++++
 include/linux/security.h  |  2 ++
 security/security.c       |  8 +++++++
 4 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index caa36019afca..d68d76523ad3 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -55,6 +55,23 @@ static struct file_system_type anon_inode_fs_type = {
 	.kill_sb	= kill_anon_super,
 };
 
+struct inode *anon_inode_make_secure_inode(const char *name,
+					   const struct file_operations *fops)
+{
+	struct inode *inode;
+	int error;
+	inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_PTR(PTR_ERR(inode));
+	inode->i_flags &= ~S_PRIVATE;
+	error =	security_inode_init_security_anon(inode, name, fops);
+	if (error) {
+		iput(inode);
+		return ERR_PTR(error);
+	}
+	return inode;
+}
+
 /**
  * anon_inode_getfile2 - creates a new file instance by hooking it up to
  *                       an anonymous inode, and a dentry that describe
@@ -72,7 +89,9 @@ static struct file_system_type anon_inode_fs_type = {
  * hence saving memory and avoiding code duplication for the file/inode/dentry
  * setup.  Returns the newly created file* or an error pointer.
  *
- * anon_inode_flags must be zero.
+ * If anon_inode_flags contains ANON_INODE_SECURE, create a new inode
+ * and enable security checks for it. Otherwise, attach a new file to
+ * a singleton placeholder inode with security checks disabled.
  */
 struct file *anon_inode_getfile2(const char *name,
 				 const struct file_operations *fops,
@@ -81,17 +100,23 @@ struct file *anon_inode_getfile2(const char *name,
 	struct inode *inode;
 	struct file *file;
 
-	if (anon_inode_flags)
+	if (anon_inode_flags & ~ANON_INODE_SECURE)
 		return ERR_PTR(-EINVAL);
 
-	inode =	anon_inode_inode;
-	if (IS_ERR(inode))
-		return ERR_PTR(-ENODEV);
-	/*
-	 * We know the anon_inode inode count is always
-	 * greater than zero, so ihold() is safe.
-	 */
-	ihold(inode);
+	if (anon_inode_flags & ANON_INODE_SECURE) {
+		inode =	anon_inode_make_secure_inode(name, fops);
+		if (IS_ERR(inode))
+			return ERR_PTR(PTR_ERR(inode));
+	} else {
+		inode =	anon_inode_inode;
+		if (IS_ERR(inode))
+			return ERR_PTR(-ENODEV);
+		/*
+		 * We know the anon_inode inode count is always
+		 * greater than zero, so ihold() is safe.
+		 */
+		ihold(inode);
+	}
 
 	if (fops->owner && !try_module_get(fops->owner)) {
 		file = ERR_PTR(-ENOENT);
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index a3763247547c..3744ce9e9172 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -215,6 +215,10 @@
  *	Returns 0 if @name and @value have been successfully set,
  *	-EOPNOTSUPP if no security attribute is needed, or
  *	-ENOMEM on memory allocation failure.
+ * @inode_init_security_anon:
+ *      Set up a secure anonymous inode.
+ *	Returns 0 on success. Returns -EPERM if	the security module denies
+ *	the creation of this inode.
  * @inode_create:
  *	Check permission to create a regular file.
  *	@dir contains inode structure of the parent of the new file.
@@ -1552,6 +1556,9 @@ union security_list_options {
 					const struct qstr *qstr,
 					const char **name, void **value,
 					size_t *len);
+	int (*inode_init_security_anon)(struct inode *inode,
+					const char *name,
+					const struct file_operations *fops);
 	int (*inode_create)(struct inode *dir, struct dentry *dentry,
 				umode_t mode);
 	int (*inode_link)(struct dentry *old_dentry, struct inode *dir,
@@ -1876,6 +1883,7 @@ struct security_hook_heads {
 	struct hlist_head inode_alloc_security;
 	struct hlist_head inode_free_security;
 	struct hlist_head inode_init_security;
+	struct hlist_head inode_init_security_anon;
 	struct hlist_head inode_create;
 	struct hlist_head inode_link;
 	struct hlist_head inode_unlink;
diff --git a/include/linux/security.h b/include/linux/security.h
index a8d59d612d27..5b6f7e0de577 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -315,6 +315,8 @@ void security_inode_free(struct inode *inode);
 int security_inode_init_security(struct inode *inode, struct inode *dir,
 				 const struct qstr *qstr,
 				 initxattrs initxattrs, void *fs_data);
+int security_inode_init_security_anon(struct inode *inode, const char *name,
+				      const struct file_operations *fops);
 int security_old_inode_init_security(struct inode *inode, struct inode *dir,
 				     const struct qstr *qstr, const char **name,
 				     void **value, size_t *len);
diff --git a/security/security.c b/security/security.c
index 1bc000f834e2..c87695f66413 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1001,6 +1001,14 @@ int security_inode_init_security(struct inode *inode, struct inode *dir,
 }
 EXPORT_SYMBOL(security_inode_init_security);
 
+int
+security_inode_init_security_anon(struct inode *inode,
+				  const char *name,
+				  const struct file_operations *fops)
+{
+	return call_int_hook(inode_init_security_anon, 0, inode, name, fops);
+}
+
 int security_old_inode_init_security(struct inode *inode, struct inode *dir,
 				     const struct qstr *qstr, const char **name,
 				     void **value, size_t *len)
-- 
2.23.0.700.g56cf767bdb-goog

^ permalink raw reply related

* [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Daniel Colascione @ 2019-10-12 19:15 UTC (permalink / raw)
  To: linux-api, linux-kernel, lokeshgidra, dancol, nnk; +Cc: nosh, timmurray
In-Reply-To: <20191012191602.45649-1-dancol@google.com>

The new secure flag makes userfaultfd use a new "secure" anonymous
file object instead of the default one, letting security modules
supervise userfaultfd use.

Requiring that users pass a new flag lets us avoid changing the
semantics for existing callers.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 fs/userfaultfd.c                 | 28 +++++++++++++++++++++++++---
 include/uapi/linux/userfaultfd.h |  8 ++++++++
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f9fd18670e22..29f920fb236e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1022,6 +1022,13 @@ static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
 {
 	int fd;
 
+	/*
+	 * Using a secure-mode UFFD to monitor forks isn't supported
+	 * right now.
+	 */
+	if (new->flags & UFFD_SECURE)
+		return -EOPNOTSUPP;
+
 	fd = anon_inode_getfd("[userfaultfd]", &userfaultfd_fops, new,
 			      O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS));
 	if (fd < 0)
@@ -1841,6 +1848,18 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 		ret = -EINVAL;
 		goto out;
 	}
+	if ((ctx->flags & UFFD_SECURE) &&
+	    (features & UFFD_FEATURE_EVENT_FORK)) {
+		/*
+		 * We don't support UFFD_FEATURE_EVENT_FORK on a
+		 * secure-mode UFFD: doing so would need us to
+		 * construct the new file object in the context of the
+		 * fork child, and it's not worth it right now.
+		 */
+		ret = -EINVAL;
+		goto out;
+	}
+
 	/* report all available features and ioctls to userland */
 	uffdio_api.features = UFFD_API_FEATURES;
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
@@ -1942,6 +1961,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 {
 	struct userfaultfd_ctx *ctx;
 	int fd;
+	static const int uffd_flags = UFFD_SECURE;
 
 	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
 		return -EPERM;
@@ -1951,8 +1971,9 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	/* Check the UFFD_* constants for consistency.  */
 	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
 	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+	BUILD_BUG_ON(UFFD_SHARED_FCNTL_FLAGS & uffd_flags);
 
-	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+	if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | uffd_flags))
 		return -EINVAL;
 
 	ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
@@ -1969,8 +1990,9 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	/* prevent the mm struct to be freed */
 	mmgrab(ctx->mm);
 
-	fd = anon_inode_getfd("[userfaultfd]", &userfaultfd_fops, ctx,
-			      O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+	fd = anon_inode_getfd2("[userfaultfd]", &userfaultfd_fops, ctx,
+			       O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS),
+			       ((flags & UFFD_SECURE) ? ANON_INODE_SECURE : 0));
 	if (fd < 0) {
 		mmdrop(ctx->mm);
 		kmem_cache_free(userfaultfd_ctx_cachep, ctx);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 48f1a7c2f1f0..12d7d40d7f25 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -231,4 +231,12 @@ struct uffdio_zeropage {
 	__s64 zeropage;
 };
 
+/*
+ * Flags for the userfaultfd(2) system call itself.
+ */
+
+/*
+ * Create a userfaultfd with MAC security checks enabled.
+ */
+#define UFFD_SECURE 1
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.23.0.700.g56cf767bdb-goog

^ permalink raw reply related

* [PATCH 4/7] Teach SELinux about a new userfaultfd class
From: Daniel Colascione @ 2019-10-12 19:15 UTC (permalink / raw)
  To: linux-api, linux-kernel, lokeshgidra, dancol, nnk; +Cc: nosh, timmurray
In-Reply-To: <20191012191602.45649-1-dancol@google.com>

Use the secure anonymous inode LSM hook we just added to let SELinux
policy place restrictions on userfaultfd use. The create operation
applies to processes creating new instances of these file objects;
transfer between processes is covered by restrictions on read, write,
and ioctl access already checked inside selinux_file_receive.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 fs/userfaultfd.c                    |  4 +-
 include/linux/userfaultfd_k.h       |  2 +
 security/selinux/hooks.c            | 68 +++++++++++++++++++++++++++++
 security/selinux/include/classmap.h |  2 +
 4 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 29f920fb236e..1123089c3d55 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1014,8 +1014,6 @@ static __poll_t userfaultfd_poll(struct file *file, poll_table *wait)
 	}
 }
 
-static const struct file_operations userfaultfd_fops;
-
 static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
 				  struct userfaultfd_ctx *new,
 				  struct uffd_msg *msg)
@@ -1934,7 +1932,7 @@ static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 }
 #endif
 
-static const struct file_operations userfaultfd_fops = {
+const struct file_operations userfaultfd_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= userfaultfd_show_fdinfo,
 #endif
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ac9d71e24b81..549c8b0cca52 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -30,6 +30,8 @@
 
 extern int sysctl_unprivileged_userfaultfd;
 
+extern const struct file_operations userfaultfd_fops;
+
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 9625b99e677f..0b3a36cbfbdc 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -92,6 +92,10 @@
 #include <linux/fsnotify.h>
 #include <linux/fanotify.h>
 
+#ifdef CONFIG_USERFAULTFD
+#include <linux/userfaultfd_k.h>
+#endif
+
 #include "avc.h"
 #include "objsec.h"
 #include "netif.h"
@@ -2943,6 +2947,69 @@ static int selinux_inode_init_security(struct inode *inode, struct inode *dir,
 	return 0;
 }
 
+static int selinux_inode_init_security_anon(struct inode *inode,
+					    const char *name,
+					    const struct file_operations *fops)
+{
+	const struct task_security_struct *tsec = selinux_cred(current_cred());
+	struct common_audit_data ad;
+	struct inode_security_struct *isec;
+
+	if (unlikely(IS_PRIVATE(inode)))
+		return 0;
+
+	/*
+	 * We shouldn't be creating secure anonymous inodes before LSM
+	 * initialization completes.
+	 */
+	if (unlikely(!selinux_state.initialized))
+		return -EBUSY;
+
+	isec = selinux_inode(inode);
+
+	/*
+	 * We only get here once per ephemeral inode.  The inode has
+	 * been initialized via inode_alloc_security but is otherwise
+	 * untouched, so check that the state is as
+	 * inode_alloc_security left it.
+	 */
+	BUG_ON(isec->initialized != LABEL_INVALID);
+	BUG_ON(isec->sclass != SECCLASS_FILE);
+
+#ifdef CONFIG_USERFAULTFD
+	if (fops == &userfaultfd_fops)
+		isec->sclass = SECCLASS_UFFD;
+#endif
+
+	if (isec->sclass == SECCLASS_FILE) {
+		printk(KERN_WARNING "refusing to create secure anonymous inode "
+		       "of unknown type");
+		return -EOPNOTSUPP;
+	}
+	/*
+	 * Always give secure anonymous inodes the sid of the
+	 * creating task.
+	 */
+
+	isec->sid = tsec->sid;
+	isec->initialized = LABEL_INITIALIZED;
+
+	/*
+	 * Now that we've initialized security, check whether we're
+	 * allowed to actually create this type of anonymous inode.
+	 */
+
+	ad.type = LSM_AUDIT_DATA_INODE;
+	ad.u.inode = inode;
+
+	return avc_has_perm(&selinux_state,
+			    tsec->sid,
+			    isec->sid,
+			    isec->sclass,
+			    FILE__CREATE,
+			    &ad);
+}
+
 static int selinux_inode_create(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
 	return may_create(dir, dentry, SECCLASS_FILE);
@@ -6840,6 +6907,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(inode_alloc_security, selinux_inode_alloc_security),
 	LSM_HOOK_INIT(inode_free_security, selinux_inode_free_security),
 	LSM_HOOK_INIT(inode_init_security, selinux_inode_init_security),
+	LSM_HOOK_INIT(inode_init_security_anon, selinux_inode_init_security_anon),
 	LSM_HOOK_INIT(inode_create, selinux_inode_create),
 	LSM_HOOK_INIT(inode_link, selinux_inode_link),
 	LSM_HOOK_INIT(inode_unlink, selinux_inode_unlink),
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 32e9b03be3dd..41bc5da78048 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -244,6 +244,8 @@ struct security_class_mapping secclass_map[] = {
 	  {"map_create", "map_read", "map_write", "prog_load", "prog_run"} },
 	{ "xdp_socket",
 	  { COMMON_SOCK_PERMS, NULL } },
+	{ "uffd",
+	  { COMMON_FILE_PERMS, NULL } },
 	{ NULL }
   };
 
-- 
2.23.0.700.g56cf767bdb-goog

^ permalink raw reply related

* [PATCH 5/7] Let userfaultfd opt out of handling kernel-mode faults
From: Daniel Colascione @ 2019-10-12 19:16 UTC (permalink / raw)
  To: linux-api, linux-kernel, lokeshgidra, dancol, nnk; +Cc: nosh, timmurray
In-Reply-To: <20191012191602.45649-1-dancol@google.com>

userfaultfd handles page faults from both user and kernel code.  Add a
new UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the
resulting userfaultfd object refuse to handle faults from kernel mode,
treating these faults as if SIGBUS were always raised, causing the
kernel code to fail with EFAULT.

A future patch adds a knob allowing administrators to give some
processes the ability to create userfaultfd file objects only if they
pass UFFD_USER_MODE_ONLY, reducing the likelihood that these processes
will exploit userfaultfd's ability to delay kernel page faults to open
timing windows for future exploits.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 fs/userfaultfd.c                 | 5 ++++-
 include/uapi/linux/userfaultfd.h | 6 ++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1123089c3d55..986d23b2cd33 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -389,6 +389,9 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	if (ctx->features & UFFD_FEATURE_SIGBUS)
 		goto out;
+	if ((vmf->flags & FAULT_FLAG_USER) == 0 &&
+	    ctx->flags & UFFD_USER_MODE_ONLY)
+		goto out;
 
 	/*
 	 * If it's already released don't get it. This avoids to loop
@@ -1959,7 +1962,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 {
 	struct userfaultfd_ctx *ctx;
 	int fd;
-	static const int uffd_flags = UFFD_SECURE;
+	static const int uffd_flags = UFFD_SECURE | UFFD_USER_MODE_ONLY;
 
 	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
 		return -EPERM;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 12d7d40d7f25..eadd1497e7b5 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -239,4 +239,10 @@ struct uffdio_zeropage {
  * Create a userfaultfd with MAC security checks enabled.
  */
 #define UFFD_SECURE 1
+
+/*
+ * Create a userfaultfd that can handle page faults only in user mode.
+ */
+#define UFFD_USER_MODE_ONLY 2
+
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.23.0.700.g56cf767bdb-goog

^ permalink raw reply related

* [PATCH 6/7] Allow users to require UFFD_SECURE
From: Daniel Colascione @ 2019-10-12 19:16 UTC (permalink / raw)
  To: linux-api, linux-kernel, lokeshgidra, dancol, nnk; +Cc: nosh, timmurray
In-Reply-To: <20191012191602.45649-1-dancol@google.com>

This change adds 2 as an allowable value for
unprivileged_userfaultfd. (Previously, this sysctl could be either 0
or 1.) When unprivileged_userfaultfd is 2, users with CAP_SYS_PTRACE
may create userfaultfd with or without UFFD_SECURE, but users without
CAP_SYS_PTRACE must pass UFFD_SECURE to userfaultfd in order for the
system call to succeed, effectively forcing them to opt into
additional security checks.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 6 ++++--
 fs/userfaultfd.c                        | 4 +++-
 kernel/sysctl.c                         | 2 +-
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 64aeee1009ca..6664eec7bd35 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -842,8 +842,10 @@ unprivileged_userfaultfd
 
 This flag controls whether unprivileged users can use the userfaultfd
 system calls.  Set this to 1 to allow unprivileged users to use the
-userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
-privileged users (with SYS_CAP_PTRACE capability).
+userfaultfd system calls, or set this to 0 to restrict userfaultfd to
+only privileged users (with SYS_CAP_PTRACE capability).  If set to 2,
+unprivileged (non-SYS_CAP_PTRACE) users may use userfaultfd only if
+they pass the UFFD_SECURE, enabling MAC security checks.
 
 The default value is 1.
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 986d23b2cd33..aaed9347973e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1963,8 +1963,10 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	struct userfaultfd_ctx *ctx;
 	int fd;
 	static const int uffd_flags = UFFD_SECURE | UFFD_USER_MODE_ONLY;
+	bool need_cap_check = sysctl_unprivileged_userfaultfd == 0 ||
+		(sysctl_unprivileged_userfaultfd == 2 && !(flags & UFFD_SECURE));
 
-	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
+	if (need_cap_check && !capable(CAP_SYS_PTRACE))
 		return -EPERM;
 
 	BUG_ON(!current->mm);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 00fcea236eba..fc98d5df344e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1738,7 +1738,7 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
-		.extra2		= SYSCTL_ONE,
+		.extra2		= &two,
 	},
 #endif
 	{ }
-- 
2.23.0.700.g56cf767bdb-goog

^ permalink raw reply related

* [PATCH 7/7] Add a new sysctl for limiting userfaultfd to user mode faults
From: Daniel Colascione @ 2019-10-12 19:16 UTC (permalink / raw)
  To: linux-api, linux-kernel, lokeshgidra, dancol, nnk; +Cc: nosh, timmurray
In-Reply-To: <20191012191602.45649-1-dancol@google.com>

Add a new sysctl knob unprivileged_userfaultfd_user_mode_only.
This sysctl can be set to either zero or one. When zero (the default)
the system lets all users call userfaultfd with or without
UFFD_USER_MODE_ONLY, modulo other access controls. When
unprivileged_userfaultfd_user_mode_only is set to one, users without
CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
will fail with EPERM. This facility allows administrators to reduce
the likelihood that an attacker with access to userfaultfd can delay
faulting kernel code to widen timing windows for other exploits.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++
 fs/userfaultfd.c                        | 12 ++++++++++--
 include/linux/userfaultfd_k.h           |  1 +
 kernel/sysctl.c                         |  9 +++++++++
 4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 6664eec7bd35..330fd82b3f4e 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -849,6 +849,19 @@ they pass the UFFD_SECURE, enabling MAC security checks.
 
 The default value is 1.
 
+unprivileged_userfaultfd_user_mode_only
+========================================
+
+This flag controls whether unprivileged users can use the userfaultfd
+system calls to handle page faults in kernel mode.  If set to zero,
+userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo
+unprivileged_userfaultfd above.  If set to one, users without
+SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd
+to succeed.  Prohibiting use of userfaultfd for handling faults from
+kernel mode may make certain vulnerabilities more difficult
+to exploit.
+
+The default value is 0.
 
 user_reserve_kbytes
 ===================
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aaed9347973e..02addd425ab7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -29,6 +29,7 @@
 #include <linux/hugetlb.h>
 
 int sysctl_unprivileged_userfaultfd __read_mostly = 1;
+int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0;
 
 static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
 
@@ -1963,8 +1964,15 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	struct userfaultfd_ctx *ctx;
 	int fd;
 	static const int uffd_flags = UFFD_SECURE | UFFD_USER_MODE_ONLY;
-	bool need_cap_check = sysctl_unprivileged_userfaultfd == 0 ||
-		(sysctl_unprivileged_userfaultfd == 2 && !(flags & UFFD_SECURE));
+	bool need_cap_check = false;
+
+	if (sysctl_unprivileged_userfaultfd == 0 ||
+	    (sysctl_unprivileged_userfaultfd == 2 && !(flags & UFFD_SECURE)))
+		need_cap_check = true;
+
+	if (sysctl_unprivileged_userfaultfd_user_mode_only &&
+	    (flags & UFFD_USER_MODE_ONLY) == 0)
+		need_cap_check = true;
 
 	if (need_cap_check && !capable(CAP_SYS_PTRACE))
 		return -EPERM;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 549c8b0cca52..efe14abb2dc8 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -29,6 +29,7 @@
 #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 
 extern int sysctl_unprivileged_userfaultfd;
+extern int sysctl_unprivileged_userfaultfd_user_mode_only;
 
 extern const struct file_operations userfaultfd_fops;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index fc98d5df344e..4f296676c0ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1740,6 +1740,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &two,
 	},
+	{
+		.procname	= "unprivileged_userfaultfd_user_mode_only",
+		.data		= &sysctl_unprivileged_userfaultfd_user_mode_only,
+		.maxlen		= sizeof(sysctl_unprivileged_userfaultfd_user_mode_only),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 	{ }
 };
-- 
2.23.0.700.g56cf767bdb-goog

^ permalink raw reply related

* Re: [PATCH 1/7 v2] tracefs: Revert ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")
From: Linus Torvalds @ 2019-10-12 22:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linux Kernel Mailing List, Ingo Molnar, Andrew Morton,
	Matthew Garrett, James Morris James Morris, LSM List, Linux API,
	Ben Hutchings, Al Viro
In-Reply-To: <20191012005920.630331484@goodmis.org>

On Fri, Oct 11, 2019 at 5:59 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
>
> I bisected this down to the addition of the proxy_ops into tracefs for
> lockdown. It appears that the allocation of the proxy_ops and then freeing
> it in the destroy_inode callback, is causing havoc with the memory system.
> Reading the documentation about destroy_inode and talking with Linus about
> this, this is buggy and wrong.

Can you still add the explanation about the inode memory leak to this message?

Right now it just says "it's buggy and wrong". True. But doesn't
explain _why_ it is buggy and wrong.

          Linus

^ permalink raw reply

* Re: [PATCH 4/7] Teach SELinux about a new userfaultfd class
From: Andy Lutomirski @ 2019-10-12 23:08 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Linux API, LKML, lokeshgidra, Nick Kralevich, nosh, Tim Murray
In-Reply-To: <20191012191602.45649-5-dancol@google.com>

On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione <dancol@google.com> wrote:
>
> Use the secure anonymous inode LSM hook we just added to let SELinux
> policy place restrictions on userfaultfd use. The create operation
> applies to processes creating new instances of these file objects;
> transfer between processes is covered by restrictions on read, write,
> and ioctl access already checked inside selinux_file_receive.

This is great, and I suspect we'll want it for things like SGX, too.
But the current design seems like it will make it essentially
impossible for SELinux to reference an anon_inode class whose
file_operations are in a module, and moving file_operations out of a
module would be nasty.

Could this instead be keyed off a new struct anon_inode_class, an
enum, or even just a string?

--Andy

^ permalink raw reply

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Andy Lutomirski @ 2019-10-12 23:10 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Linux API, LKML, lokeshgidra, Nick Kralevich, nosh, Tim Murray
In-Reply-To: <20191012191602.45649-4-dancol@google.com>

On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione <dancol@google.com> wrote:
>
> The new secure flag makes userfaultfd use a new "secure" anonymous
> file object instead of the default one, letting security modules
> supervise userfaultfd use.
>
> Requiring that users pass a new flag lets us avoid changing the
> semantics for existing callers.

Is there any good reason not to make this be the default?

The only downside I can see is that it would increase the memory usage
of userfaultfd(), but that doesn't seem like such a big deal.  A
lighter-weight alternative would be to have a single inode shared by
all userfaultfd instances, which would require a somewhat different
internal anon_inode API.

In any event, I don't think that "make me visible to SELinux" should
be a choice that user code makes.

--Andy

^ permalink raw reply

* Re: [PATCH 6/7] Allow users to require UFFD_SECURE
From: Andy Lutomirski @ 2019-10-12 23:12 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Linux API, LKML, lokeshgidra, Nick Kralevich, nosh, Tim Murray
In-Reply-To: <20191012191602.45649-7-dancol@google.com>

On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione <dancol@google.com> wrote:
>
> This change adds 2 as an allowable value for
> unprivileged_userfaultfd. (Previously, this sysctl could be either 0
> or 1.) When unprivileged_userfaultfd is 2, users with CAP_SYS_PTRACE
> may create userfaultfd with or without UFFD_SECURE, but users without
> CAP_SYS_PTRACE must pass UFFD_SECURE to userfaultfd in order for the
> system call to succeed, effectively forcing them to opt into
> additional security checks.

This patch can go away entirely if you make UFFD_SECURE automatic.

^ permalink raw reply

* Re: [PATCH 4/7] Teach SELinux about a new userfaultfd class
From: Daniel Colascione @ 2019-10-13  0:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, LKML, Lokesh Gidra, Nick Kralevich, Nosh Minwalla,
	Tim Murray
In-Reply-To: <CALCETrVmYQ9xikif--RSAWhboY1yj=piEAEuPzisf+b+qEX4uA@mail.gmail.com>

On Sat, Oct 12, 2019 at 4:09 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione <dancol@google.com> wrote:
> >
> > Use the secure anonymous inode LSM hook we just added to let SELinux
> > policy place restrictions on userfaultfd use. The create operation
> > applies to processes creating new instances of these file objects;
> > transfer between processes is covered by restrictions on read, write,
> > and ioctl access already checked inside selinux_file_receive.
>
> This is great, and I suspect we'll want it for things like SGX, too.
> But the current design seems like it will make it essentially
> impossible for SELinux to reference an anon_inode class whose
> file_operations are in a module, and moving file_operations out of a
> module would be nasty.
>
> Could this instead be keyed off a new struct anon_inode_class, an
> enum, or even just a string?

The new LSM hook already receives the string that callers pass to the
anon_inode APIs; modules can look at that instead of the fops if they
want. The reason to pass both the name and the fops through the hook
is to allow LSMs to match using fops comparison (which seems less
prone to breakage) when possible and rely on string matching when it
isn't.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox