Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH 3/7 v2] tracing: Get trace_array reference for available_tracers files
From: Steven Rostedt @ 2019-10-12  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Matthew Garrett,
	James Morris James Morris, LSM List, Linux API, Ben Hutchings,
	Al Viro, stable
In-Reply-To: <20191012005747.210722465@goodmis.org>

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

As instances may have different tracers available, we need to look at the
trace_array descriptor that shows lists the available tracers for the
instance. But there's a race between opening the file and the admin from
deleting the instance. The trace_array_get() needs to be called before
accessing the trace_array.

Cc: stable@vger.kernel.org
Fixes: 607e2ea167e56 ("tracing: Set up infrastructure to allow tracers for instances")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 252f79c435f8..fa7d813b04c6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4355,9 +4355,14 @@ static int show_traces_open(struct inode *inode, struct file *file)
 	if (tracing_disabled)
 		return -ENODEV;
 
+	if (trace_array_get(tr) < 0)
+		return -ENODEV;
+
 	ret = seq_open(file, &show_traces_seq_ops);
-	if (ret)
+	if (ret) {
+		trace_array_put(tr);
 		return ret;
+	}
 
 	m = file->private_data;
 	m->private = tr;
@@ -4365,6 +4370,14 @@ static int show_traces_open(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static int show_traces_release(struct inode *inode, struct file *file)
+{
+	struct trace_array *tr = inode->i_private;
+
+	trace_array_put(tr);
+	return seq_release(inode, file);
+}
+
 static ssize_t
 tracing_write_stub(struct file *filp, const char __user *ubuf,
 		   size_t count, loff_t *ppos)
@@ -4395,8 +4408,8 @@ static const struct file_operations tracing_fops = {
 static const struct file_operations show_traces_fops = {
 	.open		= show_traces_open,
 	.read		= seq_read,
-	.release	= seq_release,
 	.llseek		= seq_lseek,
+	.release	= show_traces_release,
 };
 
 static ssize_t
-- 
2.23.0

^ permalink raw reply related

* [PATCH 2/7 v2] ftrace: Get a reference counter for the trace_array on filter files
From: Steven Rostedt @ 2019-10-12  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Matthew Garrett,
	James Morris James Morris, LSM List, Linux API, Ben Hutchings,
	Al Viro, stable
In-Reply-To: <20191012005747.210722465@goodmis.org>

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

The ftrace set_ftrace_filter and set_ftrace_notrace files are specific for
an instance now. They need to take a reference to the instance otherwise
there could be a race between accessing the files and deleting the instance.

It wasn't until the :mod: caching where these file operations stated
referencing the trace_arry directly.

Cc: stable@vger.kernel.org
Fixes: 673feb9d76ab3 ("ftrace: Add :mod: caching infrastructure to trace_array")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ftrace.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 62a50bf399d6..32c2eb167de0 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -3540,21 +3540,22 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
 	struct ftrace_hash *hash;
 	struct list_head *mod_head;
 	struct trace_array *tr = ops->private;
-	int ret = 0;
+	int ret = -ENOMEM;
 
 	ftrace_ops_init(ops);
 
 	if (unlikely(ftrace_disabled))
 		return -ENODEV;
 
+	if (tr && trace_array_get(tr) < 0)
+		return -ENODEV;
+
 	iter = kzalloc(sizeof(*iter), GFP_KERNEL);
 	if (!iter)
-		return -ENOMEM;
+		goto out;
 
-	if (trace_parser_get_init(&iter->parser, FTRACE_BUFF_MAX)) {
-		kfree(iter);
-		return -ENOMEM;
-	}
+	if (trace_parser_get_init(&iter->parser, FTRACE_BUFF_MAX))
+		goto out;
 
 	iter->ops = ops;
 	iter->flags = flag;
@@ -3584,13 +3585,13 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
 
 		if (!iter->hash) {
 			trace_parser_put(&iter->parser);
-			kfree(iter);
-			ret = -ENOMEM;
 			goto out_unlock;
 		}
 	} else
 		iter->hash = hash;
 
+	ret = 0;
+
 	if (file->f_mode & FMODE_READ) {
 		iter->pg = ftrace_pages_start;
 
@@ -3602,7 +3603,6 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
 			/* Failed */
 			free_ftrace_hash(iter->hash);
 			trace_parser_put(&iter->parser);
-			kfree(iter);
 		}
 	} else
 		file->private_data = iter;
@@ -3610,6 +3610,13 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
  out_unlock:
 	mutex_unlock(&ops->func_hash->regex_lock);
 
+ out:
+	if (ret) {
+		kfree(iter);
+		if (tr)
+			trace_array_put(tr);
+	}
+
 	return ret;
 }
 
@@ -5037,6 +5044,8 @@ int ftrace_regex_release(struct inode *inode, struct file *file)
 
 	mutex_unlock(&iter->ops->func_hash->regex_lock);
 	free_ftrace_hash(iter->hash);
+	if (iter->tr)
+		trace_array_put(iter->tr);
 	kfree(iter);
 
 	return 0;
-- 
2.23.0

^ permalink raw reply related

* [PATCH 1/7 v2] tracefs: Revert ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")
From: Steven Rostedt @ 2019-10-12  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Matthew Garrett,
	James Morris James Morris, LSM List, Linux API, Ben Hutchings,
	Al Viro
In-Reply-To: <20191012005747.210722465@goodmis.org>

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Running the latest kernel through my "make instances" stress tests, I
triggered the following bug (with KASAN and kmemleak enabled):

mkdir invoked oom-killer:
gfp_mask=0x40cd0(GFP_KERNEL|__GFP_COMP|__GFP_RECLAIMABLE), order=0,
oom_score_adj=0
CPU: 1 PID: 2229 Comm: mkdir Not tainted 5.4.0-rc2-test #325
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
Call Trace:
 dump_stack+0x64/0x8c
 dump_header+0x43/0x3b7
 ? trace_hardirqs_on+0x48/0x4a
 oom_kill_process+0x68/0x2d5
 out_of_memory+0x2aa/0x2d0
 __alloc_pages_nodemask+0x96d/0xb67
 __alloc_pages_node+0x19/0x1e
 alloc_slab_page+0x17/0x45
 new_slab+0xd0/0x234
 ___slab_alloc.constprop.86+0x18f/0x336
 ? alloc_inode+0x2c/0x74
 ? irq_trace+0x12/0x1e
 ? tracer_hardirqs_off+0x1d/0xd7
 ? __slab_alloc.constprop.85+0x21/0x53
 __slab_alloc.constprop.85+0x31/0x53
 ? __slab_alloc.constprop.85+0x31/0x53
 ? alloc_inode+0x2c/0x74
 kmem_cache_alloc+0x50/0x179
 ? alloc_inode+0x2c/0x74
 alloc_inode+0x2c/0x74
 new_inode_pseudo+0xf/0x48
 new_inode+0x15/0x25
 tracefs_get_inode+0x23/0x7c
 ? lookup_one_len+0x54/0x6c
 tracefs_create_file+0x53/0x11d
 trace_create_file+0x15/0x33
 event_create_dir+0x2a3/0x34b
 __trace_add_new_event+0x1c/0x26
 event_trace_add_tracer+0x56/0x86
 trace_array_create+0x13e/0x1e1
 instance_mkdir+0x8/0x17
 tracefs_syscall_mkdir+0x39/0x50
 ? get_dname+0x31/0x31
 vfs_mkdir+0x78/0xa3
 do_mkdirat+0x71/0xb0
 sys_mkdir+0x19/0x1b
 do_fast_syscall_32+0xb0/0xed

I bisected this down to the addition of the proxy_ops into tracefs for
lockdown. It appears that the allocation of the proxy_ops and then freeing
it in the destroy_inode callback, is causing havoc with the memory system.
Reading the documentation about destroy_inode and talking with Linus about
this, this is buggy and wrong.

Instead of allocating the proxy_ops (and then having to free it) the checks
should be done by the open functions themselves, and not hack into the
tracefs directory. First revert the tracefs updates for locked_down and then
later we can add the locked_down checks in the kernel/trace files.

Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home

Fixes: ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 fs/tracefs/inode.c | 42 +-----------------------------------------
 1 file changed, 1 insertion(+), 41 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 9fc14e38927f..eeeae0475da9 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -20,7 +20,6 @@
 #include <linux/parser.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
-#include <linux/security.h>

 #define TRACEFS_DEFAULT_MODE	0700

@@ -28,25 +27,6 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;

-static int default_open_file(struct inode *inode, struct file *filp)
-{
-	struct dentry *dentry = filp->f_path.dentry;
-	struct file_operations *real_fops;
-	int ret;
-
-	if (!dentry)
-		return -EINVAL;
-
-	ret = security_locked_down(LOCKDOWN_TRACEFS);
-	if (ret)
-		return ret;
-
-	real_fops = dentry->d_fsdata;
-	if (!real_fops->open)
-		return 0;
-	return real_fops->open(inode, filp);
-}
-
 static ssize_t default_read_file(struct file *file, char __user *buf,
 				 size_t count, loff_t *ppos)
 {
@@ -241,12 +221,6 @@ static int tracefs_apply_options(struct super_block *sb)
 	return 0;
 }

-static void tracefs_destroy_inode(struct inode *inode)
-{
-	if (S_ISREG(inode->i_mode))
-		kfree(inode->i_fop);
-}
-
 static int tracefs_remount(struct super_block *sb, int *flags, char *data)
 {
 	int err;
@@ -283,7 +257,6 @@ static int tracefs_show_options(struct seq_file *m, struct dentry *root)
 static const struct super_operations tracefs_super_operations = {
 	.statfs		= simple_statfs,
 	.remount_fs	= tracefs_remount,
-	.destroy_inode  = tracefs_destroy_inode,
 	.show_options	= tracefs_show_options,
 };

@@ -414,7 +387,6 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 				   struct dentry *parent, void *data,
 				   const struct file_operations *fops)
 {
-	struct file_operations *proxy_fops;
 	struct dentry *dentry;
 	struct inode *inode;

@@ -430,20 +402,8 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	if (unlikely(!inode))
 		return failed_creating(dentry);

-	proxy_fops = kzalloc(sizeof(struct file_operations), GFP_KERNEL);
-	if (unlikely(!proxy_fops)) {
-		iput(inode);
-		return failed_creating(dentry);
-	}
-
-	if (!fops)
-		fops = &tracefs_file_operations;
-
-	dentry->d_fsdata = (void *)fops;
-	memcpy(proxy_fops, fops, sizeof(*proxy_fops));
-	proxy_fops->open = default_open_file;
 	inode->i_mode = mode;
-	inode->i_fop = proxy_fops;
+	inode->i_fop = fops ? fops : &tracefs_file_operations;
 	inode->i_private = data;
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
-- 
2.23.0

^ permalink raw reply related

* [PATCH 0/7 v2] tracing: Fix tracefs lockdown and various clean ups
From: Steven Rostedt @ 2019-10-12  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Matthew Garrett,
	James Morris James Morris, LSM List, Linux API, Ben Hutchings,
	Al Viro

It appears that using destroy_inode() to clean up the proxy_ops that was
used by the lockdown code to have all open calls to the tracefs directory
was totally broken. It caused the inodes to not be cleaned up as the
destroy_inode() method is expected to clean up the inode, and not just what
it allocated as extra.

Linus suggested to get rid of the proxy_ops in tracefs, and just put the
checks in the open functions themselves. This also gives us a bit more fine
grain control to what exactly can be accessed.

Currently, I left the event format files (as they may need to be used by
something other than tracing), and enabled_functions, which shows what
functions are currently being traced. Not sure it is wise to not display
that information.

They can always be locked down later if need be.

Steven Rostedt (VMware) (7):
      tracefs: Revert ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")
      ftrace: Get a reference counter for the trace_array on filter files
      tracing: Get trace_array reference for available_tracers files
      tracing: Have trace events system open call tracing_open_generic_tr()
      tracing: Add tracing_check_open_get_tr()
      tracing: Add some more locked_down checks
      tracing: Do not create tracefs files if tracefs lockdown is in effect

----
 fs/tracefs/inode.c                  |  46 ++----------
 kernel/trace/ftrace.c               |  55 ++++++++++----
 kernel/trace/trace.c                | 138 ++++++++++++++++++++++--------------
 kernel/trace/trace.h                |   2 +
 kernel/trace/trace_dynevent.c       |   4 ++
 kernel/trace/trace_events.c         |  49 +++++--------
 kernel/trace/trace_events_hist.c    |  13 +++-
 kernel/trace/trace_events_trigger.c |   8 ++-
 kernel/trace/trace_kprobe.c         |  12 +++-
 kernel/trace/trace_printk.c         |   7 ++
 kernel/trace/trace_stack.c          |   8 +++
 kernel/trace/trace_stat.c           |   6 +-
 kernel/trace/trace_uprobe.c         |  11 +++
 13 files changed, 220 insertions(+), 139 deletions(-)

^ permalink raw reply

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Steven Rostedt @ 2019-10-11 22:27 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Linus Torvalds, LKML, Matthew Garrett, James Morris James Morris,
	LSM List, Linux API, Ben Hutchings, Al Viro
In-Reply-To: <87tv8f9cr7.fsf@mid.deneb.enyo.de>

On Fri, 11 Oct 2019 23:46:20 +0200
Florian Weimer <fw@deneb.enyo.de> wrote:

> * Steven Rostedt:
> 
> > Once locked down is set, can it ever be undone without rebooting?  
> 
> I think this is the original intent with such patches, yes.  But then
> reality interferes and people add some escape hatch, so that it's
> possible again to load arbitrary kernel modules.  And for servers, you
> can't have a meaningful physical presence check, so you end up with a
> lot of complexity for something that offers absolutely zero gains in
> security.
> 
> The other practical issue is that general-purpose Linux distributions
> cannot prevent kernel downgrades, so even if there's a
> cryptographically signed chain from the firmware to the kernel, you
> can boot last year's kernel, use a root-to-ring-0 exploit to disable
> its particular implementation of lockdown, and then kexec the real
> kernel with lockdown disabled.
> 
> I'm sure that kernel lockdown has applications somewhere, but for
> general-purpose distributions (who usually want to support third-party
> kernel modules), it's an endless source of problems that wouldn't
> exist without it.

I just decided to keep the two separate. The tracing_disable is
permanent (unless you actually do something that writes into kernel
memory to change the variable). When set, there's nothing to clear it.

Thus, I decided not to couple that with lockdown, and let the lockdown
folks do whatever they damn well please ;-)

-- Steve

^ permalink raw reply

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Florian Weimer @ 2019-10-11 21:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, LKML, Matthew Garrett, James Morris James Morris,
	LSM List, Linux API, Ben Hutchings, Al Viro
In-Reply-To: <20191011143610.21bcd9c0@gandalf.local.home>

* Steven Rostedt:

> Once locked down is set, can it ever be undone without rebooting?

I think this is the original intent with such patches, yes.  But then
reality interferes and people add some escape hatch, so that it's
possible again to load arbitrary kernel modules.  And for servers, you
can't have a meaningful physical presence check, so you end up with a
lot of complexity for something that offers absolutely zero gains in
security.

The other practical issue is that general-purpose Linux distributions
cannot prevent kernel downgrades, so even if there's a
cryptographically signed chain from the firmware to the kernel, you
can boot last year's kernel, use a root-to-ring-0 exploit to disable
its particular implementation of lockdown, and then kexec the real
kernel with lockdown disabled.

I'm sure that kernel lockdown has applications somewhere, but for
general-purpose distributions (who usually want to support third-party
kernel modules), it's an endless source of problems that wouldn't
exist without it.

^ permalink raw reply

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Steven Rostedt @ 2019-10-11 21:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Matthew Garrett, James Morris James Morris, LSM List,
	Linux API, Ben Hutchings, Al Viro
In-Reply-To: <CAHk-=wiGtEDhwJab7+tQzmjDssynBruRvgj9NJY2bzNrVzw+0Q@mail.gmail.com>

On Fri, 11 Oct 2019 14:00:50 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Oct 11, 2019 at 1:55 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I guess I can keep it this way. Thoughts?  
> 
> That looks fine to me. I'm still not sure you actually need to do all
> this, but it doesn't look _wrong_.

Yep, I sent this before seeing your other email.

> 
> That said, I still do think that if things are locked down from the
> very get-go, tracefs_create_file() shouldn't even create the files.

Agreed. I forgot to add in my last email:

4) Add the lockdown check to create_file()

> 
> That's mostly an independent thing from the "what about if they exists
> and things got locked down afterwards", though.

Well, I'll be combining it with the tracing_disabled code, which was
there originally to prevent corrupted buffers from causing harm by
being read.

> 
> I do wonder about the whole "well, if you started tracing before
> locking things down, don't you want to see the end results"?
> 

I don't think it hurts to disable it. Although, I don't think reading
the trace event formats will be a issue. I'm not aware of any of them
from giving too much info to the system. And if you really do care
about that, do it at boot up, and with the create_file part, it wont
have any files to read.

-- Steve

^ permalink raw reply

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Steven Rostedt @ 2019-10-11 21:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Matthew Garrett, James Morris James Morris, LSM List,
	Linux API, Ben Hutchings, Al Viro
In-Reply-To: <CAHk-=whC6Ji=fWnjh2+eS4b15TnbsS4VPVtvBOwCy1jjEG_JHQ@mail.gmail.com>

On Fri, 11 Oct 2019 13:46:11 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Oct 11, 2019 at 1:25 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > OK, so I tried this approach, and there's a bit more than just a "few"
> > extra cases that use tracing_open_generic().  
> 
> Yeah, that's more than I would have expected.
> 
> That said, you also did what I consider a somewhat over-done conversion.
> 
> Just do
> 
>    static inline bool tracefs_lockdown_or_disabled(void)
>    { return tracing_disabled || security_locked_down(LOCKDOWN_TRACEFS); }
> 
> and ignore the pointless return value (we know it's EPERM, and we
> really don't care).
> 
> And then several of those conversions just turn into oneliner
> 
> -       if (tracing_disabled)
> +       if (tracefs_lockdown_or_disabled())
>                  return -ENODEV;

OK, so this is similar to what I just sent. But instead I made it into
a function that combines tracing_disabled and the locked_down, plus, it
did the reference update for the trace_array if needed (which in doing
this exercise, I think I found a couple of places that need the ref
count update).

> 
> patches instead.
> 
> I'm also not sure why you bothered with a lot of the files that don't
> currently even have that "tracing_disabled" logic at all. Yeah, they
> show pre-existing tracing info, but even if you locked down _after_
> starting some tracing, that's probably what you want. You just don't
> want people to be able to add new trace events.

I guess just preventing new traces would be good enough (or anything
that records data), as well as seeing that recorded data.

Note, the tracing_disabled was added in case the ring buffer got
corrupted, and we wanted to prevent the system from crashing if someone
read the corrupted ring buffer. In the over 10 years of having that
check, I don't recall a single instance of someone triggering it :-/


> 
> For example, maybe you want to have lockdown after you've booted, but
> still show boot time traces?
> 
> I dunno. I feel like you re-did the original patch, and the original
> patch was designed for "least line impact" rather than for anything
> else.
> 
> I also suspect that if you *actually* do lockdown at early boot (which
> is presumably one common case), then all you need is to do
> 
>     --- a/fs/tracefs/inode.c
>     +++ b/fs/tracefs/inode.c
>     @@ -418,6 +418,9 @@ struct dentry *tracefs_create_file(
>         struct dentry *dentry;
>         struct inode *inode;
> 
>     +   if (security_locked_down(LOCKDOWN_TRACEFS))
>     +           return NULL;
>     +

Sounds reasonable.

>         if (!(mode & S_IFMT))
>                 mode |= S_IFREG;
>         BUG_ON(!S_ISREG(mode));
> 
> and the "open-time check" is purely for "we did lockdown _after_ boot,
> but then you might well want to be able to read the existing trace
> information that you did before you locked down.
> 
> Again - I think trying to emulate exactly what that broken lockdown
> patch did is not really what you should aim for.
> 
> That patch was not only buggy, it really wasn't about what people
> really necessarily _want_, as about "we don't want to deal with
> tracing, so here's a minimal patch that doesn't even work".

OK. My new approach is to:

1) revert the tracefs lockdown patch
2) Add my tracing_check_open_get_tr() patch (and consolidate the calls)
 (fix up anything that should have this too)
3) Add the lockdown to the tracing_check_open_get_tr() routine, and be
   done with it.

-- Steve

^ permalink raw reply

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Linus Torvalds @ 2019-10-11 21:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Matthew Garrett, James Morris James Morris, LSM List,
	Linux API, Ben Hutchings, Al Viro
In-Reply-To: <20191011165455.32666d53@gandalf.local.home>

On Fri, Oct 11, 2019 at 1:55 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I guess I can keep it this way. Thoughts?

That looks fine to me. I'm still not sure you actually need to do all
this, but it doesn't look _wrong_.

That said, I still do think that if things are locked down from the
very get-go, tracefs_create_file() shouldn't even create the files.

That's mostly an independent thing from the "what about if they exists
and things got locked down afterwards", though.

I do wonder about the whole "well, if you started tracing before
locking things down, don't you want to see the end results"?

             Linus

^ permalink raw reply

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Steven Rostedt @ 2019-10-11 20:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Matthew Garrett, James Morris James Morris, LSM List,
	Linux API, Ben Hutchings, Al Viro
In-Reply-To: <20191011162518.2f8c99ca@gandalf.local.home>

On Fri, 11 Oct 2019 16:25:18 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> Here's the diffstat:
> 
>  fs/tracefs/inode.c                  |   42 --------------------
>  kernel/trace/ftrace.c               |   29 +++++++++++++-
>  kernel/trace/trace.c                |   73 ++++++++++++++++++++++++++++++++++--
>  kernel/trace/trace_dynevent.c       |    5 ++
>  kernel/trace/trace_events.c         |   10 ++++
>  kernel/trace/trace_events_hist.c    |   11 +++++
>  kernel/trace/trace_events_trigger.c |    8 +++
>  kernel/trace/trace_kprobe.c         |   11 +++++
>  kernel/trace/trace_printk.c         |    7 +++
>  kernel/trace/trace_stack.c          |    8 +++
>  kernel/trace/trace_stat.c           |    6 ++
>  kernel/trace/trace_uprobe.c         |   11 +++++
>  12 files changed, 173 insertions(+), 48 deletions(-)
> 
> Compared to the patch with the full wrapper:
> 
>  fs/tracefs/inode.c | 153 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 135 insertions(+), 18 deletions(-)
> 

Consolidating some of the functions did a little better on the deletion
front:

 fs/tracefs/inode.c                  |   42 ------------
 kernel/trace/ftrace.c               |   30 +++++++-
 kernel/trace/trace.c                |  122 +++++++++++++++++++++---------------
 kernel/trace/trace.h                |    1 
 kernel/trace/trace_dynevent.c       |    5 +
 kernel/trace/trace_events.c         |   41 ++++--------
 kernel/trace/trace_events_hist.c    |   11 +++
 kernel/trace/trace_events_trigger.c |    8 ++
 kernel/trace/trace_kprobe.c         |   11 +++
 kernel/trace/trace_printk.c         |    7 ++
 kernel/trace/trace_stack.c          |    8 ++
 kernel/trace/trace_stat.c           |    6 +
 kernel/trace/trace_uprobe.c         |   11 +++
 13 files changed, 181 insertions(+), 122 deletions(-)

I guess I can keep it this way. Thoughts?

-- Steve


diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 9fc14e38927f..eeeae0475da9 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -20,7 +20,6 @@
 #include <linux/parser.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
-#include <linux/security.h>
 
 #define TRACEFS_DEFAULT_MODE	0700
 
@@ -28,25 +27,6 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;
 
-static int default_open_file(struct inode *inode, struct file *filp)
-{
-	struct dentry *dentry = filp->f_path.dentry;
-	struct file_operations *real_fops;
-	int ret;
-
-	if (!dentry)
-		return -EINVAL;
-
-	ret = security_locked_down(LOCKDOWN_TRACEFS);
-	if (ret)
-		return ret;
-
-	real_fops = dentry->d_fsdata;
-	if (!real_fops->open)
-		return 0;
-	return real_fops->open(inode, filp);
-}
-
 static ssize_t default_read_file(struct file *file, char __user *buf,
 				 size_t count, loff_t *ppos)
 {
@@ -241,12 +221,6 @@ static int tracefs_apply_options(struct super_block *sb)
 	return 0;
 }
 
-static void tracefs_destroy_inode(struct inode *inode)
-{
-	if (S_ISREG(inode->i_mode))
-		kfree(inode->i_fop);
-}
-
 static int tracefs_remount(struct super_block *sb, int *flags, char *data)
 {
 	int err;
@@ -283,7 +257,6 @@ static int tracefs_show_options(struct seq_file *m, struct dentry *root)
 static const struct super_operations tracefs_super_operations = {
 	.statfs		= simple_statfs,
 	.remount_fs	= tracefs_remount,
-	.destroy_inode  = tracefs_destroy_inode,
 	.show_options	= tracefs_show_options,
 };
 
@@ -414,7 +387,6 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 				   struct dentry *parent, void *data,
 				   const struct file_operations *fops)
 {
-	struct file_operations *proxy_fops;
 	struct dentry *dentry;
 	struct inode *inode;
 
@@ -430,20 +402,8 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	if (unlikely(!inode))
 		return failed_creating(dentry);
 
-	proxy_fops = kzalloc(sizeof(struct file_operations), GFP_KERNEL);
-	if (unlikely(!proxy_fops)) {
-		iput(inode);
-		return failed_creating(dentry);
-	}
-
-	if (!fops)
-		fops = &tracefs_file_operations;
-
-	dentry->d_fsdata = (void *)fops;
-	memcpy(proxy_fops, fops, sizeof(*proxy_fops));
-	proxy_fops->open = default_open_file;
 	inode->i_mode = mode;
-	inode->i_fop = proxy_fops;
+	inode->i_fop = fops ? fops : &tracefs_file_operations;
 	inode->i_private = data;
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 62a50bf399d6..7bcbfef9ab56 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -18,6 +18,7 @@
 #include <linux/clocksource.h>
 #include <linux/sched/task.h>
 #include <linux/kallsyms.h>
+#include <linux/security.h>
 #include <linux/seq_file.h>
 #include <linux/tracefs.h>
 #include <linux/hardirq.h>
@@ -3486,6 +3487,11 @@ static int
 ftrace_avail_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_iterator *iter;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	if (unlikely(ftrace_disabled))
 		return -ENODEV;
@@ -3504,6 +3510,11 @@ static int
 ftrace_enabled_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_iterator *iter;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	iter = __seq_open_private(file, &show_ftrace_seq_ops, sizeof(*iter));
 	if (!iter)
@@ -3540,7 +3551,11 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
 	struct ftrace_hash *hash;
 	struct list_head *mod_head;
 	struct trace_array *tr = ops->private;
-	int ret = 0;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	ftrace_ops_init(ops);
 
@@ -3618,6 +3633,7 @@ ftrace_filter_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_ops *ops = inode->i_private;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_regex_open(ops,
 			FTRACE_ITER_FILTER | FTRACE_ITER_DO_PROBES,
 			inode, file);
@@ -3628,6 +3644,7 @@ ftrace_notrace_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_ops *ops = inode->i_private;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_regex_open(ops, FTRACE_ITER_NOTRACE,
 				 inode, file);
 }
@@ -5194,9 +5211,13 @@ static int
 __ftrace_graph_open(struct inode *inode, struct file *file,
 		    struct ftrace_graph_data *fgd)
 {
-	int ret = 0;
+	int ret;
 	struct ftrace_hash *new_hash = NULL;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (file->f_mode & FMODE_WRITE) {
 		const int size_bits = FTRACE_HASH_DEFAULT_BITS;
 
@@ -6537,8 +6558,9 @@ ftrace_pid_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	int ret = 0;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	if ((file->f_mode & FMODE_WRITE) &&
 	    (file->f_flags & O_TRUNC))
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 252f79c435f8..d6e467281bbc 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -17,6 +17,7 @@
 #include <linux/stacktrace.h>
 #include <linux/writeback.h>
 #include <linux/kallsyms.h>
+#include <linux/security.h>
 #include <linux/seq_file.h>
 #include <linux/notifier.h>
 #include <linux/irqflags.h>
@@ -304,6 +305,23 @@ void trace_array_put(struct trace_array *this_tr)
 	mutex_unlock(&trace_types_lock);
 }
 
+int tracing_check_open_get_tr(struct trace_array *tr)
+{
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
+	if (tracing_disabled)
+		return -ENODEV;
+
+	if (tr && trace_array_get(tr) < 0)
+		return -ENODEV;
+
+	return 0;
+}
+
 int call_filter_check_discard(struct trace_event_call *call, void *rec,
 			      struct ring_buffer *buffer,
 			      struct ring_buffer_event *event)
@@ -4140,8 +4158,11 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
 
 int tracing_open_generic(struct inode *inode, struct file *filp)
 {
-	if (tracing_disabled)
-		return -ENODEV;
+	int ret;
+
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
 
 	filp->private_data = inode->i_private;
 	return 0;
@@ -4159,12 +4180,11 @@ bool tracing_is_disabled(void)
 static int tracing_open_generic_tr(struct inode *inode, struct file *filp)
 {
 	struct trace_array *tr = inode->i_private;
+	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	filp->private_data = inode->i_private;
 
@@ -4233,10 +4253,11 @@ static int tracing_open(struct inode *inode, struct file *file)
 {
 	struct trace_array *tr = inode->i_private;
 	struct trace_iterator *iter;
-	int ret = 0;
+	int ret;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	/* If this file was open for write, then erase contents */
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
@@ -4352,8 +4373,9 @@ static int show_traces_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
 
 	ret = seq_open(file, &show_traces_seq_ops);
 	if (ret)
@@ -4697,11 +4719,9 @@ static int tracing_trace_options_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	ret = single_open(file, tracing_trace_options_show, inode->i_private);
 	if (ret < 0)
@@ -5038,8 +5058,11 @@ static const struct seq_operations tracing_saved_tgids_seq_ops = {
 
 static int tracing_saved_tgids_open(struct inode *inode, struct file *filp)
 {
-	if (tracing_disabled)
-		return -ENODEV;
+	int ret;
+
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
 
 	return seq_open(filp, &tracing_saved_tgids_seq_ops);
 }
@@ -5115,8 +5138,11 @@ static const struct seq_operations tracing_saved_cmdlines_seq_ops = {
 
 static int tracing_saved_cmdlines_open(struct inode *inode, struct file *filp)
 {
-	if (tracing_disabled)
-		return -ENODEV;
+	int ret;
+
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
 
 	return seq_open(filp, &tracing_saved_cmdlines_seq_ops);
 }
@@ -5280,8 +5306,11 @@ static const struct seq_operations tracing_eval_map_seq_ops = {
 
 static int tracing_eval_map_open(struct inode *inode, struct file *filp)
 {
-	if (tracing_disabled)
-		return -ENODEV;
+	int ret;
+
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
 
 	return seq_open(filp, &tracing_eval_map_seq_ops);
 }
@@ -5804,13 +5833,11 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp)
 {
 	struct trace_array *tr = inode->i_private;
 	struct trace_iterator *iter;
-	int ret = 0;
-
-	if (tracing_disabled)
-		return -ENODEV;
+	int ret;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	mutex_lock(&trace_types_lock);
 
@@ -6547,11 +6574,9 @@ static int tracing_clock_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr))
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	ret = single_open(file, tracing_clock_show, inode->i_private);
 	if (ret < 0)
@@ -6581,11 +6606,9 @@ static int tracing_time_stamp_mode_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr))
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	ret = single_open(file, tracing_time_stamp_mode_show, inode->i_private);
 	if (ret < 0)
@@ -6638,10 +6661,11 @@ static int tracing_snapshot_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	struct trace_iterator *iter;
 	struct seq_file *m;
-	int ret = 0;
+	int ret;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	if (file->f_mode & FMODE_READ) {
 		iter = __tracing_open(inode, file, true);
@@ -6786,6 +6810,7 @@ static int snapshot_raw_open(struct inode *inode, struct file *filp)
 	struct ftrace_buffer_info *info;
 	int ret;
 
+	/* The following checks for tracefs lockdown */
 	ret = tracing_buffers_open(inode, filp);
 	if (ret < 0)
 		return ret;
@@ -7105,8 +7130,9 @@ static int tracing_err_log_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret = 0;
 
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	/* If this file was opened for write, then erase contents */
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC))
@@ -7157,11 +7183,9 @@ static int tracing_buffers_open(struct inode *inode, struct file *filp)
 	struct ftrace_buffer_info *info;
 	int ret;
 
-	if (tracing_disabled)
-		return -ENODEV;
-
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
 
 	info = kzalloc(sizeof(*info), GFP_KERNEL);
 	if (!info) {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index f801d154ff6a..fc41971306c1 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -338,6 +338,7 @@ extern struct mutex trace_types_lock;
 
 extern int trace_array_get(struct trace_array *tr);
 extern void trace_array_put(struct trace_array *tr);
+extern int tracing_check_open_get_tr(struct trace_array *tr);
 
 extern int tracing_set_time_stamp_abs(struct trace_array *tr, bool abs);
 extern int tracing_set_clock(struct trace_array *tr, const char *clockstr);
diff --git a/kernel/trace/trace_dynevent.c b/kernel/trace/trace_dynevent.c
index a41fed46c285..74df50375fd9 100644
--- a/kernel/trace/trace_dynevent.c
+++ b/kernel/trace/trace_dynevent.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2018 Masami Hiramatsu <mhiramat@kernel.org>
  */
 
+#include <linux/security.h>
 #include <linux/debugfs.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
@@ -174,6 +175,10 @@ static int dyn_event_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = tracing_check_open_get_tr(NULL);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(NULL);
 		if (ret < 0)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index b89cdfe20bc1..24a642489437 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -12,6 +12,7 @@
 #define pr_fmt(fmt) fmt
 
 #include <linux/workqueue.h>
+#include <linux/security.h>
 #include <linux/spinlock.h>
 #include <linux/kthread.h>
 #include <linux/tracefs.h>
@@ -1294,6 +1295,10 @@ static int trace_format_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	ret = seq_open(file, &trace_format_seq_ops);
 	if (ret < 0)
 		return ret;
@@ -1391,9 +1396,6 @@ static int subsystem_open(struct inode *inode, struct file *filp)
 	struct trace_array *tr;
 	int ret;
 
-	if (tracing_is_disabled())
-		return -ENODEV;
-
 	/* Make sure the system still exists */
 	mutex_lock(&event_mutex);
 	mutex_lock(&trace_types_lock);
@@ -1420,16 +1422,9 @@ static int subsystem_open(struct inode *inode, struct file *filp)
 	WARN_ON(!dir);
 
 	/* Still need to increment the ref count of the system */
-	if (trace_array_get(tr) < 0) {
-		put_system(dir);
-		return -ENODEV;
-	}
-
-	ret = tracing_open_generic(inode, filp);
-	if (ret < 0) {
-		trace_array_put(tr);
+	ret = tracing_open_generic_tr(inode, filp);
+	if (ret < 0)
 		put_system(dir);
-	}
 
 	return ret;
 }
@@ -1440,28 +1435,17 @@ static int system_tr_open(struct inode *inode, struct file *filp)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	if (tracing_is_disabled())
-		return -ENODEV;
-
-	if (trace_array_get(tr) < 0)
-		return -ENODEV;
-
 	/* Make a temporary dir that has no system but points to tr */
 	dir = kzalloc(sizeof(*dir), GFP_KERNEL);
-	if (!dir) {
-		trace_array_put(tr);
+	if (!dir)
 		return -ENOMEM;
-	}
-
-	dir->tr = tr;
 
-	ret = tracing_open_generic(inode, filp);
+	ret = tracing_open_generic_tr(inode, filp);
 	if (ret < 0) {
-		trace_array_put(tr);
 		kfree(dir);
 		return ret;
 	}
-
+	dir->tr = tr;
 	filp->private_data = dir;
 
 	return 0;
@@ -1771,6 +1755,10 @@ ftrace_event_open(struct inode *inode, struct file *file,
 	struct seq_file *m;
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	ret = seq_open(file, seq_ops);
 	if (ret < 0)
 		return ret;
@@ -1795,6 +1783,7 @@ ftrace_event_avail_open(struct inode *inode, struct file *file)
 {
 	const struct seq_operations *seq_ops = &show_event_seq_ops;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_event_open(inode, file, seq_ops);
 }
 
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 9468bd8d44a2..71dfe6889d62 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -7,6 +7,7 @@
 
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/security.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include <linux/stacktrace.h>
@@ -1448,6 +1449,10 @@ static int synth_events_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(&synth_event_ops);
 		if (ret < 0)
@@ -5515,6 +5520,12 @@ static int hist_show(struct seq_file *m, void *v)
 
 static int event_hist_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return single_open(file, hist_show, file);
 }
 
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index 2a2912cb4533..2cd53ca21b51 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2013 Tom Zanussi <tom.zanussi@linux.intel.com>
  */
 
+#include <linux/security.h>
 #include <linux/module.h>
 #include <linux/ctype.h>
 #include <linux/mutex.h>
@@ -173,7 +174,11 @@ static const struct seq_operations event_triggers_seq_ops = {
 
 static int event_trigger_regex_open(struct inode *inode, struct file *file)
 {
-	int ret = 0;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	mutex_lock(&event_mutex);
 
@@ -292,6 +297,7 @@ event_trigger_write(struct file *filp, const char __user *ubuf,
 static int
 event_trigger_open(struct inode *inode, struct file *filp)
 {
+	/* Checks for tracefs lockdown */
 	return event_trigger_regex_open(inode, filp);
 }
 
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 324ffbea3556..e4422746cb4c 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -7,6 +7,7 @@
  */
 #define pr_fmt(fmt)	"trace_kprobe: " fmt
 
+#include <linux/security.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
 #include <linux/rculist.h>
@@ -936,6 +937,10 @@ static int probes_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(&trace_kprobe_ops);
 		if (ret < 0)
@@ -988,6 +993,12 @@ static const struct seq_operations profile_seq_op = {
 
 static int profile_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &profile_seq_op);
 }
 
diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c
index c3fd849d4a8f..d4e31e969206 100644
--- a/kernel/trace/trace_printk.c
+++ b/kernel/trace/trace_printk.c
@@ -6,6 +6,7 @@
  *
  */
 #include <linux/seq_file.h>
+#include <linux/security.h>
 #include <linux/uaccess.h>
 #include <linux/kernel.h>
 #include <linux/ftrace.h>
@@ -348,6 +349,12 @@ static const struct seq_operations show_format_seq_ops = {
 static int
 ftrace_formats_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &show_format_seq_ops);
 }
 
diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index ec9a34a97129..4df9a209f7ca 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -5,6 +5,7 @@
  */
 #include <linux/sched/task_stack.h>
 #include <linux/stacktrace.h>
+#include <linux/security.h>
 #include <linux/kallsyms.h>
 #include <linux/seq_file.h>
 #include <linux/spinlock.h>
@@ -470,6 +471,12 @@ static const struct seq_operations stack_trace_seq_ops = {
 
 static int stack_trace_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &stack_trace_seq_ops);
 }
 
@@ -487,6 +494,7 @@ stack_trace_filter_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_ops *ops = inode->i_private;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_regex_open(ops, FTRACE_ITER_FILTER,
 				 inode, file);
 }
diff --git a/kernel/trace/trace_stat.c b/kernel/trace/trace_stat.c
index 75bf1bcb4a8a..9ab0a1a7ad5e 100644
--- a/kernel/trace/trace_stat.c
+++ b/kernel/trace/trace_stat.c
@@ -9,7 +9,7 @@
  *
  */
 
-
+#include <linux/security.h>
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/rbtree.h>
@@ -238,6 +238,10 @@ static int tracing_stat_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	struct stat_session *session = inode->i_private;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	ret = stat_seq_init(session);
 	if (ret)
 		return ret;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index dd884341f5c5..352073d36585 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -7,6 +7,7 @@
  */
 #define pr_fmt(fmt)	"trace_uprobe: " fmt
 
+#include <linux/security.h>
 #include <linux/ctype.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
@@ -769,6 +770,10 @@ static int probes_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(&trace_uprobe_ops);
 		if (ret)
@@ -818,6 +823,12 @@ static const struct seq_operations profile_seq_op = {
 
 static int profile_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &profile_seq_op);
 }
 

^ permalink raw reply related

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Linus Torvalds @ 2019-10-11 20:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Matthew Garrett, James Morris James Morris, LSM List,
	Linux API, Ben Hutchings, Al Viro
In-Reply-To: <20191011162518.2f8c99ca@gandalf.local.home>

On Fri, Oct 11, 2019 at 1:25 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> OK, so I tried this approach, and there's a bit more than just a "few"
> extra cases that use tracing_open_generic().

Yeah, that's more than I would have expected.

That said, you also did what I consider a somewhat over-done conversion.

Just do

   static inline bool tracefs_lockdown_or_disabled(void)
   { return tracing_disabled || security_locked_down(LOCKDOWN_TRACEFS); }

and ignore the pointless return value (we know it's EPERM, and we
really don't care).

And then several of those conversions just turn into oneliner

-       if (tracing_disabled)
+       if (tracefs_lockdown_or_disabled())
                 return -ENODEV;

patches instead.

I'm also not sure why you bothered with a lot of the files that don't
currently even have that "tracing_disabled" logic at all. Yeah, they
show pre-existing tracing info, but even if you locked down _after_
starting some tracing, that's probably what you want. You just don't
want people to be able to add new trace events.

For example, maybe you want to have lockdown after you've booted, but
still show boot time traces?

I dunno. I feel like you re-did the original patch, and the original
patch was designed for "least line impact" rather than for anything
else.

I also suspect that if you *actually* do lockdown at early boot (which
is presumably one common case), then all you need is to do

    --- a/fs/tracefs/inode.c
    +++ b/fs/tracefs/inode.c
    @@ -418,6 +418,9 @@ struct dentry *tracefs_create_file(
        struct dentry *dentry;
        struct inode *inode;

    +   if (security_locked_down(LOCKDOWN_TRACEFS))
    +           return NULL;
    +
        if (!(mode & S_IFMT))
                mode |= S_IFREG;
        BUG_ON(!S_ISREG(mode));

and the "open-time check" is purely for "we did lockdown _after_ boot,
but then you might well want to be able to read the existing trace
information that you did before you locked down.

Again - I think trying to emulate exactly what that broken lockdown
patch did is not really what you should aim for.

That patch was not only buggy, it really wasn't about what people
really necessarily _want_, as about "we don't want to deal with
tracing, so here's a minimal patch that doesn't even work".

          Linus

^ permalink raw reply

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Steven Rostedt @ 2019-10-11 20:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Matthew Garrett, James Morris James Morris, LSM List,
	Linux API, Ben Hutchings, Al Viro
In-Reply-To: <CAHk-=wj7fGPKUspr579Cii-w_y60PtRaiDgKuxVtBAMK0VNNkA@mail.gmail.com>

On Fri, 11 Oct 2019 11:20:30 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So from a quick glance, just making tracing_open_generic() do the
> lockdown testing will take care of most cases.  Add a few other cases
> to fill up the whole set, and your'e done.
> 
> Willing to do that instead?

OK, so I tried this approach, and there's a bit more than just a "few"
extra cases that use tracing_open_generic(). Below is the full patch.

Here's the diffstat:

 fs/tracefs/inode.c                  |   42 --------------------
 kernel/trace/ftrace.c               |   29 +++++++++++++-
 kernel/trace/trace.c                |   73 ++++++++++++++++++++++++++++++++++--
 kernel/trace/trace_dynevent.c       |    5 ++
 kernel/trace/trace_events.c         |   10 ++++
 kernel/trace/trace_events_hist.c    |   11 +++++
 kernel/trace/trace_events_trigger.c |    8 +++
 kernel/trace/trace_kprobe.c         |   11 +++++
 kernel/trace/trace_printk.c         |    7 +++
 kernel/trace/trace_stack.c          |    8 +++
 kernel/trace/trace_stat.c           |    6 ++
 kernel/trace/trace_uprobe.c         |   11 +++++
 12 files changed, 173 insertions(+), 48 deletions(-)

Compared to the patch with the full wrapper:

 fs/tracefs/inode.c | 153 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 135 insertions(+), 18 deletions(-)

I'm thinking that the full wrapper may not be as bad. But then again, I
could probably clean up some of this code and have a single check for
both the lockdown and the "tracing_disabled" check.

-- Steve

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 9fc14e38927f..eeeae0475da9 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -20,7 +20,6 @@
 #include <linux/parser.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
-#include <linux/security.h>
 
 #define TRACEFS_DEFAULT_MODE	0700
 
@@ -28,25 +27,6 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;
 
-static int default_open_file(struct inode *inode, struct file *filp)
-{
-	struct dentry *dentry = filp->f_path.dentry;
-	struct file_operations *real_fops;
-	int ret;
-
-	if (!dentry)
-		return -EINVAL;
-
-	ret = security_locked_down(LOCKDOWN_TRACEFS);
-	if (ret)
-		return ret;
-
-	real_fops = dentry->d_fsdata;
-	if (!real_fops->open)
-		return 0;
-	return real_fops->open(inode, filp);
-}
-
 static ssize_t default_read_file(struct file *file, char __user *buf,
 				 size_t count, loff_t *ppos)
 {
@@ -241,12 +221,6 @@ static int tracefs_apply_options(struct super_block *sb)
 	return 0;
 }
 
-static void tracefs_destroy_inode(struct inode *inode)
-{
-	if (S_ISREG(inode->i_mode))
-		kfree(inode->i_fop);
-}
-
 static int tracefs_remount(struct super_block *sb, int *flags, char *data)
 {
 	int err;
@@ -283,7 +257,6 @@ static int tracefs_show_options(struct seq_file *m, struct dentry *root)
 static const struct super_operations tracefs_super_operations = {
 	.statfs		= simple_statfs,
 	.remount_fs	= tracefs_remount,
-	.destroy_inode  = tracefs_destroy_inode,
 	.show_options	= tracefs_show_options,
 };
 
@@ -414,7 +387,6 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 				   struct dentry *parent, void *data,
 				   const struct file_operations *fops)
 {
-	struct file_operations *proxy_fops;
 	struct dentry *dentry;
 	struct inode *inode;
 
@@ -430,20 +402,8 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	if (unlikely(!inode))
 		return failed_creating(dentry);
 
-	proxy_fops = kzalloc(sizeof(struct file_operations), GFP_KERNEL);
-	if (unlikely(!proxy_fops)) {
-		iput(inode);
-		return failed_creating(dentry);
-	}
-
-	if (!fops)
-		fops = &tracefs_file_operations;
-
-	dentry->d_fsdata = (void *)fops;
-	memcpy(proxy_fops, fops, sizeof(*proxy_fops));
-	proxy_fops->open = default_open_file;
 	inode->i_mode = mode;
-	inode->i_fop = proxy_fops;
+	inode->i_fop = fops ? fops : &tracefs_file_operations;
 	inode->i_private = data;
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 62a50bf399d6..326253b4de93 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -18,6 +18,7 @@
 #include <linux/clocksource.h>
 #include <linux/sched/task.h>
 #include <linux/kallsyms.h>
+#include <linux/security.h>
 #include <linux/seq_file.h>
 #include <linux/tracefs.h>
 #include <linux/hardirq.h>
@@ -3486,6 +3487,11 @@ static int
 ftrace_avail_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_iterator *iter;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	if (unlikely(ftrace_disabled))
 		return -ENODEV;
@@ -3504,6 +3510,11 @@ static int
 ftrace_enabled_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_iterator *iter;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	iter = __seq_open_private(file, &show_ftrace_seq_ops, sizeof(*iter));
 	if (!iter)
@@ -3540,7 +3551,11 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
 	struct ftrace_hash *hash;
 	struct list_head *mod_head;
 	struct trace_array *tr = ops->private;
-	int ret = 0;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	ftrace_ops_init(ops);
 
@@ -3618,6 +3633,7 @@ ftrace_filter_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_ops *ops = inode->i_private;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_regex_open(ops,
 			FTRACE_ITER_FILTER | FTRACE_ITER_DO_PROBES,
 			inode, file);
@@ -3628,6 +3644,7 @@ ftrace_notrace_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_ops *ops = inode->i_private;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_regex_open(ops, FTRACE_ITER_NOTRACE,
 				 inode, file);
 }
@@ -5194,9 +5211,13 @@ static int
 __ftrace_graph_open(struct inode *inode, struct file *file,
 		    struct ftrace_graph_data *fgd)
 {
-	int ret = 0;
+	int ret;
 	struct ftrace_hash *new_hash = NULL;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (file->f_mode & FMODE_WRITE) {
 		const int size_bits = FTRACE_HASH_DEFAULT_BITS;
 
@@ -6537,6 +6558,10 @@ ftrace_pid_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	int ret = 0;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (trace_array_get(tr) < 0)
 		return -ENODEV;
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 252f79c435f8..4be1be27d064 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -17,6 +17,7 @@
 #include <linux/stacktrace.h>
 #include <linux/writeback.h>
 #include <linux/kallsyms.h>
+#include <linux/security.h>
 #include <linux/seq_file.h>
 #include <linux/notifier.h>
 #include <linux/irqflags.h>
@@ -4140,6 +4141,12 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
 
 int tracing_open_generic(struct inode *inode, struct file *filp)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (tracing_disabled)
 		return -ENODEV;
 
@@ -4159,6 +4166,11 @@ bool tracing_is_disabled(void)
 static int tracing_open_generic_tr(struct inode *inode, struct file *filp)
 {
 	struct trace_array *tr = inode->i_private;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	if (tracing_disabled)
 		return -ENODEV;
@@ -4233,7 +4245,11 @@ static int tracing_open(struct inode *inode, struct file *file)
 {
 	struct trace_array *tr = inode->i_private;
 	struct trace_iterator *iter;
-	int ret = 0;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	if (trace_array_get(tr) < 0)
 		return -ENODEV;
@@ -4352,6 +4368,10 @@ static int show_traces_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (tracing_disabled)
 		return -ENODEV;
 
@@ -4697,6 +4717,10 @@ static int tracing_trace_options_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (tracing_disabled)
 		return -ENODEV;
 
@@ -5038,6 +5062,12 @@ static const struct seq_operations tracing_saved_tgids_seq_ops = {
 
 static int tracing_saved_tgids_open(struct inode *inode, struct file *filp)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (tracing_disabled)
 		return -ENODEV;
 
@@ -5115,6 +5145,12 @@ static const struct seq_operations tracing_saved_cmdlines_seq_ops = {
 
 static int tracing_saved_cmdlines_open(struct inode *inode, struct file *filp)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (tracing_disabled)
 		return -ENODEV;
 
@@ -5280,6 +5316,12 @@ static const struct seq_operations tracing_eval_map_seq_ops = {
 
 static int tracing_eval_map_open(struct inode *inode, struct file *filp)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (tracing_disabled)
 		return -ENODEV;
 
@@ -5804,7 +5846,11 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp)
 {
 	struct trace_array *tr = inode->i_private;
 	struct trace_iterator *iter;
-	int ret = 0;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	if (tracing_disabled)
 		return -ENODEV;
@@ -6547,6 +6593,10 @@ static int tracing_clock_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (tracing_disabled)
 		return -ENODEV;
 
@@ -6581,6 +6631,10 @@ static int tracing_time_stamp_mode_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (tracing_disabled)
 		return -ENODEV;
 
@@ -6638,7 +6692,11 @@ static int tracing_snapshot_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	struct trace_iterator *iter;
 	struct seq_file *m;
-	int ret = 0;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	if (trace_array_get(tr) < 0)
 		return -ENODEV;
@@ -6786,6 +6844,7 @@ static int snapshot_raw_open(struct inode *inode, struct file *filp)
 	struct ftrace_buffer_info *info;
 	int ret;
 
+	/* The following checks for tracefs lockdown */
 	ret = tracing_buffers_open(inode, filp);
 	if (ret < 0)
 		return ret;
@@ -7105,6 +7164,10 @@ static int tracing_err_log_open(struct inode *inode, struct file *file)
 	struct trace_array *tr = inode->i_private;
 	int ret = 0;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (trace_array_get(tr) < 0)
 		return -ENODEV;
 
@@ -7157,6 +7220,10 @@ static int tracing_buffers_open(struct inode *inode, struct file *filp)
 	struct ftrace_buffer_info *info;
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if (tracing_disabled)
 		return -ENODEV;
 
diff --git a/kernel/trace/trace_dynevent.c b/kernel/trace/trace_dynevent.c
index a41fed46c285..7a731416bbdd 100644
--- a/kernel/trace/trace_dynevent.c
+++ b/kernel/trace/trace_dynevent.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2018 Masami Hiramatsu <mhiramat@kernel.org>
  */
 
+#include <linux/security.h>
 #include <linux/debugfs.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
@@ -174,6 +175,10 @@ static int dyn_event_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(NULL);
 		if (ret < 0)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index b89cdfe20bc1..cc02257fd6df 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -12,6 +12,7 @@
 #define pr_fmt(fmt) fmt
 
 #include <linux/workqueue.h>
+#include <linux/security.h>
 #include <linux/spinlock.h>
 #include <linux/kthread.h>
 #include <linux/tracefs.h>
@@ -1294,6 +1295,10 @@ static int trace_format_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	ret = seq_open(file, &trace_format_seq_ops);
 	if (ret < 0)
 		return ret;
@@ -1771,6 +1776,10 @@ ftrace_event_open(struct inode *inode, struct file *file,
 	struct seq_file *m;
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	ret = seq_open(file, seq_ops);
 	if (ret < 0)
 		return ret;
@@ -1795,6 +1804,7 @@ ftrace_event_avail_open(struct inode *inode, struct file *file)
 {
 	const struct seq_operations *seq_ops = &show_event_seq_ops;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_event_open(inode, file, seq_ops);
 }
 
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 9468bd8d44a2..71dfe6889d62 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -7,6 +7,7 @@
 
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/security.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include <linux/stacktrace.h>
@@ -1448,6 +1449,10 @@ static int synth_events_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(&synth_event_ops);
 		if (ret < 0)
@@ -5515,6 +5520,12 @@ static int hist_show(struct seq_file *m, void *v)
 
 static int event_hist_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return single_open(file, hist_show, file);
 }
 
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index 2a2912cb4533..2cd53ca21b51 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2013 Tom Zanussi <tom.zanussi@linux.intel.com>
  */
 
+#include <linux/security.h>
 #include <linux/module.h>
 #include <linux/ctype.h>
 #include <linux/mutex.h>
@@ -173,7 +174,11 @@ static const struct seq_operations event_triggers_seq_ops = {
 
 static int event_trigger_regex_open(struct inode *inode, struct file *file)
 {
-	int ret = 0;
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
 
 	mutex_lock(&event_mutex);
 
@@ -292,6 +297,7 @@ event_trigger_write(struct file *filp, const char __user *ubuf,
 static int
 event_trigger_open(struct inode *inode, struct file *filp)
 {
+	/* Checks for tracefs lockdown */
 	return event_trigger_regex_open(inode, filp);
 }
 
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 324ffbea3556..e4422746cb4c 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -7,6 +7,7 @@
  */
 #define pr_fmt(fmt)	"trace_kprobe: " fmt
 
+#include <linux/security.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
 #include <linux/rculist.h>
@@ -936,6 +937,10 @@ static int probes_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(&trace_kprobe_ops);
 		if (ret < 0)
@@ -988,6 +993,12 @@ static const struct seq_operations profile_seq_op = {
 
 static int profile_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &profile_seq_op);
 }
 
diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c
index c3fd849d4a8f..d4e31e969206 100644
--- a/kernel/trace/trace_printk.c
+++ b/kernel/trace/trace_printk.c
@@ -6,6 +6,7 @@
  *
  */
 #include <linux/seq_file.h>
+#include <linux/security.h>
 #include <linux/uaccess.h>
 #include <linux/kernel.h>
 #include <linux/ftrace.h>
@@ -348,6 +349,12 @@ static const struct seq_operations show_format_seq_ops = {
 static int
 ftrace_formats_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &show_format_seq_ops);
 }
 
diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index ec9a34a97129..4df9a209f7ca 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -5,6 +5,7 @@
  */
 #include <linux/sched/task_stack.h>
 #include <linux/stacktrace.h>
+#include <linux/security.h>
 #include <linux/kallsyms.h>
 #include <linux/seq_file.h>
 #include <linux/spinlock.h>
@@ -470,6 +471,12 @@ static const struct seq_operations stack_trace_seq_ops = {
 
 static int stack_trace_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &stack_trace_seq_ops);
 }
 
@@ -487,6 +494,7 @@ stack_trace_filter_open(struct inode *inode, struct file *file)
 {
 	struct ftrace_ops *ops = inode->i_private;
 
+	/* Checks for tracefs lockdown */
 	return ftrace_regex_open(ops, FTRACE_ITER_FILTER,
 				 inode, file);
 }
diff --git a/kernel/trace/trace_stat.c b/kernel/trace/trace_stat.c
index 75bf1bcb4a8a..9ab0a1a7ad5e 100644
--- a/kernel/trace/trace_stat.c
+++ b/kernel/trace/trace_stat.c
@@ -9,7 +9,7 @@
  *
  */
 
-
+#include <linux/security.h>
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/rbtree.h>
@@ -238,6 +238,10 @@ static int tracing_stat_open(struct inode *inode, struct file *file)
 	struct seq_file *m;
 	struct stat_session *session = inode->i_private;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	ret = stat_seq_init(session);
 	if (ret)
 		return ret;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index dd884341f5c5..352073d36585 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -7,6 +7,7 @@
  */
 #define pr_fmt(fmt)	"trace_uprobe: " fmt
 
+#include <linux/security.h>
 #include <linux/ctype.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
@@ -769,6 +770,10 @@ static int probes_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		ret = dyn_events_release_all(&trace_uprobe_ops);
 		if (ret)
@@ -818,6 +823,12 @@ static const struct seq_operations profile_seq_op = {
 
 static int profile_open(struct inode *inode, struct file *file)
 {
+	int ret;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
 	return seq_open(file, &profile_seq_op);
 }
 

^ permalink raw reply related

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Ben Hutchings @ 2019-10-11 19:50 UTC (permalink / raw)
  To: Steven Rostedt, Linus Torvalds
  Cc: LKML, Matthew Garrett, James Morris James Morris, LSM List,
	Linux API, Al Viro
In-Reply-To: <20191011143610.21bcd9c0@gandalf.local.home>

[-- Attachment #1: Type: text/plain, Size: 724 bytes --]

On Fri, 2019-10-11 at 14:36 -0400, Steven Rostedt wrote:
> On Fri, 11 Oct 2019 11:20:30 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Willing to do that instead?
> 
> Honestly, what you described was my preferred solution ;-)
> 
> I just didn't want to upset the lockdown crowd if a new tracefs file
> was opened without doing this.
> 
> Once locked down is set, can it ever be undone without rebooting?
[...]

Earlier versions of the lockdown patch set added a magic SysRq command
to turn it off.  That's not currently present upstream but there may be
plans to add it.

Ben.

-- 
Ben Hutchings
It is easier to change the specification to fit the program
than vice versa.



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Linus Torvalds @ 2019-10-11 19:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Matthew Garrett, James Morris James Morris, LSM List,
	Linux API, Ben Hutchings, Al Viro
In-Reply-To: <20191011143610.21bcd9c0@gandalf.local.home>

On Fri, Oct 11, 2019 at 11:36 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 11 Oct 2019 11:20:30 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Willing to do that instead?
>
> Honestly, what you described was my preferred solution ;-)
>
> I just didn't want to upset the lockdown crowd if a new tracefs file
> was opened without doing this.

Well, since they introduced a bug in your code that killed your
machine with the patch _they_ did, I don't think they get to complain
when you fix it the way you (and me) want to...

Fair is fair.

             Linus

^ permalink raw reply

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Steven Rostedt @ 2019-10-11 18:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Matthew Garrett, James Morris James Morris, LSM List,
	Linux API, Ben Hutchings, Al Viro
In-Reply-To: <CAHk-=wj7fGPKUspr579Cii-w_y60PtRaiDgKuxVtBAMK0VNNkA@mail.gmail.com>

On Fri, 11 Oct 2019 11:20:30 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Willing to do that instead?

Honestly, what you described was my preferred solution ;-)

I just didn't want to upset the lockdown crowd if a new tracefs file
was opened without doing this.

Once locked down is set, can it ever be undone without rebooting? If
not, a lockdown call could also trigger setting tracing_disabled to 1.
Which is much stronger, as that was the code we added to kill tracing
if anything abnormal was detected (and it does a hard shutdown of all
the tracing utilities).

It's set to one on bootup and cleared, after tracing is initialized.
But it is never cleared again. If lockdown can be enabled at bootup, we
could simply not clear it, and we can have something to allow lockdown
to set it as well.

Currently, the only places tracing_disabled gets set is in the self
tests and if the ring buffer gets corrupted.

-- Steve

^ permalink raw reply

* Re: [PATCH v3 1/2] pidfd: show pids for nested pid namespaces in fdinfo
From: Jann Horn @ 2019-10-11 18:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christian Kellner, kernel list, Linux API, Christian Kellner,
	Andrew Morton, Peter Zijlstra (Intel), Ingo Molnar, Michal Hocko,
	Thomas Gleixner, Elena Reshetova, Roman Gushchin,
	Andrea Arcangeli, Al Viro, Aleksa Sarai, Dmitry V. Levin
In-Reply-To: <20191011165851.gf5i6qw2fwbunshr@wittgenstein>

On Fri, Oct 11, 2019 at 6:58 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> On Fri, Oct 11, 2019 at 05:30:03PM +0200, Jann Horn wrote:
> > On Fri, Oct 11, 2019 at 5:17 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Fri, Oct 11, 2019 at 04:55:59PM +0200, Jann Horn wrote:
> > > > On Fri, Oct 11, 2019 at 2:23 PM Christian Kellner <ckellner@redhat.com> wrote:
> > > > > The fdinfo file for a process file descriptor already contains the
> > > > > pid of the process in the callers namespaces. Additionally, if pid
> > > > > namespaces are configured, show the process ids of the process in
> > > > > all nested namespaces in the same format as in the procfs status
> > > > > file, i.e. "NSPid:\t%d\%d...". This allows the easy identification
> > > > > of the processes in nested namespaces.
> > > > [...]
> > > > >  #ifdef CONFIG_PROC_FS
> > > > > +static inline void print_pidfd_nspid(struct seq_file *m, struct pid *pid,
> > > > > +                                    struct pid_namespace *ns)
> > > >
> > > > `ns` is the namespace of the PID namespace of the procfs instance
> > > > through which the file descriptor is being viewed.
> > > >
> > > > > +{
> > > > > +#ifdef CONFIG_PID_NS
> > > > > +       int i;
> > > > > +
> > > > > +       seq_puts(m, "\nNSpid:");
> > > > > +       for (i = ns->level; i <= pid->level; i++) {
> > > >
> > > > ns->level is the level of the PID namespace associated with the procfs
> > > > instance through which the file descriptor is being viewed. pid->level
> > > > is the level of the PID associated with the pidfd.
> > > >
> > > > > +               ns = pid->numbers[i].ns;
> > > > > +               seq_put_decimal_ull(m, "\t", pid_nr_ns(pid, ns));
> > > > > +       }
> > > > > +#endif
> > > > > +}
> > > >
> > > > I think you assumed that `ns` is always going to contain `pid`.
> > > > However, that's not the case. Consider the following scenario:
> > > >
> > > >  - the init_pid_ns has two child PID namespaces, A and B (each with
> > > > its own mount namespace and procfs instance)
> > > >  - process P1 lives in A
> > > >  - process P2 lives in B
> > > >  - P1 opens a pidfd for itself
> > > >  - P1 passes the pidfd to P2 (e.g. via a unix domain socket)
> > > >  - P2 reads /proc/self/fdinfo/$pidfd
> > > >
> > > > Now the loop will print the ID of P1 in A. I don't think that's what
> > > > you intended? You might want to bail out if "pid_nr_ns(pid, ns) == 0",
> > > > or something like that.
> > >
> > > I assumed the same thing happens when you pass around an fd for
> > > /proc/self/status and that's why I didn't object to this behavior.
> >
> > I don't see how /proc/$pid/status is relevant. In the
> > /proc/$pid/status case, the output is the list of PIDs starting at the
> > PID namespace the procfs is associated with; and the process is always
> > contained in that namespace, which also means that the first PID
> > listed is the one in the PID namespace of the procfs instance. In the
> > pidfd case, the process is not necessarily contained in that
> > namespace, and the output doesn't make sense.
>
> I might be misreading what you're saying.
> (Maybe I'm doing something obviously wrong.)
> If I compile the following two programs:
> b2: https://paste.ubuntu.com/p/xthMsCXy3s/
> c2: https://paste.ubuntu.com/p/y5HSzyMQJr/
>
> Then in shell1
> sudo unshare --mount --pid --fork --mount-proc
>
> and in shell2
> sudo unshare --mount --pid --fork --mount-proc
>
> and run b2 in shell1 and c2 in shell2 which sends around an fd for
> /proc/b2/status to c2. Now c2 reads b2's status file via the fd it
> received. The c2 will see the pid of b2 in b2's pid namespace even
> though the process is not contained in the pid namespace of c2.

Because the reader doesn't matter; the perspective you have on the
system is defined by which pidns the procfs instance you're looking
through is associated with, and here you're looking through shell1's
procfs. It's normal that when you look through another procfs, you see
PIDs differently.

^ permalink raw reply

* Re: [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Linus Torvalds @ 2019-10-11 18:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Matthew Garrett, James Morris James Morris, LSM List,
	Linux API, Ben Hutchings, Al Viro
In-Reply-To: <20191011135458.7399da44@gandalf.local.home>

On Fri, Oct 11, 2019 at 10:55 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I bisected this down to the addition of the proxy_ops into tracefs for
> lockdown. It appears that the allocation of the proxy_ops and then freeing
> it in the destroy_inode callback, is causing havoc with the memory system.
> Reading the documentation about destroy_inode, I'm not sure that this is the
> proper way to handle allocating and then freeing the fops of the inode.

What is happening is that *because* it has a "destroy_inode()"
callback, it now expects destroy_inode to not just free free the proxy
ops, but to also schedule the inode itself for freeing.

Which that tracefs)destroy_inode() doesn't do.

So the inode never gets free'd at all - and you eventually run out of
memory due to the inode leak.

The trivial fix would be to instead use the "free_inode()" callback
(which is called after the required RCU grace period) and then free
the proxy op there _and_ call free_inode_nonrcu() too.

But I think your patch to not need a proxy op allocation at all is
probably better.

That said, I think the _best_ option would be to just getting rid of
the proxy fops _entirely_, and just make the (very few)
tracefs_create_file() cases do that LOCKDOWN_TRACEFS in their open in
the first place.

The proxy_fops was a hacky attempt to make the patch smaller. Your
"just wrap all ops" thing now made the patch bigger than just doing
the lockdown thing in all the callers.

In fact, many of them use tracing_open_generic(). And others - like
subsystem_open() already call tracing_open_generic() as part of their
open.

So from a quick glance, just making tracing_open_generic() do the
lockdown testing will take care of most cases.  Add a few other cases
to fill up the whole set, and your'e done.

Willing to do that instead?

                Linus

^ permalink raw reply

* [PATCH] tracefs: Do not allocate and free proxy_ops for lockdown
From: Steven Rostedt @ 2019-10-11 17:54 UTC (permalink / raw)
  To: LKML
  Cc: Matthew Garrett, jmorris, linux-security-module, linux-api,
	Ben Hutchings, Al Viro, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 7978 bytes --]


[ Attached the reproducers to this email ]

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Running the latest kernel through my "make instances" stress tests, I
triggered the following bug (with KASAN and kmemleak enabled):

mkdir invoked oom-killer:
gfp_mask=0x40cd0(GFP_KERNEL|__GFP_COMP|__GFP_RECLAIMABLE), order=0,
oom_score_adj=0
CPU: 1 PID: 2229 Comm: mkdir Not tainted 5.4.0-rc2-test #325
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
Call Trace:
 dump_stack+0x64/0x8c
 dump_header+0x43/0x3b7
 ? trace_hardirqs_on+0x48/0x4a
 oom_kill_process+0x68/0x2d5
 out_of_memory+0x2aa/0x2d0
 __alloc_pages_nodemask+0x96d/0xb67
 __alloc_pages_node+0x19/0x1e
 alloc_slab_page+0x17/0x45
 new_slab+0xd0/0x234
 ___slab_alloc.constprop.86+0x18f/0x336
 ? alloc_inode+0x2c/0x74
 ? irq_trace+0x12/0x1e
 ? tracer_hardirqs_off+0x1d/0xd7
 ? __slab_alloc.constprop.85+0x21/0x53
 __slab_alloc.constprop.85+0x31/0x53
 ? __slab_alloc.constprop.85+0x31/0x53
 ? alloc_inode+0x2c/0x74
 kmem_cache_alloc+0x50/0x179
 ? alloc_inode+0x2c/0x74
 alloc_inode+0x2c/0x74
 new_inode_pseudo+0xf/0x48
 new_inode+0x15/0x25
 tracefs_get_inode+0x23/0x7c
 ? lookup_one_len+0x54/0x6c
 tracefs_create_file+0x53/0x11d
 trace_create_file+0x15/0x33
 event_create_dir+0x2a3/0x34b
 __trace_add_new_event+0x1c/0x26
 event_trace_add_tracer+0x56/0x86
 trace_array_create+0x13e/0x1e1
 instance_mkdir+0x8/0x17
 tracefs_syscall_mkdir+0x39/0x50
 ? get_dname+0x31/0x31
 vfs_mkdir+0x78/0xa3
 do_mkdirat+0x71/0xb0
 sys_mkdir+0x19/0x1b
 do_fast_syscall_32+0xb0/0xed

I bisected this down to the addition of the proxy_ops into tracefs for
lockdown. It appears that the allocation of the proxy_ops and then freeing
it in the destroy_inode callback, is causing havoc with the memory system.
Reading the documentation about destroy_inode, I'm not sure that this is the
proper way to handle allocating and then freeing the fops of the inode.

Instead of allocating the proxy_ops (and then having to free it), I created
a static proxy_ops. As tracefs only uses a subset of all the file_operations
methods, that subset can be defined in the static proxy_ops, and then the
passed in fops during the creation of the inode is saved in the dentry, and
that is use to call the real functions by the proxy_ops.

Fixes: ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 fs/tracefs/inode.c | 153 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 135 insertions(+), 18 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 9fc14e38927f..d0e8e4a16812 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -20,6 +20,7 @@
 #include <linux/parser.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
+#include <linux/poll.h>
 #include <linux/security.h>
 
 #define TRACEFS_DEFAULT_MODE	0700
@@ -28,7 +29,7 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;
 
-static int default_open_file(struct inode *inode, struct file *filp)
+static int proxy_open(struct inode *inode, struct file *filp)
 {
 	struct dentry *dentry = filp->f_path.dentry;
 	struct file_operations *real_fops;
@@ -47,6 +48,138 @@ static int default_open_file(struct inode *inode, struct file *filp)
 	return real_fops->open(inode, filp);
 }
 
+static ssize_t proxy_read(struct file *file, char __user *buf,
+			  size_t count, loff_t *pos)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct file_operations *real_fops;
+
+	if (!dentry)
+		return -EINVAL;
+
+	real_fops = dentry->d_fsdata;
+
+	if (real_fops->read)
+		return real_fops->read(file, buf, count, pos);
+	else
+		return -EINVAL;
+}
+
+static ssize_t proxy_write(struct file *file, const char __user *p,
+			   size_t count, loff_t *pos)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct file_operations *real_fops;
+
+	if (!dentry)
+		return -EINVAL;
+
+	real_fops = dentry->d_fsdata;
+
+	if (real_fops->write)
+		return real_fops->write(file, p, count, pos);
+	else
+		return -EINVAL;
+}
+
+static loff_t proxy_llseek(struct file *file, loff_t offset, int whence)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct file_operations *real_fops;
+	loff_t (*fn)(struct file *, loff_t, int);
+
+	if (!dentry)
+		return -EINVAL;
+
+	real_fops = dentry->d_fsdata;
+
+	fn = no_llseek;
+	if (file->f_mode & FMODE_LSEEK) {
+		if (real_fops->llseek)
+			fn = real_fops->llseek;
+	}
+	return fn(file, offset, whence);
+}
+
+static int proxy_release(struct inode *inode, struct file *filp)
+{
+	struct dentry *dentry = filp->f_path.dentry;
+	struct file_operations *real_fops;
+
+	if (!dentry)
+		return 0;
+
+	real_fops = dentry->d_fsdata;
+
+	if (real_fops->release)
+		return real_fops->release(inode, filp);
+	return 0;
+}
+
+static __poll_t proxy_poll(struct file *file, struct poll_table_struct *pt)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct file_operations *real_fops;
+
+	if (!dentry)
+		return 0;
+
+	real_fops = dentry->d_fsdata;
+
+	if (unlikely(!real_fops->poll))
+		return DEFAULT_POLLMASK;
+	return real_fops->poll(file, pt);
+}
+
+static ssize_t proxy_splice_read(struct file *in, loff_t *ppos,
+				 struct pipe_inode_info *pipe, size_t len,
+				 unsigned int flags)
+{
+	struct dentry *dentry = in->f_path.dentry;
+	struct file_operations *real_fops;
+	ssize_t (*splice_read)(struct file *, loff_t *,
+			       struct pipe_inode_info *, size_t, unsigned int);
+
+	if (!dentry)
+		return 0;
+
+	real_fops = dentry->d_fsdata;
+
+	if (real_fops->splice_read)
+		splice_read = real_fops->splice_read;
+	else
+		splice_read = generic_file_splice_read;
+
+	return splice_read(in, ppos, pipe, len, flags);
+}
+
+static int proxy_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct file_operations *real_fops;
+
+	if (!dentry)
+		return 0;
+
+	real_fops = dentry->d_fsdata;
+
+	if (!real_fops->mmap)
+		return -ENODEV;
+
+	return real_fops->mmap(file, vma);
+}
+
+static const struct file_operations proxy_fops = {
+	.open		= proxy_open,
+	.read		= proxy_read,
+	.write		= proxy_write,
+	.llseek		= proxy_llseek,
+	.release	= proxy_release,
+	.poll		= proxy_poll,
+	.splice_read	= proxy_splice_read,
+	.mmap		= proxy_mmap,
+};
+
 static ssize_t default_read_file(struct file *file, char __user *buf,
 				 size_t count, loff_t *ppos)
 {
@@ -241,12 +374,6 @@ static int tracefs_apply_options(struct super_block *sb)
 	return 0;
 }
 
-static void tracefs_destroy_inode(struct inode *inode)
-{
-	if (S_ISREG(inode->i_mode))
-		kfree(inode->i_fop);
-}
-
 static int tracefs_remount(struct super_block *sb, int *flags, char *data)
 {
 	int err;
@@ -283,7 +410,6 @@ static int tracefs_show_options(struct seq_file *m, struct dentry *root)
 static const struct super_operations tracefs_super_operations = {
 	.statfs		= simple_statfs,
 	.remount_fs	= tracefs_remount,
-	.destroy_inode  = tracefs_destroy_inode,
 	.show_options	= tracefs_show_options,
 };
 
@@ -414,7 +540,6 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 				   struct dentry *parent, void *data,
 				   const struct file_operations *fops)
 {
-	struct file_operations *proxy_fops;
 	struct dentry *dentry;
 	struct inode *inode;
 
@@ -430,20 +555,12 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	if (unlikely(!inode))
 		return failed_creating(dentry);
 
-	proxy_fops = kzalloc(sizeof(struct file_operations), GFP_KERNEL);
-	if (unlikely(!proxy_fops)) {
-		iput(inode);
-		return failed_creating(dentry);
-	}
-
 	if (!fops)
 		fops = &tracefs_file_operations;
 
 	dentry->d_fsdata = (void *)fops;
-	memcpy(proxy_fops, fops, sizeof(*proxy_fops));
-	proxy_fops->open = default_open_file;
 	inode->i_mode = mode;
-	inode->i_fop = proxy_fops;
+	inode->i_fop = &proxy_fops;
 	inode->i_private = data;
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
-- 
2.20.1


[-- Attachment #2: ftrace-test-mkinstances --]
[-- Type: application/octet-stream, Size: 1000 bytes --]

#!/bin/bash

tracefs=`cat /proc/mounts  |grep tracefs| head -1 | cut -d' ' -f2`

if [ -z "$tracefs" ]; then
	echo "tracefs not mounted"
	exit 0
fi

if [ ! -d $tracefs/instances ]; then
	echo "No instances directory"
	exit 0
fi

cd $tracefs/instances

mkdir x
rmdir x
result=$?

if [ $result -ne 0 ]; then
	echo "instance rmdir not supported, skipping this test"
	exit 0
fi

instance_slam() {
	while :; do
		mkdir x
		mkdir y
		mkdir z
		rmdir x
		rmdir y
		rmdir z
	done 2>/dev/null
}

instance_slam &
p1=$!
echo $p1

instance_slam &
p2=$!
echo $p2

instance_slam &
p3=$!
echo $p3

instance_slam &
p4=$!
echo $p4

instance_slam &
p5=$!
echo $p5

for i in `seq 10`; do
	ls
	sleep 1
done
kill -1 $p1 
kill -1 $p2
kill -1 $p3
kill -1 $p4
kill -1 $p5

echo "Wait for processes to finish"
wait $p1 $p2 $p3 $p4 $p5
echo "all processes finished, wait for cleanup"
sleep 2

mkdir x y z
ls x y z
rmdir x y z
for d in x y z; do
	if [ -d $d ]; then
		echo $d still exists
		exit -1
	fi
done
echo SUCCESS
exit 0

[-- Attachment #3: ftrace-test-mkinstances-2 --]
[-- Type: application/octet-stream, Size: 890 bytes --]

#!/bin/bash

tracefs=`cat /proc/mounts  |grep tracefs| head -1 | cut -d' ' -f2`

if [ -z "$tracefs" ]; then
	echo "tracefs not mounted"
	exit 0
fi

if [ ! -d $tracefs/instances ]; then
	echo "No instances directory"
	exit 0
fi

cd $tracefs/instances

instance_slam() {
	while :; do
		mkdir foo &> /dev/null
		rmdir foo &> /dev/null
	done
}

instance_read() {
	while :; do
		cat foo/trace &> /dev/null
	done
}

instance_set() {
	while :; do
		echo 1 > foo/events/sched/sched_switch
	done 2> /dev/null
}

instance_slam &
x=`jobs -l`
p1=`echo $x | cut -d' ' -f2`
echo $p1

instance_set &
x=`jobs -l | tail -1`
p2=`echo $x | cut -d' ' -f2`
echo $p2

sleep 10

kill -1 $p1 
kill -1 $p2

echo "Wait for processes to finish"
wait $p1 $p2
echo "all processes finished, wait for cleanup"
sleep 2

mkdir foo
ls foo
rmdir foo
if [ -d foo ]; then
	echo foo still exists
	exit -1
fi
echo SUCCESS
exit 0

^ permalink raw reply related

* Re: [PATCH v2 2/2] pidfd: add tests for NSpid info in fdinfo
From: Christian Brauner @ 2019-10-11 17:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Kellner, kernel list, Linux API, Christian Kellner,
	Shuah Khan, Andrew Morton, Peter Zijlstra (Intel), Ingo Molnar,
	Michal Hocko, Thomas Gleixner, Elena Reshetova, Roman Gushchin,
	Andrea Arcangeli, Joel Fernandes (Google), Al Viro,
	Dmitry V. Levin, open list:KERNEL SELFTEST FRAMEWORK
In-Reply-To: <CAG48ez0MyiTKO2MpNVQqavoTKo7FZXYAyohx1JTR=M9Uw=QJWQ@mail.gmail.com>

On Fri, Oct 11, 2019 at 05:09:29PM +0200, Jann Horn wrote:
> On Wed, Oct 9, 2019 at 6:10 PM Christian Kellner <ckellner@redhat.com> wrote:
> > Add tests that check that if pid namespaces are configured the fdinfo
> > file of a pidfd contains an NSpid: entry containing the process id
> > in the current and additionally all nested namespaces.
> [...]
> > +static int compare_fdinfo_nspid(int pidfd, char *expect, size_t len)
> > +{
> > +       char path[512];
> > +       FILE *f;
> > +       size_t n = 0;
> > +       ssize_t k;
> > +       char *line = NULL;
> > +       int r = -1;
> > +
> > +       snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
> 
> (Maybe at some point the selftests code should add some more concise
> alternative to snprintf() calls on separate lines. A macro or
> something like that so that you can write stuff like `f =
> fopen(tprintf("/proc/self/fdinfo/%d", pidfd), "re")`.)
> 
> > +       f = fopen(path, "re");
> > +       if (!f)
> > +               return -1;
> > +
> > +       while ((k = getline(&line, &n, f)) != -1) {
> > +               if (strncmp(line, "NSpid:", 6))
> > +                       continue;
> > +
> > +               line[k - 1] = '\0';
> > +               ksft_print_msg("Child: fdinfo NSpid line: '%s'.\n", line);
> > +               r = strncmp(line + 6, expect, len);
> 
> Wouldn't it be better to get rid of the nullbyte assignment and change
> the strncmp() into a strcmp() here...
> 
> [...]
> > +       /* The child will have pid 1 in the new pid namespace,
> > +        * so the line must be 'NSPid:\t<pid>\t1'
> > +        */
> > +       n = snprintf(expect, sizeof(expect), "\t%d\t%d", pid, 1);
> 
> ... and add a "\n" to the format string? It's shorter and doesn't
> silently ignore it if the line doesn't end at that point.

Also, what Christian just told me and what I wanted to suggest is that
we add tests for sending around pidfds and reading fdinfo too.

^ permalink raw reply

* Re: [PATCH v3 1/2] pidfd: show pids for nested pid namespaces in fdinfo
From: Christian Brauner @ 2019-10-11 16:58 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Kellner, kernel list, Linux API, Christian Kellner,
	Andrew Morton, Peter Zijlstra (Intel), Ingo Molnar, Michal Hocko,
	Thomas Gleixner, Elena Reshetova, Roman Gushchin,
	Andrea Arcangeli, Al Viro, Aleksa Sarai, Dmitry V. Levin
In-Reply-To: <CAG48ez2_YX0eXQP_YP2Ya20AxRGg9uGFjjkuSBxAV=hxvM9VZw@mail.gmail.com>

On Fri, Oct 11, 2019 at 05:30:03PM +0200, Jann Horn wrote:
> On Fri, Oct 11, 2019 at 5:17 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Fri, Oct 11, 2019 at 04:55:59PM +0200, Jann Horn wrote:
> > > On Fri, Oct 11, 2019 at 2:23 PM Christian Kellner <ckellner@redhat.com> wrote:
> > > > The fdinfo file for a process file descriptor already contains the
> > > > pid of the process in the callers namespaces. Additionally, if pid
> > > > namespaces are configured, show the process ids of the process in
> > > > all nested namespaces in the same format as in the procfs status
> > > > file, i.e. "NSPid:\t%d\%d...". This allows the easy identification
> > > > of the processes in nested namespaces.
> > > [...]
> > > >  #ifdef CONFIG_PROC_FS
> > > > +static inline void print_pidfd_nspid(struct seq_file *m, struct pid *pid,
> > > > +                                    struct pid_namespace *ns)
> > >
> > > `ns` is the namespace of the PID namespace of the procfs instance
> > > through which the file descriptor is being viewed.
> > >
> > > > +{
> > > > +#ifdef CONFIG_PID_NS
> > > > +       int i;
> > > > +
> > > > +       seq_puts(m, "\nNSpid:");
> > > > +       for (i = ns->level; i <= pid->level; i++) {
> > >
> > > ns->level is the level of the PID namespace associated with the procfs
> > > instance through which the file descriptor is being viewed. pid->level
> > > is the level of the PID associated with the pidfd.
> > >
> > > > +               ns = pid->numbers[i].ns;
> > > > +               seq_put_decimal_ull(m, "\t", pid_nr_ns(pid, ns));
> > > > +       }
> > > > +#endif
> > > > +}
> > >
> > > I think you assumed that `ns` is always going to contain `pid`.
> > > However, that's not the case. Consider the following scenario:
> > >
> > >  - the init_pid_ns has two child PID namespaces, A and B (each with
> > > its own mount namespace and procfs instance)
> > >  - process P1 lives in A
> > >  - process P2 lives in B
> > >  - P1 opens a pidfd for itself
> > >  - P1 passes the pidfd to P2 (e.g. via a unix domain socket)
> > >  - P2 reads /proc/self/fdinfo/$pidfd
> > >
> > > Now the loop will print the ID of P1 in A. I don't think that's what
> > > you intended? You might want to bail out if "pid_nr_ns(pid, ns) == 0",
> > > or something like that.
> >
> > I assumed the same thing happens when you pass around an fd for
> > /proc/self/status and that's why I didn't object to this behavior.
> 
> I don't see how /proc/$pid/status is relevant. In the
> /proc/$pid/status case, the output is the list of PIDs starting at the
> PID namespace the procfs is associated with; and the process is always
> contained in that namespace, which also means that the first PID
> listed is the one in the PID namespace of the procfs instance. In the
> pidfd case, the process is not necessarily contained in that
> namespace, and the output doesn't make sense.

I might be misreading what you're saying.
(Maybe I'm doing something obviously wrong.)
If I compile the following two programs:
b2: https://paste.ubuntu.com/p/xthMsCXy3s/
c2: https://paste.ubuntu.com/p/y5HSzyMQJr/

Then in shell1
sudo unshare --mount --pid --fork --mount-proc

and in shell2
sudo unshare --mount --pid --fork --mount-proc

and run b2 in shell1 and c2 in shell2 which sends around an fd for
/proc/b2/status to c2. Now c2 reads b2's status file via the fd it
received. The c2 will see the pid of b2 in b2's pid namespace even
though the process is not contained in the pid namespace of c2.

Christian

^ permalink raw reply

* Re: [PATCH v3 1/2] pidfd: show pids for nested pid namespaces in fdinfo
From: Jann Horn @ 2019-10-11 15:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christian Kellner, Christian Brauner, kernel list, Linux API,
	Christian Kellner, Andrew Morton, Peter Zijlstra (Intel),
	Ingo Molnar, Michal Hocko, Thomas Gleixner, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Aleksa Sarai,
	Dmitry V. Levin
In-Reply-To: <20191011151700.hdvztoxonpvogadv@wittgenstein>

On Fri, Oct 11, 2019 at 5:17 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Fri, Oct 11, 2019 at 04:55:59PM +0200, Jann Horn wrote:
> > On Fri, Oct 11, 2019 at 2:23 PM Christian Kellner <ckellner@redhat.com> wrote:
> > > The fdinfo file for a process file descriptor already contains the
> > > pid of the process in the callers namespaces. Additionally, if pid
> > > namespaces are configured, show the process ids of the process in
> > > all nested namespaces in the same format as in the procfs status
> > > file, i.e. "NSPid:\t%d\%d...". This allows the easy identification
> > > of the processes in nested namespaces.
> > [...]
> > >  #ifdef CONFIG_PROC_FS
> > > +static inline void print_pidfd_nspid(struct seq_file *m, struct pid *pid,
> > > +                                    struct pid_namespace *ns)
> >
> > `ns` is the namespace of the PID namespace of the procfs instance
> > through which the file descriptor is being viewed.
> >
> > > +{
> > > +#ifdef CONFIG_PID_NS
> > > +       int i;
> > > +
> > > +       seq_puts(m, "\nNSpid:");
> > > +       for (i = ns->level; i <= pid->level; i++) {
> >
> > ns->level is the level of the PID namespace associated with the procfs
> > instance through which the file descriptor is being viewed. pid->level
> > is the level of the PID associated with the pidfd.
> >
> > > +               ns = pid->numbers[i].ns;
> > > +               seq_put_decimal_ull(m, "\t", pid_nr_ns(pid, ns));
> > > +       }
> > > +#endif
> > > +}
> >
> > I think you assumed that `ns` is always going to contain `pid`.
> > However, that's not the case. Consider the following scenario:
> >
> >  - the init_pid_ns has two child PID namespaces, A and B (each with
> > its own mount namespace and procfs instance)
> >  - process P1 lives in A
> >  - process P2 lives in B
> >  - P1 opens a pidfd for itself
> >  - P1 passes the pidfd to P2 (e.g. via a unix domain socket)
> >  - P2 reads /proc/self/fdinfo/$pidfd
> >
> > Now the loop will print the ID of P1 in A. I don't think that's what
> > you intended? You might want to bail out if "pid_nr_ns(pid, ns) == 0",
> > or something like that.
>
> I assumed the same thing happens when you pass around an fd for
> /proc/self/status and that's why I didn't object to this behavior.

I don't see how /proc/$pid/status is relevant. In the
/proc/$pid/status case, the output is the list of PIDs starting at the
PID namespace the procfs is associated with; and the process is always
contained in that namespace, which also means that the first PID
listed is the one in the PID namespace of the procfs instance. In the
pidfd case, the process is not necessarily contained in that
namespace, and the output doesn't make sense.

^ permalink raw reply

* Re: [PATCH v3 1/2] pidfd: show pids for nested pid namespaces in fdinfo
From: Christian Brauner @ 2019-10-11 15:17 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Kellner, Christian Brauner, kernel list, Linux API,
	Christian Kellner, Andrew Morton, Peter Zijlstra (Intel),
	Ingo Molnar, Michal Hocko, Thomas Gleixner, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Aleksa Sarai,
	Dmitry V. Levin
In-Reply-To: <CAG48ez1xNonmxwa3DRD44WJiComOHRxdHud5+LWea3OXzr4hkg@mail.gmail.com>

On Fri, Oct 11, 2019 at 04:55:59PM +0200, Jann Horn wrote:
> On Fri, Oct 11, 2019 at 2:23 PM Christian Kellner <ckellner@redhat.com> wrote:
> > The fdinfo file for a process file descriptor already contains the
> > pid of the process in the callers namespaces. Additionally, if pid
> > namespaces are configured, show the process ids of the process in
> > all nested namespaces in the same format as in the procfs status
> > file, i.e. "NSPid:\t%d\%d...". This allows the easy identification
> > of the processes in nested namespaces.
> [...]
> >  #ifdef CONFIG_PROC_FS
> > +static inline void print_pidfd_nspid(struct seq_file *m, struct pid *pid,
> > +                                    struct pid_namespace *ns)
> 
> `ns` is the namespace of the PID namespace of the procfs instance
> through which the file descriptor is being viewed.
> 
> > +{
> > +#ifdef CONFIG_PID_NS
> > +       int i;
> > +
> > +       seq_puts(m, "\nNSpid:");
> > +       for (i = ns->level; i <= pid->level; i++) {
> 
> ns->level is the level of the PID namespace associated with the procfs
> instance through which the file descriptor is being viewed. pid->level
> is the level of the PID associated with the pidfd.
> 
> > +               ns = pid->numbers[i].ns;
> > +               seq_put_decimal_ull(m, "\t", pid_nr_ns(pid, ns));
> > +       }
> > +#endif
> > +}
> 
> I think you assumed that `ns` is always going to contain `pid`.
> However, that's not the case. Consider the following scenario:
> 
>  - the init_pid_ns has two child PID namespaces, A and B (each with
> its own mount namespace and procfs instance)
>  - process P1 lives in A
>  - process P2 lives in B
>  - P1 opens a pidfd for itself
>  - P1 passes the pidfd to P2 (e.g. via a unix domain socket)
>  - P2 reads /proc/self/fdinfo/$pidfd
> 
> Now the loop will print the ID of P1 in A. I don't think that's what
> you intended? You might want to bail out if "pid_nr_ns(pid, ns) == 0",
> or something like that.

I assumed the same thing happens when you pass around an fd for
/proc/self/status and that's why I didn't object to this behavior.

Christian

^ permalink raw reply

* Re: [PATCH v2 2/2] pidfd: add tests for NSpid info in fdinfo
From: Jann Horn @ 2019-10-11 15:09 UTC (permalink / raw)
  To: Christian Kellner
  Cc: kernel list, Linux API, Christian Kellner, Christian Brauner,
	Shuah Khan, Andrew Morton, Peter Zijlstra (Intel), Ingo Molnar,
	Michal Hocko, Thomas Gleixner, Elena Reshetova, Roman Gushchin,
	Andrea Arcangeli, Joel Fernandes (Google), Al Viro,
	Dmitry V. Levin, open list:KERNEL SELFTEST FRAMEWORK
In-Reply-To: <20191009160532.20674-2-ckellner@redhat.com>

On Wed, Oct 9, 2019 at 6:10 PM Christian Kellner <ckellner@redhat.com> wrote:
> Add tests that check that if pid namespaces are configured the fdinfo
> file of a pidfd contains an NSpid: entry containing the process id
> in the current and additionally all nested namespaces.
[...]
> +static int compare_fdinfo_nspid(int pidfd, char *expect, size_t len)
> +{
> +       char path[512];
> +       FILE *f;
> +       size_t n = 0;
> +       ssize_t k;
> +       char *line = NULL;
> +       int r = -1;
> +
> +       snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);

(Maybe at some point the selftests code should add some more concise
alternative to snprintf() calls on separate lines. A macro or
something like that so that you can write stuff like `f =
fopen(tprintf("/proc/self/fdinfo/%d", pidfd), "re")`.)

> +       f = fopen(path, "re");
> +       if (!f)
> +               return -1;
> +
> +       while ((k = getline(&line, &n, f)) != -1) {
> +               if (strncmp(line, "NSpid:", 6))
> +                       continue;
> +
> +               line[k - 1] = '\0';
> +               ksft_print_msg("Child: fdinfo NSpid line: '%s'.\n", line);
> +               r = strncmp(line + 6, expect, len);

Wouldn't it be better to get rid of the nullbyte assignment and change
the strncmp() into a strcmp() here...

[...]
> +       /* The child will have pid 1 in the new pid namespace,
> +        * so the line must be 'NSPid:\t<pid>\t1'
> +        */
> +       n = snprintf(expect, sizeof(expect), "\t%d\t%d", pid, 1);

... and add a "\n" to the format string? It's shorter and doesn't
silently ignore it if the line doesn't end at that point.

^ permalink raw reply

* Re: [PATCH v3 1/2] pidfd: show pids for nested pid namespaces in fdinfo
From: Jann Horn @ 2019-10-11 14:55 UTC (permalink / raw)
  To: Christian Kellner, Christian Brauner
  Cc: kernel list, Linux API, Christian Kellner, Andrew Morton,
	Peter Zijlstra (Intel), Ingo Molnar, Michal Hocko,
	Thomas Gleixner, Elena Reshetova, Roman Gushchin,
	Andrea Arcangeli, Al Viro, Aleksa Sarai, Dmitry V. Levin
In-Reply-To: <20191011122323.7770-1-ckellner@redhat.com>

On Fri, Oct 11, 2019 at 2:23 PM Christian Kellner <ckellner@redhat.com> wrote:
> The fdinfo file for a process file descriptor already contains the
> pid of the process in the callers namespaces. Additionally, if pid
> namespaces are configured, show the process ids of the process in
> all nested namespaces in the same format as in the procfs status
> file, i.e. "NSPid:\t%d\%d...". This allows the easy identification
> of the processes in nested namespaces.
[...]
>  #ifdef CONFIG_PROC_FS
> +static inline void print_pidfd_nspid(struct seq_file *m, struct pid *pid,
> +                                    struct pid_namespace *ns)

`ns` is the namespace of the PID namespace of the procfs instance
through which the file descriptor is being viewed.

> +{
> +#ifdef CONFIG_PID_NS
> +       int i;
> +
> +       seq_puts(m, "\nNSpid:");
> +       for (i = ns->level; i <= pid->level; i++) {

ns->level is the level of the PID namespace associated with the procfs
instance through which the file descriptor is being viewed. pid->level
is the level of the PID associated with the pidfd.

> +               ns = pid->numbers[i].ns;
> +               seq_put_decimal_ull(m, "\t", pid_nr_ns(pid, ns));
> +       }
> +#endif
> +}

I think you assumed that `ns` is always going to contain `pid`.
However, that's not the case. Consider the following scenario:

 - the init_pid_ns has two child PID namespaces, A and B (each with
its own mount namespace and procfs instance)
 - process P1 lives in A
 - process P2 lives in B
 - P1 opens a pidfd for itself
 - P1 passes the pidfd to P2 (e.g. via a unix domain socket)
 - P2 reads /proc/self/fdinfo/$pidfd

Now the loop will print the ID of P1 in A. I don't think that's what
you intended? You might want to bail out if "pid_nr_ns(pid, ns) == 0",
or something like that.

^ permalink raw reply

* Re: [PATCH v3 2/2] pidfd: add tests for NSpid info in fdinfo
From: Christian Brauner @ 2019-10-11 13:18 UTC (permalink / raw)
  To: Christian Kellner
  Cc: linux-kernel, linux-api, Christian Kellner, Shuah Khan,
	Andrew Morton, Peter Zijlstra (Intel), Ingo Molnar, Michal Hocko,
	Thomas Gleixner, Elena Reshetova, Roman Gushchin,
	Andrea Arcangeli, Al Viro, Aleksa Sarai, Dmitry V. Levin,
	linux-kselftest
In-Reply-To: <20191011122323.7770-2-ckellner@redhat.com>

On Fri, Oct 11, 2019 at 02:23:21PM +0200, Christian Kellner wrote:
> From: Christian Kellner <christian@kellner.me>
> 
> Add tests that check that if pid namespaces are configured the fdinfo
> file of a pidfd contains an NSpid: entry containing the process id
> in the current and additionally all nested namespaces.
> 
> Signed-off-by: Christian Kellner <christian@kellner.me>

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

Shuah, can I get an Ack for you from this. If you have no objections I'd
queue up this patchset for the 5.5 merge window.

Thanks!
Christian

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox