From: Steven Rostedt <rostedt@goodmis.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
Masami Hiramatsu <mhiramat@kernel.org>,
linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Subject: Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
Date: Wed, 31 Jan 2024 07:57:40 -0500 [thread overview]
Message-ID: <20240131075740.660e7634@rorschach.local.home> (raw)
In-Reply-To: <CAHk-=whSse54d+X361K2Uw6jO2SvrO-U0LHLBTLnqHaA+406Fw@mail.gmail.com>
On Tue, 30 Jan 2024 22:20:24 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 30 Jan 2024 at 21:57, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Ugh.
>
> Oh, and double-ugh on that tracefs_syscall_mkdir/rmdir(). I hate how
> it does that "unlock and re-lock inode" thing.
I'd figured you'd like that one.
>
> It's a disease, and no, it's not an acceptable response to "lockdep
> shows there's a locking problem".
>
> The comment says "It is up to the individual mkdir routine to handle
> races" but that's *completely* bogus and shows a fundamental
> misunderstanding of locking.
>
> Dropping the lock in the middle of a locked section does NOT affect
> just the section that you dropped the lock around.
>
> It affects the *caller*, who took the lock in the first place, and who
> may well assume that the lock protects things for the whole duration
> of the lock.
>
> And so dropping the lock in the middle of an operation is a bad BAD
> *BAD* thing to do.
>
> Honestly, I had only been looking at the eventfs_inode.c lookup code.
> But now that I started looking at mkdir/rmdir, I'm seeing the same
> signs of horrible mistakes there too.
>
> And yes, it might be a real problem. For example, for the rmdir case,
> the actual lock was taken by 'vfs_rmdir()', and it does *not* protect
> only the ->rmdir() call itself.
>
> It also, for example, is supposed to make the ->rmdir be atomic wrt things like
>
> dentry->d_inode->i_flags |= S_DEAD;
>
> and
>
> dont_mount(dentry);
> detach_mounts(dentry);
>
> but now because the inode lock was dropped in the middle, for all we
> know a "mkdir()" could come in on the same name, and make a mockery of
> the whole inode locking.
>
> The inode lock on that directory that is getting removed is also what
> is supposed to make it impossible to add new files to the directory
> while the rmdir is going on.
>
> Again, dropping the lock violates those kinds of guarantees.
>
> "git blame" actually fingers Al for that "unlock and relock", but
> that's because Al did a search-and-replace on
> "mutex_[un]lock(&inode->i_mutex)" and replaced it with
> "inode_[un]lock(inode)" back in 2016.
>
> The real culprit is you, and that sh*t-show goes back to commit
> 277ba04461c2 ("tracing: Add interface to allow multiple trace
> buffers").
>
> Christ. Now I guess I need to start looking if there is anything else
> horrid lurking in that mkdir/rmdir path.
>
> I doubt the inode locking problem ends up mattering in this situation.
> Mostly since this is only tracefs, and normal users shouldn't be able
> to access these things anyway. And I think (hope?) that you only
> accept mkdir/rmdir in specific places.
Yes, this was very deliberate. It was for a very "special" feature, and
thus very limited.
>
> But this really is another case of "This code smells *fundamentally* wrong".
>
> Adding Al, in the hopes that he will take a look at this too.
This is something I asked Al about when I wrote it. This isn't a normal
rmdir. I remember telling Al what this was doing. Basically, it's just
a way to tell the tracing infrastructure to create a new instance. It
doesn't actually create a normal directory. It's similar to the
kprobe_events interface, where writing to a file would create
directories and files. Ideally I wanted a mkdir interface as it felt
more realistic, and I was ready to have another "echo foo > make_instance"
if this didn't work. But after talking with Al, I felt that it could
work. The main issue is that mkdir doesn't just make a directory, it
creates the entire tree (like what is done in /sys/fs/cgroup). If this
was more like kernfs instead of debugfs, I may not have had this
problem. That was another time I tried to understand how krenfs worked.
This is the reason all the opens in the tracing code has:
struct trace_array *tr = inode->i_private;
int ret;
ret = tracing_check_open_get_tr(tr);
if (ret)
return ret;
In kernel/trace/trace.c:
LIST_HEAD(ftrace_trace_arrays);
int trace_array_get(struct trace_array *this_tr)
{
struct trace_array *tr;
int ret = -ENODEV;
mutex_lock(&trace_types_lock);
list_for_each_entry(tr, &ftrace_trace_arrays, list) {
if (tr == this_tr) {
tr->ref++;
ret = 0;
break;
}
}
mutex_unlock(&trace_types_lock);
return ret;
}
static void __trace_array_put(struct trace_array *this_tr)
{
WARN_ON(!this_tr->ref);
this_tr->ref--;
}
void trace_array_put(struct trace_array *this_tr)
{
if (!this_tr)
return;
mutex_lock(&trace_types_lock);
__trace_array_put(this_tr);
mutex_unlock(&trace_types_lock);
}
int tracing_check_open_get_tr(struct trace_array *tr)
{
int ret;
ret = security_locked_down(LOCKDOWN_TRACEFS);
if (ret)
return ret;
if (tracing_disabled)
return -ENODEV;
if (tr && trace_array_get(tr) < 0)
return -ENODEV;
return 0;
}
The rmdir code that tracefs calls is also in kernel/trace/trace.c:
static int instance_rmdir(const char *name)
{
struct trace_array *tr;
int ret;
mutex_lock(&event_mutex);
mutex_lock(&trace_types_lock);
ret = -ENODEV;
tr = trace_array_find(name);
if (tr)
ret = __remove_instance(tr);
mutex_unlock(&trace_types_lock);
mutex_unlock(&event_mutex);
return ret;
}
The mkdir creates a new trace_array (tr) and rmdir destroys it. If
there's any reference on a trace array the rmdir fails. On every open,
a check is made to see if the trace array that was added to the inode
still exists, and if it does, it ups its ref count to prevent a rmdir
from happening.
It's a very limited mkdir and rmdir. I remember Al asking me questions
about what it can and can't do, and me telling him to make sure the
lock release wasn't going to cause problems.
I wrote several fuzzers on this code that perform mkdir and rmdir on the
instances while other tasks are trying to access them (reading and
writing). In the beginning, it found a few bugs. But I was finally able
to get my fuzzers to work non-stop for over a month and that's when I
felt that this is fine.
I was always weary of this code because of the locking situation. The
kernel selftests has a stress test on this code too.
tools/testing/selftests/ftrace/test.d/instances/instance-event.tc
tools/testing/selftests/ftrace/test.d/instances/instance.tc
Which is run as part of the selftest suite, which is run by most people
testing the kernel, including KernelCI.
I haven't had a problem with this code in years, and unless we find a
real bug that needs to be fixed, I'm afraid to touch it.
-- Steve
next prev parent reply other threads:[~2024-01-31 12:57 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-30 19:03 [PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily Linus Torvalds
2024-01-30 19:03 ` [PATCH 2/6] eventfsfs: initialize the tracefs inode properly Linus Torvalds
2024-01-30 19:48 ` Steven Rostedt
2024-01-30 19:56 ` Steven Rostedt
2024-01-30 19:49 ` Steven Rostedt
2024-01-30 19:03 ` [PATCH 3/6] tracefs: dentry lookup crapectomy Linus Torvalds
2024-01-30 23:26 ` Al Viro
2024-01-30 23:55 ` Steven Rostedt
2024-01-31 0:07 ` Al Viro
2024-01-31 0:23 ` Al Viro
2024-01-31 0:39 ` Linus Torvalds
2024-01-30 19:03 ` [PATCH 4/6] eventfs: remove unused 'd_parent' pointer field Linus Torvalds
2024-01-30 19:03 ` [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts Linus Torvalds
2024-01-30 20:55 ` Steven Rostedt
2024-01-30 21:52 ` Linus Torvalds
2024-01-30 22:56 ` Steven Rostedt
2024-01-30 22:58 ` Linus Torvalds
2024-01-30 23:04 ` Steven Rostedt
2024-01-30 22:56 ` Linus Torvalds
2024-01-30 23:06 ` Linus Torvalds
2024-01-30 23:10 ` Steven Rostedt
2024-01-30 23:25 ` Linus Torvalds
2024-01-31 0:48 ` Al Viro
2024-01-31 5:09 ` Steven Rostedt
2024-01-31 5:25 ` Linus Torvalds
2024-01-31 5:33 ` Steven Rostedt
2024-01-31 5:57 ` Linus Torvalds
2024-01-31 6:20 ` Linus Torvalds
2024-01-31 12:57 ` Steven Rostedt [this message]
2024-01-31 13:14 ` Steven Rostedt
2024-01-31 18:08 ` Linus Torvalds
2024-01-31 18:39 ` Steven Rostedt
2024-01-30 19:03 ` [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function Linus Torvalds
2024-01-30 21:25 ` Steven Rostedt
2024-01-30 21:36 ` Linus Torvalds
2024-01-31 1:12 ` Al Viro
2024-01-31 2:37 ` Linus Torvalds
2024-01-31 2:46 ` Al Viro
2024-01-31 3:39 ` Linus Torvalds
2024-01-31 4:28 ` Al Viro
2024-01-31 18:00 ` Steven Rostedt
2024-01-31 23:07 ` Masami Hiramatsu
2024-01-31 23:23 ` Steven Rostedt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240131075740.660e7634@rorschach.local.home \
--to=rostedt@goodmis.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=mhiramat@kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.