public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Steven Rostedt <rostedt@goodmis.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Subject: Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
Date: Wed, 31 Jan 2024 07:57:40 -0500	[thread overview]
Message-ID: <20240131075740.660e7634@rorschach.local.home> (raw)
In-Reply-To: <CAHk-=whSse54d+X361K2Uw6jO2SvrO-U0LHLBTLnqHaA+406Fw@mail.gmail.com>

On Tue, 30 Jan 2024 22:20:24 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 30 Jan 2024 at 21:57, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Ugh.  
> 
> Oh, and double-ugh on that tracefs_syscall_mkdir/rmdir(). I hate how
> it does that "unlock and re-lock inode" thing.

I'd figured you'd like that one.

> 
> It's a disease, and no, it's not an acceptable response to "lockdep
> shows there's a locking problem".
> 
> The comment says "It is up to the individual mkdir routine to handle
> races" but that's *completely* bogus and shows a fundamental
> misunderstanding of locking.
> 
> Dropping the lock in the middle of a locked section does NOT affect
> just the section that you dropped the lock around.
> 
> It affects the *caller*, who took the lock in the first place, and who
> may well assume that the lock protects things for the whole duration
> of the lock.
> 
> And so dropping the lock in the middle of an operation is a bad BAD
> *BAD* thing to do.
> 
> Honestly, I had only been looking at the eventfs_inode.c lookup code.
> But now that I started looking at mkdir/rmdir, I'm seeing the same
> signs of horrible mistakes there too.
> 
> And yes, it might be a real problem. For example, for the rmdir case,
> the actual lock was taken by 'vfs_rmdir()', and it does *not* protect
> only the ->rmdir() call itself.
> 
> It also, for example, is supposed to make the ->rmdir be atomic wrt things like
> 
>         dentry->d_inode->i_flags |= S_DEAD;
> 
> and
> 
>         dont_mount(dentry);
>         detach_mounts(dentry);
> 
> but now because the inode lock was dropped in the middle, for all we
> know a "mkdir()" could come in on the same name, and make a mockery of
> the whole inode locking.
> 
> The inode lock on that directory that is getting removed is also what
> is supposed to make it impossible to add new files to the directory
> while the rmdir is going on.
> 
> Again, dropping the lock violates those kinds of guarantees.
> 
> "git blame" actually fingers Al for that "unlock and relock", but
> that's because Al did a search-and-replace on
> "mutex_[un]lock(&inode->i_mutex)" and replaced it with
> "inode_[un]lock(inode)" back in 2016.
> 
> The real culprit is you, and that sh*t-show goes back to commit
> 277ba04461c2 ("tracing: Add interface to allow multiple trace
> buffers").
> 
> Christ. Now I guess I need to start looking if there is anything else
> horrid lurking in that mkdir/rmdir path.
> 
> I doubt the inode locking problem ends up mattering in this situation.
> Mostly since this is only tracefs, and normal users shouldn't be able
> to access these things anyway. And I think (hope?) that you only
> accept mkdir/rmdir in specific places.

Yes, this was very deliberate. It was for a very "special" feature, and
thus very limited.

> 
> But this really is another case of "This code smells *fundamentally* wrong".
> 
> Adding Al, in the hopes that he will take a look at this too.

This is something I asked Al about when I wrote it. This isn't a normal
rmdir. I remember telling Al what this was doing. Basically, it's just
a way to tell the tracing infrastructure to create a new instance. It
doesn't actually create a normal directory. It's similar to the
kprobe_events interface, where writing to a file would create
directories and files. Ideally I wanted a mkdir interface as it felt
more realistic, and I was ready to have another "echo foo > make_instance"
if this didn't work. But after talking with Al, I felt that it could
work. The main issue is that mkdir doesn't just make a directory, it
creates the entire tree (like what is done in /sys/fs/cgroup). If this
was more like kernfs instead of debugfs, I may not have had this
problem. That was another time I tried to understand how krenfs worked.

This is the reason all the opens in the tracing code has:

        struct trace_array *tr = inode->i_private;
        int ret;

        ret = tracing_check_open_get_tr(tr);
	if (ret)
		return ret;


In kernel/trace/trace.c:

LIST_HEAD(ftrace_trace_arrays);

int trace_array_get(struct trace_array *this_tr)
{
        struct trace_array *tr;
        int ret = -ENODEV;
        
        mutex_lock(&trace_types_lock);
        list_for_each_entry(tr, &ftrace_trace_arrays, list) {
                if (tr == this_tr) {
                        tr->ref++;
                        ret = 0;
                        break;
                }
        }
        mutex_unlock(&trace_types_lock);
                
        return ret;
}       

static void __trace_array_put(struct trace_array *this_tr)
{
        WARN_ON(!this_tr->ref);
        this_tr->ref--;       
}

void trace_array_put(struct trace_array *this_tr)
{
        if (!this_tr)
                return;

        mutex_lock(&trace_types_lock);
        __trace_array_put(this_tr);
        mutex_unlock(&trace_types_lock);
}

int tracing_check_open_get_tr(struct trace_array *tr)
{
        int ret;

        ret = security_locked_down(LOCKDOWN_TRACEFS);
        if (ret)
                return ret;

        if (tracing_disabled)
                return -ENODEV;

        if (tr && trace_array_get(tr) < 0)
                return -ENODEV;

        return 0;
}

The rmdir code that tracefs calls is also in kernel/trace/trace.c:

static int instance_rmdir(const char *name)
{
        struct trace_array *tr;
        int ret;

        mutex_lock(&event_mutex);
        mutex_lock(&trace_types_lock);

        ret = -ENODEV;
        tr = trace_array_find(name);
        if (tr)
                ret = __remove_instance(tr);

        mutex_unlock(&trace_types_lock);
        mutex_unlock(&event_mutex);

        return ret;
}

The mkdir creates a new trace_array (tr) and rmdir destroys it. If
there's any reference on a trace array the rmdir fails. On every open,
a check is made to see if the trace array that was added to the inode
still exists, and if it does, it ups its ref count to prevent a rmdir
from happening.

It's a very limited mkdir and rmdir. I remember Al asking me questions
about what it can and can't do, and me telling him to make sure the
lock release wasn't going to cause problems.

I wrote several fuzzers on this code that perform mkdir and rmdir on the
instances while other tasks are trying to access them (reading and
writing). In the beginning, it found a few bugs. But I was finally able
to get my fuzzers to work non-stop for over a month and that's when I
felt that this is fine.

I was always weary of this code because of the locking situation. The
kernel selftests has a stress test on this code too.

  tools/testing/selftests/ftrace/test.d/instances/instance-event.tc
  tools/testing/selftests/ftrace/test.d/instances/instance.tc

Which is run as part of the selftest suite, which is run by most people
testing the kernel, including KernelCI.

I haven't had a problem with this code in years, and unless we find a
real bug that needs to be fixed, I'm afraid to touch it.

-- Steve

  reply	other threads:[~2024-01-31 12:57 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-30 19:03 [PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily Linus Torvalds
2024-01-30 19:03 ` [PATCH 2/6] eventfsfs: initialize the tracefs inode properly Linus Torvalds
2024-01-30 19:48   ` Steven Rostedt
2024-01-30 19:56     ` Steven Rostedt
2024-01-30 19:49   ` Steven Rostedt
2024-01-30 19:03 ` [PATCH 3/6] tracefs: dentry lookup crapectomy Linus Torvalds
2024-01-30 23:26   ` Al Viro
2024-01-30 23:55     ` Steven Rostedt
2024-01-31  0:07       ` Al Viro
2024-01-31  0:23   ` Al Viro
2024-01-31  0:39     ` Linus Torvalds
2024-01-30 19:03 ` [PATCH 4/6] eventfs: remove unused 'd_parent' pointer field Linus Torvalds
2024-01-30 19:03 ` [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts Linus Torvalds
2024-01-30 20:55   ` Steven Rostedt
2024-01-30 21:52     ` Linus Torvalds
2024-01-30 22:56       ` Steven Rostedt
2024-01-30 22:58         ` Linus Torvalds
2024-01-30 23:04           ` Steven Rostedt
2024-01-30 22:56       ` Linus Torvalds
2024-01-30 23:06         ` Linus Torvalds
2024-01-30 23:10           ` Steven Rostedt
2024-01-30 23:25             ` Linus Torvalds
2024-01-31  0:48   ` Al Viro
2024-01-31  5:09   ` Steven Rostedt
2024-01-31  5:25     ` Linus Torvalds
2024-01-31  5:33       ` Steven Rostedt
2024-01-31  5:57         ` Linus Torvalds
2024-01-31  6:20           ` Linus Torvalds
2024-01-31 12:57             ` Steven Rostedt [this message]
2024-01-31 13:14               ` Steven Rostedt
2024-01-31 18:08                 ` Linus Torvalds
2024-01-31 18:39                   ` Steven Rostedt
2024-01-30 19:03 ` [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function Linus Torvalds
2024-01-30 21:25   ` Steven Rostedt
2024-01-30 21:36     ` Linus Torvalds
2024-01-31  1:12   ` Al Viro
2024-01-31  2:37     ` Linus Torvalds
2024-01-31  2:46       ` Al Viro
2024-01-31  3:39         ` Linus Torvalds
2024-01-31  4:28           ` Al Viro
2024-01-31 18:00   ` Steven Rostedt
2024-01-31 23:07     ` Masami Hiramatsu
2024-01-31 23:23       ` Steven Rostedt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240131075740.660e7634@rorschach.local.home \
    --to=rostedt@goodmis.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=mhiramat@kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox