From: Omar Sandoval <osandov@osandov.com>
To: dsterba@suse.cz, Chris Mason <clm@fb.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
Kernel Team <Kernel-team@fb.com>,
Nikolay Borisov <nborisov@suse.com>
Subject: Re: [PATCH v2] Btrfs: fix missing delayed iputs on unmount
Date: Thu, 1 Nov 2018 09:00:56 -0700
Message-ID: <20181101160056.GD18005@vader>
In-Reply-To: <20181101152948.GP9136@twin.jikos.cz>
On Thu, Nov 01, 2018 at 04:29:48PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 08:24:25AM -0700, Omar Sandoval wrote:
> > On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote:
> > > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> > > > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > > > > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > > > >
> > > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > > > > >> From: Omar Sandoval <osandov@fb.com>
> > > > > >>
> > > > > >> There's a race between close_ctree() and cleaner_kthread().
> > > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > > > > >> sees it set, but this is racy; the cleaner might have already checked
> > > > > > >> the bit and could be cleaning stuff. In particular, if it deletes unused
> > > > > >> block groups, it will create delayed iputs for the free space cache
> > > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > > > > > >> longer running delayed iputs after a commit. Therefore, if the cleaner
> > > > > >> creates more delayed iputs after delayed iputs are run in
> > > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > > > > >> inode crash from the VFS.
> > > > > >>
> > > > > >> Fix it by parking the cleaner
> > > > > >
> > > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > > > > we're missing a commit or clean up data structures. Messing with state
> > > > > > > of a thread would be the last thing I'd try after proving that it's not
> > > > > > possible to fix in the logic of btrfs itself.
> > > > > >
> > > > > > > The shutdown sequence in close_ctree is quite tricky and we've had bugs
> > > > > > there. The interdependencies of thread and data structures and other
> > > > > > subsystems cannot have loops that could not find an ordering that will
> > > > > > not leak something.
> > > > > >
> > > > > > It's not a big problem if some step is done more than once, like
> > > > > > committing or cleaning up some other structures if we know that
> > > > > > it could create new.
> > > > >
> > > > > The problem is the cleaner thread needs to be told to stop doing new
> > > > > work, and we need to wait for the work it's already doing to be
> > > > > finished. We're getting "stop doing new work" already because the
> > > > > cleaner thread checks to see if the FS is closing, but we don't have a
> > > > > way today to wait for him to finish what he's already doing.
> > > > >
> > > > > kthread_park() is basically the same as adding another mutex or
> > > > > synchronization point. I'm not sure how we could change close_ctree() or
> > > > > the final commit to pick this up more effectively?
> > > >
> > > > The idea is:
> > > >
> > > > cleaner                          close_ctree thread
> > > >
> > > >                                  tell cleaner to stop
> > > >                                  wait
> > > > start work
> > > > if should_stop, then exit
> > > >                                  cleaner is stopped
> > > >
> > > > [does not run: finish work]
> > > > [does not run: loop]
> > > >                                  pick up the work or make
> > > >                                  sure there's nothing in
> > > >                                  progress anymore
> > > >
> > > >
> > > > A simplified version in code:
> > > >
> > > > set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > > >
> > > > wait for defrag - could be started from cleaner but next iteration will
> > > > see the fs closed and will not continue
> > > >
> > > > kthread_stop(transaction_kthread)
> > > >
> > > > kthread_stop(cleaner_kthread)
> > > >
> > > > /* next, everything that could be left from cleaner should be finished */
> > > >
> > > > btrfs_delete_unused_bgs();
> > > > assert there are no defrags
> > > > assert there are no delayed iputs
> > > > commit if necessary
> > > >
> > > > IOW the unused block groups are removed from close_ctree too early,
> > > > moving that after the threads stop AND making sure that it does not need
> > > > either of them should work.
> > > >
> > > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > > > calls plain btrfs_end_transaction that wakes up transaction kthread, so
> > > > there would need to be an argument passed to tell it to do full commit.
> > >
> > > Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> > > thread is a no-op,
> >
> > How is that? kthread_stop() frees the task struct, so wake_up_process()
> > would be a use-after-free.
>
> That was an assumption for the demonstration purposes, the wording was
> confusing sorry.
Oh, well in that case, that's exactly what kthread_park() is ;) Stop the
thread and make wake_up_process() a no-op, and then we don't need to add
special cases everywhere else.
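
To make the parking idea concrete, here is a minimal userspace sketch,
assuming pthreads as a stand-in for kthreads. It is not the patch and not
kernel code: struct worker, worker_park(), worker_wake() and
unmount_demo() are hypothetical names made up for illustration, and
pending_iputs only loosely models the delayed iputs the cleaner leaves
behind. The point it tries to show is that park returns only after the
work in flight has finished, and that a later wake-up does nothing.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for a parkable kernel thread. */
struct worker {
	bool should_park;	/* controller asks the worker to go idle */
	bool parked;		/* worker confirms it is idle */
	bool should_stop;	/* controller asks the worker to exit */
	int pending_iputs;	/* leftovers the worker produces */
	pthread_mutex_t lock;
	pthread_cond_t cond;
	pthread_t thread;
};

static void short_sleep(long usec)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = usec * 1000L };

	nanosleep(&ts, NULL);
}

static void *cleaner(void *arg)
{
	struct worker *w = arg;

	pthread_mutex_lock(&w->lock);
	while (!w->should_stop) {
		if (w->should_park) {
			/* Park point: only reached between units of work. */
			w->parked = true;
			pthread_cond_broadcast(&w->cond);
			pthread_cond_wait(&w->cond, &w->lock);
			continue;	/* any wake-up just re-checks the flags */
		}
		w->parked = false;
		pthread_mutex_unlock(&w->lock);
		short_sleep(100);	/* one unit of "cleaning" */
		pthread_mutex_lock(&w->lock);
		w->pending_iputs++;	/* work in flight always completes */
	}
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

/* Modeled on kthread_park(): returns only once the worker is idle. */
static void worker_park(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->should_park = true;
	while (!w->parked)
		pthread_cond_wait(&w->cond, &w->lock);
	pthread_mutex_unlock(&w->lock);
}

/* Modeled on wake_up_process(): harmless while the worker is parked. */
static void worker_wake(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	pthread_cond_broadcast(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

/* Returns the number of iputs created after the park completed, or -1. */
int unmount_demo(void)
{
	struct worker w = { .should_park = false };
	int leaked;

	pthread_mutex_init(&w.lock, NULL);
	pthread_cond_init(&w.cond, NULL);
	if (pthread_create(&w.thread, NULL, cleaner, &w) != 0)
		return -1;
	short_sleep(5000);		/* let the cleaner run for a while */

	worker_park(&w);
	pthread_mutex_lock(&w.lock);
	assert(w.parked);
	w.pending_iputs = 0;		/* "run delayed iputs" */
	pthread_mutex_unlock(&w.lock);

	worker_wake(&w);		/* a late wake-up from another thread */
	short_sleep(5000);

	pthread_mutex_lock(&w.lock);
	leaked = w.pending_iputs;
	w.should_stop = true;		/* only now stop the thread for good */
	pthread_cond_broadcast(&w.cond);
	pthread_mutex_unlock(&w.lock);
	pthread_join(w.thread, NULL);
	pthread_cond_destroy(&w.cond);
	pthread_mutex_destroy(&w.lock);
	return leaked;
}
```

Calling unmount_demo() from a main() built with -pthread should return 0:
nothing new shows up after the park, even though the worker was woken
again in between. In the kernel the same shape would use kthread_park()
on the unmount side and a kthread_should_park()/kthread_parkme() check in
the thread's loop.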