Re: [PATCH V4] btrfs: implement delayed inode items operation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Chris Mason <chris.mason@oracle.com>
To: Miao Xie <miaox@cn.fujitsu.com>
Cc: Linux Btrfs <linux-btrfs@vger.kernel.org>,
	David Sterba <dave@jikos.cz>, Ito <t-itoh@jp.fujitsu.com>,
	Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Subject: Re: [PATCH V4] btrfs: implement delayed inode items operation
Date: Mon, 21 Mar 2011 08:08:17 -0400	[thread overview]
Message-ID: <1300709085-sup-9849@think> (raw)
In-Reply-To: <4D86DC92.40608@cn.fujitsu.com>

Excerpts from Miao Xie's message of 2011-03-21 01:05:22 -0400:
> On sun, 20 Mar 2011 20:33:34 -0400, Chris Mason wrote:
> > Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
> >> Changelog V3 -> V4:
> >> - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
> >>   inodes in time.
> > 
> > I ran some tests on this and had trouble with my stress.sh script:
> > 
> > http://oss.oracle.com/~mason/stress.sh
> > 
> > I used:
> > 
> > stress.sh -n 50 -c <path to linux kernel git tree> /mnt
> > 
> > The git tree has all the .git files but no .o files.
> > 
> > The problem was that within about 20 minutes, the filesystem was
> > spending almost all of its time in balance_dirty_pages().  The problem
> > is that data writeback isn't complete until the endio handlers have
> > finished inserting metadata into the btree.
> > 
> > The v4 patch calls btrfs_btree_balance_dirty() from all the
> > btrfs_end_transaction variants, which means that the FS writeback code
> > waits for balance_dirty_pages(), which won't make progress until the FS
> > writeback code is done.
> > 
> > So I changed things to call the delayed inode balance function only from
> > inside btrfs_btree_balance_dirty(), which did resolve the stalls.  But
> 
> Ok, but can we invoke the delayed inode balance function before
> balance_dirty_pages_ratelimited_nr(), because the delayed item insertion and
> deletion also bring us some dirty pages.

Yes, good point.

> 
> > I found a few times that when I did rmmod btrfs, there would be delayed
> > inode objects leaked in the slab cache.  rmmod will try to destroy the
> > slab cache, which will fail because we haven't freed everything.
> > 
> > It looks like we have a race in btrfs_get_or_create_delayed_node, where
> > two concurrent callers can both create delayed nodes and then race on
> > adding it to the inode.
> 
> Sorry for my mistake, I thought we updated the inodes when holding i_mutex originally,
> so I didn't use any lock or other method to protect delayed_node of the inodes.
> 
> But I think we needn't use rcu lock to protect delayed_node when we want to get the
> delayed node, because we won't change it after it is created, cmpxchg() and ACCESS_ONCE()
> can protect it well. What do you think about?
> 
> PS: I worry about the inode update without holding i_mutex.

We have the tree locks to make sure we're serialized while we actually
change the tree.  The only places that go in without locking are times
updates.

> 
> > I also think that code is racing with the code that frees delayed nodes,
> > but haven't yet triggered my debugging printks to prove either one.
> 
> We free delayed nodes when we want to destroy the inode, at that time, just one task,
> which is destroying inode, can access the delayed nodes, so I think ACCESS_ONCE() is
> enough. What do you think about?

Great, I see what you mean.  The bigger problem right now is that we may do
a lot of operations in destroy_inode(), which can block the slab
shrinkers on our metadata operations.  That same stress.sh -n 50 run is
running into OOM.

So we need to rework the part where the final free is done.  We could
keep a ref on the inode until the delayed items are complete, or we
could let the inode go and make a way to lookup the delayed node when
the inode is read.

I'll read more today.

-chris

next prev parent reply	other threads:[~2011-03-21 12:08 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-03-18  9:24 [PATCH V4] btrfs: implement delayed inode items operation Miao Xie
2011-03-21  0:33 ` Chris Mason
2011-03-21  5:05   ` Miao Xie
2011-03-21 12:08     ` Chris Mason [this message]
2011-03-23  1:57       ` Miao Xie
2011-03-23 14:20         ` Miao Xie
2011-03-22  2:33 ` Itaru Kitayama
2011-03-22  3:12   ` Miao Xie
2011-03-22  3:50     ` Itaru Kitayama
2011-03-22 10:03       ` Miao Xie
2011-03-22 13:33         ` Itaru Kitayama
2011-03-23  1:27           ` Miao Xie
2011-03-23  3:24             ` Itaru Kitayama
2011-03-23  4:00               ` Miao Xie
2011-03-23  4:19                 ` Itaru Kitayama
2011-03-23  9:47                   ` Miao Xie
2011-03-24  3:38                     ` Itaru Kitayama

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1300709085-sup-9849@think \
    --to=chris.mason@oracle.com \
    --cc=dave@jikos.cz \
    --cc=kitayama@cl.bb4u.ne.jp \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=miaox@cn.fujitsu.com \
    --cc=t-itoh@jp.fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.