Is VFS behavior fine? - Artem Bityutskiy

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
To: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org,
	Hunter Adrian <ext-Adrian.Hunter@nokia.com>
Subject: Is VFS behavior fine?
Date: Wed, 02 Jul 2008 11:51:43 +0300	[thread overview]
Message-ID: <486B419F.7050407@nokia.com> (raw)

Hi,

Zoltan Sogor noticed the following VFS behavior while testing
UBIFS. He ran the following shell script:

mkdir xxx
cd xxx
rmdir ../xxx
touch tmp
cd /

The result of it was that the '->delete_inode()' operation for
the 'xxx' directory inode is not called. It is not called even
after 'cd /'. However, if we do not do the 'touch tmp' command
(which actually fails), '->delete_inode()' _is_ called for 'xxx'.

So 'touch tmp' has a side effect. You may notice this by watching
'cat /proc/slabinfo | grep dentry' while running the following
script:

while true; do
        echo live
        for i in `seq 1 1000`
        do
                mkdir $i
                cd $i
                rmdir ../$i
                touch tmp &> /dev/null
                cd - &> /dev/null
        done;
done;

Dentry cache will be growing. However, the same script but
without 'touch tmp' does not make dentry cache to grow.

For UBIFS this causes the problems because its orphan area
gets overflowed (orphans are added, but not removed because
'->delete_inode()' is not called).

==============================================================
I explored this some more. What happens  is:

1. Current directory is 'xxx', its 'i_nlink' is 0 (because of
   'rmdir ../xxx'), but because it is current directory, 'xxx'
   dentry has 'd_count' = 1, so 'iput_final()' is not called
   so far. If the script would just do 'cd /', the inode
   would be deleted. However, we are going to run
   'touch tmp'.

2. 'touch tmp' calls 'open(tmp, O_CREAT)'.

3. we are in 'do_filp_open()' function (see fs/namei.c).
   The situation is: xxx dentry 'd_count' is 1,
   'xxx' inode i_count is 1 as well.

4. 'path_lookup_create()' is called, which increases 'd_count'
   of 'xxx' direntry to 2 and succeeds.

5. 'do_filp_open()' calls 'lookup_hash()', which in turn calls
   'd_alloc()', which allocates a dentry object for 'tmp', sets
   its 'd_count' to 1, and also increases 'd_count' of 'xxx'
   dentry (the parent) to 3.

   So the situation is: 'xxx' dentry 'd_count' is 3
                        'tmp' dentry has d_count 1

6. '__lookup_hash()' calls 'inode->i_op->lookup()', which does
   not find the dentry, but calls 'd_splice_alias(NULL, dentry)'
   anyway, which makes 'tmp' hashed. Not sure, but AFAIU the idea
   is to make it "negative" dentry anyway, to optimize lookups.

7. Return to 'do_filp_open()', and '__open_namei_create()' is
   called, which in turn calls 'vfs_create()', which fails in
   the 'may_create()' check on the 'IS_DEADDIR(dir)' check.

   '__open_namei_create()' 'dput()'s 'xxx' direntry and returns
   error.

   So the situation is: 'xxx' dentry 'd_count' is 2
                        'tmp' dentry has d_count 1
                        'xxx' inode 'i_count' is still 1.

8. We do 'goto exit', which calls 'dput()' for 'tmp' dentry,
   (because 'd_alloc()' set it to 1). However, this _does not_
   decrease 'd_count' of the parent 'xxx' dentry (but it _did_
   increase it - see step 5.

   So the situation is: 'xxx' dentry 'd_count' is 2
                        'tmp' dentry d_count is 0
                        'xxx' inode 'i_count' is still 1.

9. The shell script does 'cd /', which causes 'xxx' dentry
   'd_count' to be decreased. However, 'd_count' stays 1,
   and 'xxx' inode 'i_count' stays 1. So the result is that
   '->delete_inode()' is not called for removed 'xxx' inode
   until unmount or memory pressure.

=============================================================

My understanding is that negative direntry 'tmp' keeps deleted
directory inode 'xxx'. However, I am not 100% sure I understand
the situation correctly.

The 'tmp' dentry is freed eventually because of unmount or
memory pressure, but not earlier than that. So the deleted
inodes may be kept for really long time. Is this OK?

I may just say that I fixed this in UBIFS by not calling
'd_splice_alias()' for not found dentries if the parent
directory inode has 'n_link' = 0. However, ext[23] always
call 'd_splice_alias()' for not found direntries (passing
NULL as the 'inode' parameter).

Again, I am not 100% sure this is the right fix, because
I suspect this should be "fixed" in VFS. I tried to do this
and I have a small VFS patch, but it is probably incorrect.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

next             reply	other threads:[~2008-07-02  8:53 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-07-02  8:51 Artem Bityutskiy [this message]
2008-07-02 11:03 ` Is VFS behavior fine? Miklos Szeredi
2008-07-02 13:18   ` Artem Bityutskiy
2008-07-03 14:29 ` Breno Leitao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=486B419F.7050407@nokia.com \
    --to=artem.bityutskiy@nokia.com \
    --cc=ext-Adrian.Hunter@nokia.com \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).