All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josh Hunt <johunt@akamai.com>
To: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"sandeen@redhat.com" <sandeen@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
Date: Mon, 28 Feb 2011 12:17:55 -0800	[thread overview]
Message-ID: <4D6C02F3.7070209@akamai.com> (raw)
In-Reply-To: <20110228175734.GC20805@quack.suse.cz>

On 02/28/2011 09:57 AM, Jan Kara wrote:
> On Thu 24-02-11 12:18:32, Josh Hunt wrote:
>> On 02/24/2011 03:20 AM, Jan Kara wrote:
>>> On Thu 24-02-11 06:37:49, Al Viro wrote:
>>>> On Wed, Feb 23, 2011 at 10:21:41PM -0800, Josh Hunt wrote:
>>>>> [resending: left Jan off the original mail by accident]
>>>>>
>>>>> We have a multi-threaded workload which is currently "losing" files in the form
>>>>> of unattached inodes. The workload is link, rename, unlink intensive. This is
>>>>> happening on an ext2 filesystem and have reproduced the issue in kernel
>>>>> 2.6.37.  Here's a sample strace:
>>>>>
>>>>>     open("/a/tmp/tmpfile.1296184058", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 9
>>>>>     link("/a/tmp/tmpfile.1296184058", "/a/tmp/tmpfile.28117.1296184059") = 0
>>>>>     rename("/a/tmp/tmpfile.28117.1296184059", "/a/tmp/tmpfile") = 0
>>>>>     stat64("/a/tmp/tmpfile", {st_mode=S_IFREG|0644, st_size=24248267, ...}) = 0
>>>>>     link("/a/tmp/tmpfile", "/a/tmp/submit/tmpfile")        = 0
>>>>>     open("/a/tmp/tmpfile.1296184058", O_RDONLY) = 13
>>>>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDWR|O_CREAT|O_EXCL, 0600) = 824
>>>>>     rename("/a/tmp/submit/tmpfile", "/a/tmp/submit/tmpfile.send.q9SNoL")                = 0
>>>>>     unlink("/a/tmp/tmpfile.1296184058") = 0
>>>>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 827
>>>>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 828
>>>>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 829
>>>>>     unlink("/a/tmp/submit/tmpfile.send.q9SNoL") = 0
>>>>>
>>>>> The application behavior shown above repeats indefinitely with most filenames
>>>>> changing during each iteration except for 'tmpfile'. Looking into this issue I
>>>>> see that vfs_rename_other() only takes i_mutex for the new inode and the new
>>>>> inode's directory as well as the old directory's mutex. This works for
>>>>> modifying the dir entry and appears to be fine for most filesystems, but
>>>>> ext2 and a few others (exofs, minix, nilfs2, omfs, sysv, ufs) modify i_nlink
>>>>> inside of their respective rename functions without grabbing the i_mutex. The
>>>>> modifications are done through calls to inode_inc_link_count(old_inode) and
>>>>> inode_dec_link_count(old_inode), etc.
>>>>>
>>>>> Taking the mutex for the old inode appears to resolve the issue of the
>>>>> lost files/unattached inodes that I am seeing with this workload. I've attached
>>>>> a patch below doing what I've described above. If this is an accepted solution
>>>>> I believe other filesystems may also be affected by this and I could provide
>>>>> a patch for those as well.
>>>>
>>>> I don't know...  The thing is, we mostly do that to make life easier for
>>>> fsck in case of crash.  Other than that, there's no reason to play with
>>>> link count of that sucker at all.  The question is, do we really want
>>>> such rename() interrupted by dirty shutdown to result in what looks like two
>>>> legitimate links to that inode without any indications of what had happened?
>>>> Note that fsck (at least on ext2) will correct link counts anyway and if
>>>> nothing else, we probably want some noise pointing to the inode in question...
>>>   Yeah, I agree that playing with the link count is not worth it. It is
>>> even more disputable because it would have some reasonable effect only if
>>> we happened to write out the moved inode after it is linked to the new
>>> directory and before it is unlinked from the old one. Moreover we'd need
>>> to writeout the new directory and not the old directory before crash
>>> happens. All this is highly unlikely and even if that happens, it is
>>> questionable whether the result is worth it. So I'll just do away with
>>> those games with link count...
>>>   The patch is attached. Josh, can you test it as well? Thanks.
>>>
>>> 								Honza
>> Jan
>>
>> I'm not seeing the problem with your patch as was expected since we're
>> not messing with i_nlink anymore. Al suggested marking the inode as
>> dirty where we were previously doing the old_inode dec. I believe this
>> is needed as well since we are updating it's ctime. I've attached a
>> version marking the inode dirty and it also fixes the comment making
>> reference to calling inode_dec_link_count().
>   Yeah, good catch. Thanks.
> 
>> I'm not completely clear on the historical reasons for messing with the
>> link count of old_inode in the first place. It was just to simulate the
>> linking and unlinking of the old_inode?
>   Yes.
> 
> So I took your patch and used a changelog from mine as I find it more
> descriptive. The resulting patch is in my tree (and attached).
> 
> 								Honza
Jan

Thanks. We should probably send this to the stable guys as well.

I've found the same issue with a few other filesystems. I'll bundle up
those patches and send out a set in the coming days along with an easily
reproducible testcase.

Josh

  reply	other threads:[~2011-02-28 20:17 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-02-24  6:21 [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename Josh Hunt
2011-02-24  6:37 ` Al Viro
2011-02-24  6:42   ` fat64 implementation Keshava Munegowda
2011-02-24 11:20   ` [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename Jan Kara
2011-02-24 20:18     ` Josh Hunt
2011-02-25  7:38       ` Marco Stornelli
2011-02-28 17:57       ` Jan Kara
2011-02-28 20:17         ` Josh Hunt [this message]
2011-02-28 20:56           ` Jan Kara
2011-03-01 12:15             ` Al Viro
2011-03-01 11:07       ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2011-02-22 23:42 Josh Hunt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4D6C02F3.7070209@akamai.com \
    --to=johunt@akamai.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sandeen@redhat.com \
    --cc=viro@ZenIV.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.