* [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
@ 2011-02-24 6:21 Josh Hunt
2011-02-24 6:37 ` Al Viro
0 siblings, 1 reply; 11+ messages in thread
From: Josh Hunt @ 2011-02-24 6:21 UTC (permalink / raw)
To: Jan Kara, linux-ext4; +Cc: sandeen, linux-kernel, linux-fsdevel, viro
[resending: left Jan off the original mail by accident]
We have a multi-threaded workload which is currently "losing" files in the form
of unattached inodes. The workload is link, rename, unlink intensive. This is
happening on an ext2 filesystem and have reproduced the issue in kernel
2.6.37. Here's a sample strace:
open("/a/tmp/tmpfile.1296184058", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 9
link("/a/tmp/tmpfile.1296184058", "/a/tmp/tmpfile.28117.1296184059") = 0
rename("/a/tmp/tmpfile.28117.1296184059", "/a/tmp/tmpfile") = 0
stat64("/a/tmp/tmpfile", {st_mode=S_IFREG|0644, st_size=24248267, ...}) = 0
link("/a/tmp/tmpfile", "/a/tmp/submit/tmpfile") = 0
open("/a/tmp/tmpfile.1296184058", O_RDONLY) = 13
open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDWR|O_CREAT|O_EXCL, 0600) = 824
rename("/a/tmp/submit/tmpfile", "/a/tmp/submit/tmpfile.send.q9SNoL") = 0
unlink("/a/tmp/tmpfile.1296184058") = 0
open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 827
open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 828
open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 829
unlink("/a/tmp/submit/tmpfile.send.q9SNoL") = 0
The application behavior shown above repeats indefinitely with most filenames
changing during each iteration except for 'tmpfile'. Looking into this issue I
see that vfs_rename_other() only takes i_mutex for the new inode and the new
inode's directory as well as the old directory's mutex. This works for
modifying the dir entry and appears to be fine for most filesystems, but
ext2 and a few others (exofs, minix, nilfs2, omfs, sysv, ufs) modify i_nlink
inside of their respective rename functions without grabbing the i_mutex. The
modifications are done through calls to inode_inc_link_count(old_inode) and
inode_dec_link_count(old_inode), etc.
Taking the mutex for the old inode appears to resolve the issue of the
lost files/unattached inodes that I am seeing with this workload. I've attached
a patch below doing what I've described above. If this is an accepted solution
I believe other filesystems may also be affected by this and I could provide
a patch for those as well.
Thanks
Josh
ext2_rename modifies old_inode's nlink values through
inode_inc_link_count(old_inode) and inode_dec_link_count(old_inode) without
holding old_inode's mutex. vfs_rename_other() only takes the mutex of the new
inode and directory and old inode's directory. This causes old inode's nlink
values to become incorrect and results in an unattached inode.
CC: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josh Hunt <johunt@akamai.com>
---
fs/ext2/namei.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 2e1d834..827839a 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -321,6 +321,8 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
dquot_initialize(old_dir);
dquot_initialize(new_dir);
+ mutex_lock(&old_inode->i_mutex);
+
old_de = ext2_find_entry (old_dir, &old_dentry->d_name, &old_page);
if (!old_de)
goto out;
@@ -375,6 +377,7 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
ext2_delete_entry (old_de, old_page);
inode_dec_link_count(old_inode);
+ mutex_unlock(&old_inode->i_mutex);
if (dir_de) {
if (old_dir != new_dir)
@@ -397,6 +400,7 @@ out_old:
kunmap(old_page);
page_cache_release(old_page);
out:
+ mutex_unlock(&old_inode->i_mutex);
return err;
}
--
1.7.0.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
2011-02-24 6:21 [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename Josh Hunt
@ 2011-02-24 6:37 ` Al Viro
2011-02-24 6:42 ` fat64 implementation Keshava Munegowda
2011-02-24 11:20 ` [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename Jan Kara
0 siblings, 2 replies; 11+ messages in thread
From: Al Viro @ 2011-02-24 6:37 UTC (permalink / raw)
To: Josh Hunt; +Cc: Jan Kara, linux-ext4, sandeen, linux-kernel, linux-fsdevel
On Wed, Feb 23, 2011 at 10:21:41PM -0800, Josh Hunt wrote:
> [resending: left Jan off the original mail by accident]
>
> We have a multi-threaded workload which is currently "losing" files in the form
> of unattached inodes. The workload is link, rename, unlink intensive. This is
> happening on an ext2 filesystem and have reproduced the issue in kernel
> 2.6.37. Here's a sample strace:
>
> open("/a/tmp/tmpfile.1296184058", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 9
> link("/a/tmp/tmpfile.1296184058", "/a/tmp/tmpfile.28117.1296184059") = 0
> rename("/a/tmp/tmpfile.28117.1296184059", "/a/tmp/tmpfile") = 0
> stat64("/a/tmp/tmpfile", {st_mode=S_IFREG|0644, st_size=24248267, ...}) = 0
> link("/a/tmp/tmpfile", "/a/tmp/submit/tmpfile") = 0
> open("/a/tmp/tmpfile.1296184058", O_RDONLY) = 13
> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDWR|O_CREAT|O_EXCL, 0600) = 824
> rename("/a/tmp/submit/tmpfile", "/a/tmp/submit/tmpfile.send.q9SNoL") = 0
> unlink("/a/tmp/tmpfile.1296184058") = 0
> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 827
> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 828
> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 829
> unlink("/a/tmp/submit/tmpfile.send.q9SNoL") = 0
>
> The application behavior shown above repeats indefinitely with most filenames
> changing during each iteration except for 'tmpfile'. Looking into this issue I
> see that vfs_rename_other() only takes i_mutex for the new inode and the new
> inode's directory as well as the old directory's mutex. This works for
> modifying the dir entry and appears to be fine for most filesystems, but
> ext2 and a few others (exofs, minix, nilfs2, omfs, sysv, ufs) modify i_nlink
> inside of their respective rename functions without grabbing the i_mutex. The
> modifications are done through calls to inode_inc_link_count(old_inode) and
> inode_dec_link_count(old_inode), etc.
>
> Taking the mutex for the old inode appears to resolve the issue of the
> lost files/unattached inodes that I am seeing with this workload. I've attached
> a patch below doing what I've described above. If this is an accepted solution
> I believe other filesystems may also be affected by this and I could provide
> a patch for those as well.
I don't know... The thing is, we mostly do that to make life easier for
fsck in case of crash. Other than that, there's no reason to play with
link count of that sucker at all. The question is, do we really want
such rename() interrupted by dirty shutdown to result in what looks like two
legitimate links to that inode without any indications of what had happened?
Note that fsck (at least on ext2) will correct link counts anyway and if
nothing else, we probably want some noise pointing to the inode in question...
IOW, maybe it's better to rip these inc/dec for old_inode out and replace
them with mark_inode_dirty() in place where we currently do dec. Comments?
^ permalink raw reply [flat|nested] 11+ messages in thread
* fat64 implementation
2011-02-24 6:37 ` Al Viro
@ 2011-02-24 6:42 ` Keshava Munegowda
2011-02-24 11:20 ` [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename Jan Kara
1 sibling, 0 replies; 11+ messages in thread
From: Keshava Munegowda @ 2011-02-24 6:42 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
Hi All
Is fat64 implementation is existing today? Are there any RFC patches?
Any body working on the implementation of fat64 file system?
Regards
Keshava Munegowda
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
2011-02-24 6:37 ` Al Viro
2011-02-24 6:42 ` fat64 implementation Keshava Munegowda
@ 2011-02-24 11:20 ` Jan Kara
2011-02-24 20:18 ` Josh Hunt
1 sibling, 1 reply; 11+ messages in thread
From: Jan Kara @ 2011-02-24 11:20 UTC (permalink / raw)
To: Al Viro
Cc: Josh Hunt, Jan Kara, linux-ext4, sandeen, linux-kernel,
linux-fsdevel
[-- Attachment #1: Type: text/plain, Size: 3621 bytes --]
On Thu 24-02-11 06:37:49, Al Viro wrote:
> On Wed, Feb 23, 2011 at 10:21:41PM -0800, Josh Hunt wrote:
> > [resending: left Jan off the original mail by accident]
> >
> > We have a multi-threaded workload which is currently "losing" files in the form
> > of unattached inodes. The workload is link, rename, unlink intensive. This is
> > happening on an ext2 filesystem and have reproduced the issue in kernel
> > 2.6.37. Here's a sample strace:
> >
> > open("/a/tmp/tmpfile.1296184058", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 9
> > link("/a/tmp/tmpfile.1296184058", "/a/tmp/tmpfile.28117.1296184059") = 0
> > rename("/a/tmp/tmpfile.28117.1296184059", "/a/tmp/tmpfile") = 0
> > stat64("/a/tmp/tmpfile", {st_mode=S_IFREG|0644, st_size=24248267, ...}) = 0
> > link("/a/tmp/tmpfile", "/a/tmp/submit/tmpfile") = 0
> > open("/a/tmp/tmpfile.1296184058", O_RDONLY) = 13
> > open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDWR|O_CREAT|O_EXCL, 0600) = 824
> > rename("/a/tmp/submit/tmpfile", "/a/tmp/submit/tmpfile.send.q9SNoL") = 0
> > unlink("/a/tmp/tmpfile.1296184058") = 0
> > open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 827
> > open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 828
> > open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 829
> > unlink("/a/tmp/submit/tmpfile.send.q9SNoL") = 0
> >
> > The application behavior shown above repeats indefinitely with most filenames
> > changing during each iteration except for 'tmpfile'. Looking into this issue I
> > see that vfs_rename_other() only takes i_mutex for the new inode and the new
> > inode's directory as well as the old directory's mutex. This works for
> > modifying the dir entry and appears to be fine for most filesystems, but
> > ext2 and a few others (exofs, minix, nilfs2, omfs, sysv, ufs) modify i_nlink
> > inside of their respective rename functions without grabbing the i_mutex. The
> > modifications are done through calls to inode_inc_link_count(old_inode) and
> > inode_dec_link_count(old_inode), etc.
> >
> > Taking the mutex for the old inode appears to resolve the issue of the
> > lost files/unattached inodes that I am seeing with this workload. I've attached
> > a patch below doing what I've described above. If this is an accepted solution
> > I believe other filesystems may also be affected by this and I could provide
> > a patch for those as well.
>
> I don't know... The thing is, we mostly do that to make life easier for
> fsck in case of crash. Other than that, there's no reason to play with
> link count of that sucker at all. The question is, do we really want
> such rename() interrupted by dirty shutdown to result in what looks like two
> legitimate links to that inode without any indications of what had happened?
> Note that fsck (at least on ext2) will correct link counts anyway and if
> nothing else, we probably want some noise pointing to the inode in question...
Yeah, I agree that playing with the link count is not worth it. It is
even more disputable because it would have some reasonable effect only if
we happened to write out the moved inode after it is linked to the new
directory and before it is unlinked from the old one. Moreover we'd need
to writeout the new directory and not the old directory before crash
happens. All this is highly unlikely and even if that happens, it is
questionable whether the result is worth it. So I'll just do away with
those games with link count...
The patch is attached. Josh, can you test it as well? Thanks.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
[-- Attachment #2: 0001-ext2-Fix-link-count-corruption-under-heavy-link-rena.patch --]
[-- Type: text/x-patch, Size: 2180 bytes --]
>From 69b42a924bcdbbc68413d9217a4269b9d95d79fc Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 24 Feb 2011 11:48:22 +0100
Subject: [PATCH] ext2: Fix link count corruption under heavy link+rename load
vfs_rename_other() does not lock renamed inode with i_mutex. Thus changing
i_nlink in a non-atomic manner (which happens in ext2_rename()) can corrupt
it as reported and analyzed by Josh.
In fact, there is no good reason to mess with i_nlink of the moved file.
We did it presumably to simulate linking into the new directory and unlinking
from an old one. But the practical effect of this is disputable because fsck
can possibly treat file as being properly linked into both directories without
writing any error which is confusing. So we just stop increment-decrement
games with i_nlink which also fixes the corruption.
CC: Al Viro <viro@ZenIV.linux.org.uk>
Reported-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/ext2/namei.c | 7 +------
1 files changed, 1 insertions(+), 6 deletions(-)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 2e1d834..801a5bf 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -344,7 +344,6 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
new_de = ext2_find_entry (new_dir, &new_dentry->d_name, &new_page);
if (!new_de)
goto out_dir;
- inode_inc_link_count(old_inode);
ext2_set_link(new_dir, new_de, new_page, old_inode, 1);
new_inode->i_ctime = CURRENT_TIME_SEC;
if (dir_de)
@@ -356,12 +355,9 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
if (new_dir->i_nlink >= EXT2_LINK_MAX)
goto out_dir;
}
- inode_inc_link_count(old_inode);
err = ext2_add_link(new_dentry, old_inode);
- if (err) {
- inode_dec_link_count(old_inode);
+ if (err)
goto out_dir;
- }
if (dir_de)
inode_inc_link_count(new_dir);
}
@@ -374,7 +370,6 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
old_inode->i_ctime = CURRENT_TIME_SEC;
ext2_delete_entry (old_de, old_page);
- inode_dec_link_count(old_inode);
if (dir_de) {
if (old_dir != new_dir)
--
1.7.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
2011-02-24 11:20 ` [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename Jan Kara
@ 2011-02-24 20:18 ` Josh Hunt
2011-02-25 7:38 ` Marco Stornelli
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Josh Hunt @ 2011-02-24 20:18 UTC (permalink / raw)
To: Jan Kara
Cc: Al Viro, linux-ext4@vger.kernel.org, sandeen@redhat.com,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
[-- Attachment #1: Type: text/plain, Size: 4242 bytes --]
On 02/24/2011 03:20 AM, Jan Kara wrote:
> On Thu 24-02-11 06:37:49, Al Viro wrote:
>> On Wed, Feb 23, 2011 at 10:21:41PM -0800, Josh Hunt wrote:
>>> [resending: left Jan off the original mail by accident]
>>>
>>> We have a multi-threaded workload which is currently "losing" files in the form
>>> of unattached inodes. The workload is link, rename, unlink intensive. This is
>>> happening on an ext2 filesystem and have reproduced the issue in kernel
>>> 2.6.37. Here's a sample strace:
>>>
>>> open("/a/tmp/tmpfile.1296184058", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 9
>>> link("/a/tmp/tmpfile.1296184058", "/a/tmp/tmpfile.28117.1296184059") = 0
>>> rename("/a/tmp/tmpfile.28117.1296184059", "/a/tmp/tmpfile") = 0
>>> stat64("/a/tmp/tmpfile", {st_mode=S_IFREG|0644, st_size=24248267, ...}) = 0
>>> link("/a/tmp/tmpfile", "/a/tmp/submit/tmpfile") = 0
>>> open("/a/tmp/tmpfile.1296184058", O_RDONLY) = 13
>>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDWR|O_CREAT|O_EXCL, 0600) = 824
>>> rename("/a/tmp/submit/tmpfile", "/a/tmp/submit/tmpfile.send.q9SNoL") = 0
>>> unlink("/a/tmp/tmpfile.1296184058") = 0
>>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 827
>>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 828
>>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 829
>>> unlink("/a/tmp/submit/tmpfile.send.q9SNoL") = 0
>>>
>>> The application behavior shown above repeats indefinitely with most filenames
>>> changing during each iteration except for 'tmpfile'. Looking into this issue I
>>> see that vfs_rename_other() only takes i_mutex for the new inode and the new
>>> inode's directory as well as the old directory's mutex. This works for
>>> modifying the dir entry and appears to be fine for most filesystems, but
>>> ext2 and a few others (exofs, minix, nilfs2, omfs, sysv, ufs) modify i_nlink
>>> inside of their respective rename functions without grabbing the i_mutex. The
>>> modifications are done through calls to inode_inc_link_count(old_inode) and
>>> inode_dec_link_count(old_inode), etc.
>>>
>>> Taking the mutex for the old inode appears to resolve the issue of the
>>> lost files/unattached inodes that I am seeing with this workload. I've attached
>>> a patch below doing what I've described above. If this is an accepted solution
>>> I believe other filesystems may also be affected by this and I could provide
>>> a patch for those as well.
>>
>> I don't know... The thing is, we mostly do that to make life easier for
>> fsck in case of crash. Other than that, there's no reason to play with
>> link count of that sucker at all. The question is, do we really want
>> such rename() interrupted by dirty shutdown to result in what looks like two
>> legitimate links to that inode without any indications of what had happened?
>> Note that fsck (at least on ext2) will correct link counts anyway and if
>> nothing else, we probably want some noise pointing to the inode in question...
> Yeah, I agree that playing with the link count is not worth it. It is
> even more disputable because it would have some reasonable effect only if
> we happened to write out the moved inode after it is linked to the new
> directory and before it is unlinked from the old one. Moreover we'd need
> to writeout the new directory and not the old directory before crash
> happens. All this is highly unlikely and even if that happens, it is
> questionable whether the result is worth it. So I'll just do away with
> those games with link count...
> The patch is attached. Josh, can you test it as well? Thanks.
>
> Honza
Jan
I'm not seeing the problem with your patch as was expected since we're
not messing with i_nlink anymore. Al suggested marking the inode as
dirty where we were previously doing the old_inode dec. I believe this
is needed as well since we are updating it's ctime. I've attached a
version marking the inode dirty and it also fixes the comment making
reference to calling inode_dec_link_count().
I'm not completely clear on the historical reasons for messing with the
link count of old_inode in the first place. It was just to simulate the
linking and unlinking of the old_inode?
Thanks
Josh
[-- Attachment #2: 0001-ext2-Resolve-i_nlink-count-corruption-with-heavy-unl.patch --]
[-- Type: text/x-patch, Size: 2039 bytes --]
>From cb347219bc21622dee06491b689c711e452005ce Mon Sep 17 00:00:00 2001
From: Josh Hunt <johunt@akamai.com>
Date: Thu, 24 Feb 2011 11:56:43 -0800
Subject: [PATCH] ext2: Resolve i_nlink count corruption with heavy unlink and rename
old_inode's i_nlink count can become corrupt in ext2_rename() because we do not
grab it's i_mutex in vfs_rename_other(). Looking into this further it has been
determined there is no good reason for changing old_inode's link counts at all.
Instead we just mark the inode as dirty when we are done.
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Josh Hunt <johunt@akamai.com>
---
fs/ext2/namei.c | 9 ++-------
1 files changed, 2 insertions(+), 7 deletions(-)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 2e1d834..adb9185 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -344,7 +344,6 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
new_de = ext2_find_entry (new_dir, &new_dentry->d_name, &new_page);
if (!new_de)
goto out_dir;
- inode_inc_link_count(old_inode);
ext2_set_link(new_dir, new_de, new_page, old_inode, 1);
new_inode->i_ctime = CURRENT_TIME_SEC;
if (dir_de)
@@ -356,12 +355,9 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
if (new_dir->i_nlink >= EXT2_LINK_MAX)
goto out_dir;
}
- inode_inc_link_count(old_inode);
err = ext2_add_link(new_dentry, old_inode);
- if (err) {
- inode_dec_link_count(old_inode);
+ if (err)
goto out_dir;
- }
if (dir_de)
inode_inc_link_count(new_dir);
}
@@ -369,12 +365,11 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
/*
* Like most other Unix systems, set the ctime for inodes on a
* rename.
- * inode_dec_link_count() will mark the inode dirty.
*/
old_inode->i_ctime = CURRENT_TIME_SEC;
+ mark_inode_dirty(old_inode);
ext2_delete_entry (old_de, old_page);
- inode_dec_link_count(old_inode);
if (dir_de) {
if (old_dir != new_dir)
--
1.7.0.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
2011-02-24 20:18 ` Josh Hunt
@ 2011-02-25 7:38 ` Marco Stornelli
2011-02-28 17:57 ` Jan Kara
2011-03-01 11:07 ` Christoph Hellwig
2 siblings, 0 replies; 11+ messages in thread
From: Marco Stornelli @ 2011-02-25 7:38 UTC (permalink / raw)
To: Josh Hunt
Cc: Jan Kara, Al Viro, linux-ext4@vger.kernel.org, sandeen@redhat.com,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Hi,
2011/2/24 Josh Hunt <johunt@akamai.com>:
> Jan
>
> I'm not seeing the problem with your patch as was expected since we're
> not messing with i_nlink anymore. Al suggested marking the inode as
> dirty where we were previously doing the old_inode dec. I believe this
> is needed as well since we are updating it's ctime. I've attached a
> version marking the inode dirty and it also fixes the comment making
> reference to calling inode_dec_link_count().
>
> I'm not completely clear on the historical reasons for messing with the
> link count of old_inode in the first place. It was just to simulate the
> linking and unlinking of the old_inode?
>
> Thanks
> Josh
>
can we share your test/benchmark? I'd like to add it to my test suite
as no-regression test.
Marco
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
2011-02-24 20:18 ` Josh Hunt
2011-02-25 7:38 ` Marco Stornelli
@ 2011-02-28 17:57 ` Jan Kara
2011-02-28 20:17 ` Josh Hunt
2011-03-01 11:07 ` Christoph Hellwig
2 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2011-02-28 17:57 UTC (permalink / raw)
To: Josh Hunt
Cc: Jan Kara, Al Viro, linux-ext4@vger.kernel.org, sandeen@redhat.com,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
[-- Attachment #1: Type: text/plain, Size: 4636 bytes --]
On Thu 24-02-11 12:18:32, Josh Hunt wrote:
> On 02/24/2011 03:20 AM, Jan Kara wrote:
> > On Thu 24-02-11 06:37:49, Al Viro wrote:
> >> On Wed, Feb 23, 2011 at 10:21:41PM -0800, Josh Hunt wrote:
> >>> [resending: left Jan off the original mail by accident]
> >>>
> >>> We have a multi-threaded workload which is currently "losing" files in the form
> >>> of unattached inodes. The workload is link, rename, unlink intensive. This is
> >>> happening on an ext2 filesystem and have reproduced the issue in kernel
> >>> 2.6.37. Here's a sample strace:
> >>>
> >>> open("/a/tmp/tmpfile.1296184058", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 9
> >>> link("/a/tmp/tmpfile.1296184058", "/a/tmp/tmpfile.28117.1296184059") = 0
> >>> rename("/a/tmp/tmpfile.28117.1296184059", "/a/tmp/tmpfile") = 0
> >>> stat64("/a/tmp/tmpfile", {st_mode=S_IFREG|0644, st_size=24248267, ...}) = 0
> >>> link("/a/tmp/tmpfile", "/a/tmp/submit/tmpfile") = 0
> >>> open("/a/tmp/tmpfile.1296184058", O_RDONLY) = 13
> >>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDWR|O_CREAT|O_EXCL, 0600) = 824
> >>> rename("/a/tmp/submit/tmpfile", "/a/tmp/submit/tmpfile.send.q9SNoL") = 0
> >>> unlink("/a/tmp/tmpfile.1296184058") = 0
> >>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 827
> >>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 828
> >>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 829
> >>> unlink("/a/tmp/submit/tmpfile.send.q9SNoL") = 0
> >>>
> >>> The application behavior shown above repeats indefinitely with most filenames
> >>> changing during each iteration except for 'tmpfile'. Looking into this issue I
> >>> see that vfs_rename_other() only takes i_mutex for the new inode and the new
> >>> inode's directory as well as the old directory's mutex. This works for
> >>> modifying the dir entry and appears to be fine for most filesystems, but
> >>> ext2 and a few others (exofs, minix, nilfs2, omfs, sysv, ufs) modify i_nlink
> >>> inside of their respective rename functions without grabbing the i_mutex. The
> >>> modifications are done through calls to inode_inc_link_count(old_inode) and
> >>> inode_dec_link_count(old_inode), etc.
> >>>
> >>> Taking the mutex for the old inode appears to resolve the issue of the
> >>> lost files/unattached inodes that I am seeing with this workload. I've attached
> >>> a patch below doing what I've described above. If this is an accepted solution
> >>> I believe other filesystems may also be affected by this and I could provide
> >>> a patch for those as well.
> >>
> >> I don't know... The thing is, we mostly do that to make life easier for
> >> fsck in case of crash. Other than that, there's no reason to play with
> >> link count of that sucker at all. The question is, do we really want
> >> such rename() interrupted by dirty shutdown to result in what looks like two
> >> legitimate links to that inode without any indications of what had happened?
> >> Note that fsck (at least on ext2) will correct link counts anyway and if
> >> nothing else, we probably want some noise pointing to the inode in question...
> > Yeah, I agree that playing with the link count is not worth it. It is
> > even more disputable because it would have some reasonable effect only if
> > we happened to write out the moved inode after it is linked to the new
> > directory and before it is unlinked from the old one. Moreover we'd need
> > to writeout the new directory and not the old directory before crash
> > happens. All this is highly unlikely and even if that happens, it is
> > questionable whether the result is worth it. So I'll just do away with
> > those games with link count...
> > The patch is attached. Josh, can you test it as well? Thanks.
> >
> > Honza
> Jan
>
> I'm not seeing the problem with your patch as was expected since we're
> not messing with i_nlink anymore. Al suggested marking the inode as
> dirty where we were previously doing the old_inode dec. I believe this
> is needed as well since we are updating it's ctime. I've attached a
> version marking the inode dirty and it also fixes the comment making
> reference to calling inode_dec_link_count().
Yeah, good catch. Thanks.
> I'm not completely clear on the historical reasons for messing with the
> link count of old_inode in the first place. It was just to simulate the
> linking and unlinking of the old_inode?
Yes.
So I took your patch and used a changelog from mine as I find it more
descriptive. The resulting patch is in my tree (and attached).
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
[-- Attachment #2: 0001-ext2-Fix-link-count-corruption-under-heavy-link-rena.patch --]
[-- Type: text/x-patch, Size: 2368 bytes --]
>From 56c78bf2068b00bb6e7726bd8a94a34a1e206f03 Mon Sep 17 00:00:00 2001
From: Josh Hunt <johunt@akamai.com>
Date: Thu, 24 Feb 2011 11:48:22 +0100
Subject: [PATCH] ext2: Fix link count corruption under heavy link+rename load
vfs_rename_other() does not lock renamed inode with i_mutex. Thus changing
i_nlink in a non-atomic manner (which happens in ext2_rename()) can corrupt
it as reported and analyzed by Josh.
In fact, there is no good reason to mess with i_nlink of the moved file.
We did it presumably to simulate linking into the new directory and unlinking
from an old one. But the practical effect of this is disputable because fsck
can possibly treat file as being properly linked into both directories without
writing any error which is confusing. So we just stop increment-decrement
games with i_nlink which also fixes the corruption.
CC: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/ext2/namei.c | 9 ++-------
1 files changed, 2 insertions(+), 7 deletions(-)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 2e1d834..adb9185 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -344,7 +344,6 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
new_de = ext2_find_entry (new_dir, &new_dentry->d_name, &new_page);
if (!new_de)
goto out_dir;
- inode_inc_link_count(old_inode);
ext2_set_link(new_dir, new_de, new_page, old_inode, 1);
new_inode->i_ctime = CURRENT_TIME_SEC;
if (dir_de)
@@ -356,12 +355,9 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
if (new_dir->i_nlink >= EXT2_LINK_MAX)
goto out_dir;
}
- inode_inc_link_count(old_inode);
err = ext2_add_link(new_dentry, old_inode);
- if (err) {
- inode_dec_link_count(old_inode);
+ if (err)
goto out_dir;
- }
if (dir_de)
inode_inc_link_count(new_dir);
}
@@ -369,12 +365,11 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
/*
* Like most other Unix systems, set the ctime for inodes on a
* rename.
- * inode_dec_link_count() will mark the inode dirty.
*/
old_inode->i_ctime = CURRENT_TIME_SEC;
+ mark_inode_dirty(old_inode);
ext2_delete_entry (old_de, old_page);
- inode_dec_link_count(old_inode);
if (dir_de) {
if (old_dir != new_dir)
--
1.7.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
2011-02-28 17:57 ` Jan Kara
@ 2011-02-28 20:17 ` Josh Hunt
2011-02-28 20:56 ` Jan Kara
0 siblings, 1 reply; 11+ messages in thread
From: Josh Hunt @ 2011-02-28 20:17 UTC (permalink / raw)
To: Jan Kara
Cc: Al Viro, linux-ext4@vger.kernel.org, sandeen@redhat.com,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 02/28/2011 09:57 AM, Jan Kara wrote:
> On Thu 24-02-11 12:18:32, Josh Hunt wrote:
>> On 02/24/2011 03:20 AM, Jan Kara wrote:
>>> On Thu 24-02-11 06:37:49, Al Viro wrote:
>>>> On Wed, Feb 23, 2011 at 10:21:41PM -0800, Josh Hunt wrote:
>>>>> [resending: left Jan off the original mail by accident]
>>>>>
>>>>> We have a multi-threaded workload which is currently "losing" files in the form
>>>>> of unattached inodes. The workload is link, rename, unlink intensive. This is
>>>>> happening on an ext2 filesystem and have reproduced the issue in kernel
>>>>> 2.6.37. Here's a sample strace:
>>>>>
>>>>> open("/a/tmp/tmpfile.1296184058", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 9
>>>>> link("/a/tmp/tmpfile.1296184058", "/a/tmp/tmpfile.28117.1296184059") = 0
>>>>> rename("/a/tmp/tmpfile.28117.1296184059", "/a/tmp/tmpfile") = 0
>>>>> stat64("/a/tmp/tmpfile", {st_mode=S_IFREG|0644, st_size=24248267, ...}) = 0
>>>>> link("/a/tmp/tmpfile", "/a/tmp/submit/tmpfile") = 0
>>>>> open("/a/tmp/tmpfile.1296184058", O_RDONLY) = 13
>>>>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDWR|O_CREAT|O_EXCL, 0600) = 824
>>>>> rename("/a/tmp/submit/tmpfile", "/a/tmp/submit/tmpfile.send.q9SNoL") = 0
>>>>> unlink("/a/tmp/tmpfile.1296184058") = 0
>>>>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 827
>>>>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 828
>>>>> open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 829
>>>>> unlink("/a/tmp/submit/tmpfile.send.q9SNoL") = 0
>>>>>
>>>>> The application behavior shown above repeats indefinitely with most filenames
>>>>> changing during each iteration except for 'tmpfile'. Looking into this issue I
>>>>> see that vfs_rename_other() only takes i_mutex for the new inode and the new
>>>>> inode's directory as well as the old directory's mutex. This works for
>>>>> modifying the dir entry and appears to be fine for most filesystems, but
>>>>> ext2 and a few others (exofs, minix, nilfs2, omfs, sysv, ufs) modify i_nlink
>>>>> inside of their respective rename functions without grabbing the i_mutex. The
>>>>> modifications are done through calls to inode_inc_link_count(old_inode) and
>>>>> inode_dec_link_count(old_inode), etc.
>>>>>
>>>>> Taking the mutex for the old inode appears to resolve the issue of the
>>>>> lost files/unattached inodes that I am seeing with this workload. I've attached
>>>>> a patch below doing what I've described above. If this is an accepted solution
>>>>> I believe other filesystems may also be affected by this and I could provide
>>>>> a patch for those as well.
>>>>
>>>> I don't know... The thing is, we mostly do that to make life easier for
>>>> fsck in case of crash. Other than that, there's no reason to play with
>>>> link count of that sucker at all. The question is, do we really want
>>>> such rename() interrupted by dirty shutdown to result in what looks like two
>>>> legitimate links to that inode without any indications of what had happened?
>>>> Note that fsck (at least on ext2) will correct link counts anyway and if
>>>> nothing else, we probably want some noise pointing to the inode in question...
>>> Yeah, I agree that playing with the link count is not worth it. It is
>>> even more disputable because it would have some reasonable effect only if
>>> we happened to write out the moved inode after it is linked to the new
>>> directory and before it is unlinked from the old one. Moreover we'd need
>>> to writeout the new directory and not the old directory before crash
>>> happens. All this is highly unlikely and even if that happens, it is
>>> questionable whether the result is worth it. So I'll just do away with
>>> those games with link count...
>>> The patch is attached. Josh, can you test it as well? Thanks.
>>>
>>> Honza
>> Jan
>>
>> I'm not seeing the problem with your patch as was expected since we're
>> not messing with i_nlink anymore. Al suggested marking the inode as
>> dirty where we were previously doing the old_inode dec. I believe this
>> is needed as well since we are updating it's ctime. I've attached a
>> version marking the inode dirty and it also fixes the comment making
>> reference to calling inode_dec_link_count().
> Yeah, good catch. Thanks.
>
>> I'm not completely clear on the historical reasons for messing with the
>> link count of old_inode in the first place. It was just to simulate the
>> linking and unlinking of the old_inode?
> Yes.
>
> So I took your patch and used a changelog from mine as I find it more
> descriptive. The resulting patch is in my tree (and attached).
>
> Honza
Jan
Thanks. We should probably send this to the stable guys as well.
I've found the same issue with a few other filesystems. I'll bundle up
those patches and send out a set in the coming days along with an easily
reproducible testcase.
Josh
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
2011-02-28 20:17 ` Josh Hunt
@ 2011-02-28 20:56 ` Jan Kara
2011-03-01 12:15 ` Al Viro
0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2011-02-28 20:56 UTC (permalink / raw)
To: Josh Hunt
Cc: Jan Kara, Al Viro, linux-ext4@vger.kernel.org, sandeen@redhat.com,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On Mon 28-02-11 12:17:55, Josh Hunt wrote:
> On 02/28/2011 09:57 AM, Jan Kara wrote:
> > So I took your patch and used a changelog from mine as I find it more
> > descriptive. The resulting patch is in my tree (and attached).
> >
> > Honza
> Jan
>
> Thanks. We should probably send this to the stable guys as well.
Good point. Made sure they are warned when the patch gets in.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
2011-02-24 20:18 ` Josh Hunt
2011-02-25 7:38 ` Marco Stornelli
2011-02-28 17:57 ` Jan Kara
@ 2011-03-01 11:07 ` Christoph Hellwig
2 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-03-01 11:07 UTC (permalink / raw)
To: Josh Hunt
Cc: Jan Kara, Al Viro, linux-ext4@vger.kernel.org, sandeen@redhat.com,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Can you wire up your testcase with xfstests so we can regression test
for it in the future?
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename
2011-02-28 20:56 ` Jan Kara
@ 2011-03-01 12:15 ` Al Viro
0 siblings, 0 replies; 11+ messages in thread
From: Al Viro @ 2011-03-01 12:15 UTC (permalink / raw)
To: Jan Kara
Cc: Josh Hunt, linux-ext4@vger.kernel.org, sandeen@redhat.com,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On Mon, Feb 28, 2011 at 09:56:52PM +0100, Jan Kara wrote:
> On Mon 28-02-11 12:17:55, Josh Hunt wrote:
> > On 02/28/2011 09:57 AM, Jan Kara wrote:
> > > So I took your patch and used a changelog from mine as I find it more
> > > descriptive. The resulting patch is in my tree (and attached).
> > >
> > > Honza
> > Jan
> >
> > Thanks. We should probably send this to the stable guys as well.
> Good point. Made sure they are warned when the patch gets in.
I have an equivalent of that for other affected fs in my tree; will go
to Linus in an hour or so.
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2011-03-01 12:15 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-02-24 6:21 [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename Josh Hunt
2011-02-24 6:37 ` Al Viro
2011-02-24 6:42 ` fat64 implementation Keshava Munegowda
2011-02-24 11:20 ` [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename Jan Kara
2011-02-24 20:18 ` Josh Hunt
2011-02-25 7:38 ` Marco Stornelli
2011-02-28 17:57 ` Jan Kara
2011-02-28 20:17 ` Josh Hunt
2011-02-28 20:56 ` Jan Kara
2011-03-01 12:15 ` Al Viro
2011-03-01 11:07 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).