From: Carsten Otte <cotte@freenet.de>
To: linux-fsdevel@vger.kernel.org, akpm@digeo.com, torvalds@osdl.org
Cc: schwidefsky@de.ibm.com, cotte@de.ibm.com
Subject: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
Date: Thu, 19 Feb 2004 13:21:39 +0100
Message-ID: <200402191321.39592.cotte@freenet.de>
Hi all,
we recently ran into a problem during our 2.6 test activities on the s390
architecture. The problem occurred while running a glibc build on Linux 2.6.2
in a z/VM virtual machine with 6 processors on a z990 server
(6-processor SMP, 64-bit, big endian).
We were able to identify ext3 as the cause of the problem with the following
debugging patch:
diff -ruN linux-2.6.2/fs/ext3/super.c linux-2.6.2+bug_statement/fs/ext3/super.c
--- linux-2.6.2/fs/ext3/super.c 2004-02-19 12:52:01.000000000 +0100
+++ linux-2.6.2+bug_statement/fs/ext3/super.c 2004-02-19 12:51:35.000000000 +0100
@@ -449,6 +449,7 @@
 static void ext3_destroy_inode(struct inode *inode)
 {
+	BUG_ON (!list_empty(&EXT3_I(inode)->i_orphan));
 	kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
 }
The output of the BUG_ON statement shows that prune_icache() [fs/inode.c]
picks an inode with i_count == 0 and calls (via dispose_list()) clear_inode()
and destroy_inode(). Because the inode is still in use by ext3 internally,
ext3 later reads from or writes to freed memory, causing random behavior of
the system. I think this BUG_ON should go into the vanilla kernel to prevent
data corruption.
So far, the above was only reproducible in rare conditions when the system
was under heavy memory pressure, so I debugged this further.
The problem seemed to be related to reference counting of inodes. Therefore I
added a debugging patch to iput_final() [fs/inode.c] that checks whether an
inode that is being dropped is still on the ext3 orphan list. This situation
is quite easy to reproduce, and it has shown one example path where this
happens:
- A user process calls sys_unlink() [fs/namei.c], which increments i_count,
calls vfs_unlink() [fs/namei.c] and afterwards calls iput().
- vfs_unlink() [fs/namei.c] works with the dentry, calls i_op->unlink() (in
this case ext3_unlink()) and returns.
- ext3_unlink() [FILE] decrements i_nlink and adds the inode to the s_orphan
list before returning.
After sys_unlink() has completed, the inode is still referenced by ext3 while
i_count has the same value as before, which triggers the problem if
prune_icache() now chooses this inode to be freed.
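To make the sequence above concrete, here is a small userspace sketch. It is
not kernel code: struct toy_inode, toy_unlink() and toy_prunable() are
hypothetical names, the list helpers only imitate the kernel's list_head, and
a starting i_count of 0 is an assumption chosen to match the case described
in this mail.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list, shaped like the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static bool list_is_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_entry(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Toy inode (hypothetical): only the fields that matter for this scenario. */
struct toy_inode {
	int i_count;               /* references the VFS knows about */
	int i_nlink;               /* link count */
	struct list_head i_orphan; /* fs-internal reference, invisible to VFS */
};

/* Models sys_unlink() -> vfs_unlink() -> ext3_unlink() as described above. */
static void toy_unlink(struct toy_inode *inode, struct list_head *s_orphan)
{
	inode->i_count++;                           /* sys_unlink() takes a reference */
	inode->i_nlink--;                           /* ext3_unlink() drops the link */
	list_add_entry(&inode->i_orphan, s_orphan); /* ... and queues the orphan */
	inode->i_count--;                           /* sys_unlink() calls iput() */
}

/* The only reference-count condition this model checks before pruning. */
static bool toy_prunable(const struct toy_inode *inode)
{
	return inode->i_count == 0;
}
```

In this model, an inode that starts with i_count == 0 is reported as prunable
after toy_unlink() even though list_is_empty(&inode->i_orphan) is false, which
is the mismatch the BUG_ON above catches.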
Possible ways to fix the above problem are to change the reference counting
for inodes or to make prune_icache() aware of the internal reference to the
inode (preferably without knowing about the internal data structures of the
filesystem, which would be a layering violation). The patch below takes the
second approach:
- adds an additional super_operation, s_op->inode_busy(), allowing the VFS to
query whether an inode is still internally referenced by the fs
- adds ext3_inode_busy() to ext3, which checks whether the inode is still
internally referenced
- changes prune_icache() to query inode_busy() if this s_op is implemented
by the individual fs
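The three points above amount to an optional callback that the generic code
consults before it frees an inode. The following userspace sketch shows only
that pattern; struct demo_inode, struct demo_super_operations,
demo_ext3_inode_busy() and demo_can_prune() are hypothetical stand-ins, and
the on_orphan_list flag is an assumption that replaces the real
!list_empty(&EXT3_I(inode)->i_orphan) test.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_inode;

/* Per-filesystem operations; inode_busy is optional and may be NULL. */
struct demo_super_operations {
	int (*inode_busy)(struct demo_inode *inode);
};

struct demo_inode {
	int i_count;         /* references the generic layer knows about */
	bool on_orphan_list; /* stand-in for the fs-internal orphan list */
	const struct demo_super_operations *s_op;
};

/* Filesystem side: report the internal reference without exposing how
 * it is stored. */
static int demo_ext3_inode_busy(struct demo_inode *inode)
{
	return inode->on_orphan_list ? 1 : 0;
}

/* Generic side: an inode may be pruned only if it has no counted
 * references and the filesystem, if it implements the hook, does not
 * report it as busy. */
static bool demo_can_prune(struct demo_inode *inode)
{
	if (inode->i_count)
		return false;
	if (inode->s_op->inode_busy && inode->s_op->inode_busy(inode))
		return false;
	return true;
}
```

A filesystem that leaves inode_busy as NULL keeps the old behavior, so only
filesystems with internal references need to implement the hook.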
Patch fixing the problem:
diff -ruN linux-2.6.2/fs/ext3/super.c linux-2.6.2+ext3fix/fs/ext3/super.c
--- linux-2.6.2/fs/ext3/super.c 2004-02-19 12:46:35.000000000 +0100
+++ linux-2.6.2+ext3fix/fs/ext3/super.c 2004-02-04 17:50:03.000000000 +0100
@@ -453,6 +453,14 @@
 	kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
 }
+static int ext3_inode_busy(struct inode *inode)
+{
+	if (!list_empty(&EXT3_I(inode)->i_orphan))
+		return 1;
+	else
+		return 0;
+}
+
 static void init_once(void * foo, kmem_cache_t * cachep, unsigned long flags)
 {
 	struct ext3_inode_info *ei = (struct ext3_inode_info *) foo;
@@ -510,6 +518,7 @@
 static struct super_operations ext3_sops = {
 	.alloc_inode = ext3_alloc_inode,
 	.destroy_inode = ext3_destroy_inode,
+	.inode_busy = ext3_inode_busy,
 	.read_inode = ext3_read_inode,
 	.write_inode = ext3_write_inode,
 	.dirty_inode = ext3_dirty_inode,
diff -ruN linux-2.6.2/fs/inode.c linux-2.6.2+ext3fix/fs/inode.c
--- linux-2.6.2/fs/inode.c 2004-02-19 12:46:35.000000000 +0100
+++ linux-2.6.2+ext3fix/fs/inode.c 2004-01-28 09:02:25.000000000 +0100
@@ -391,6 +391,9 @@
 		return 0;
 	if (inode->i_data.nrpages)
 		return 0;
+	if (inode->i_sb->s_op->inode_busy
+	    && inode->i_sb->s_op->inode_busy(inode))
+		return 0;
 	return 1;
 }
@@ -424,7 +427,9 @@
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
-		if (inode->i_state || atomic_read(&inode->i_count)) {
+		if (inode->i_state || atomic_read(&inode->i_count)
+		    || (inode->i_sb->s_op->inode_busy
+			&& (inode->i_sb->s_op->inode_busy(inode)))) {
 			list_move(&inode->i_list, &inode_unused);
 			continue;
 		}
diff -ruN linux-2.6.2/include/linux/fs.h linux-2.6.2+ext3fix/include/linux/fs.h
--- linux-2.6.2/include/linux/fs.h 2004-02-19 12:46:35.000000000 +0100
+++ linux-2.6.2+ext3fix/include/linux/fs.h 2004-02-18 18:45:15.000000000 +0100
@@ -866,6 +866,7 @@
 struct super_operations {
 	struct inode *(*alloc_inode)(struct super_block *sb);
 	void (*destroy_inode)(struct inode *);
+	int (*inode_busy)(struct inode *);
 	void (*read_inode) (struct inode *);