[PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
@ 2004-02-19 12:21 Carsten Otte
  2004-02-19 16:53 ` Linus Torvalds
       [not found] ` <20040220164325.659c4e45.akpm@osdl.org>
  0 siblings, 2 replies; 47+ messages in thread
From: Carsten Otte @ 2004-02-19 12:21 UTC (permalink / raw)
  To: linux-fsdevel, akpm, torvalds; +Cc: schwidefsky, cotte

Hi all,

recently we ran into a problem found during our 2.6. test activities on s390 
architecture. The problem occured running a glibc build on Linux 2.6.2. 
running in a z/VM virtual machine with 6 processors on a z990 Server 
(6Processor SMP, 64-bit, big endian).
We were able to identify ext3 as the cause of the problem with the following 
debugging patch:
diff -ruN linux-2.6.2/fs/ext3/super.c 
linux-2.6.2+bug_statement/fs/ext3/super.c
--- linux-2.6.2/fs/ext3/super.c 2004-02-19 12:52:01.000000000 +0100
+++ linux-2.6.2+bug_statement/fs/ext3/super.c   2004-02-19 12:51:35.000000000 
+0100
@@ -449,6 +449,7 @@
 
 static void ext3_destroy_inode(struct inode *inode)
 {
+       BUG_ON (!list_empty(&EXT3_I(inode)->i_orphan));
        kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
 }
 

The output of the BUG_ON statement shows, that prune_icache [fs/inode.c] picks 
an inode with i_count == 0 and calls (via dispose_list) clear_inode() and 
destroy_inode(). Because the inode is still in use by ext3 internally, ext3 
later on reads from or writes to freed memory causing random behavior of the 
system. I think, this BUG_ON should go into the vanilla kernel to prevent 
data corruption.

Above was so far only reproducable in rare conditions when the system was 
under heavy memory pressure. Therefore I continued to debug this further:
The problem seemed to be related to reference counting of inodes. Therefore I 
added a debugging patch to iput_final() [fs/inode.c], that checks if an inode 
that is being dropped is still in the ext3 orphan list. This situation is 
quite easy to reproduce, and has shown one example path were this happens:
A user process calls sys_unlink() [fs/namei.c], which increments i_count, 
calls vfs_unlink [fs/namei.c] and afterwards calls iput().
vfs_unlink [fs/namei.c] works with the dentry, calls i_op->unlink() (in this 
case ext3_unlink) and returns.
ext3_unlink [FILE] decrements i_nlink and adds the inode to the s_orphan list 
before returning.
After sys_unlink() has completed, the inode is still referenced by ext3 while 
i_count has the same value like before, which triggers the problem in case 
prune_icache would now choose this inode to be freed.

Possible ways to fix above problem is change reference counting for inodes or 
make the prune_icache function aware of the internal reference to the inode 
(preferably without knowing about the internal data structures of the 
filesystem which would be a layering violation). The patch below does take 
the 2nd approach:
- adds additional super_operation s_op->inode_busy() allowing VFS to query if 
an inode is still internally referenced by the fs.
- adds ext3_inode_busy() to ext3 that checks if he inode is still internally 
referenced
- changes prune_icache to query inode_busy() in case this s_op is implemented 
by the individual fs

Patch fixing the problem:
diff -ruN linux-2.6.2/fs/ext3/super.c linux-2.6.2+ext3fix/fs/ext3/super.c
--- linux-2.6.2/fs/ext3/super.c 2004-02-19 12:46:35.000000000 +0100
+++ linux-2.6.2+ext3fix/fs/ext3/super.c 2004-02-04 17:50:03.000000000 +0100
@@ -453,6 +453,14 @@
        kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
 }
 
+static int ext3_inode_busy(struct inode *inode)
+{
+       if (!list_empty(&EXT3_I(inode)->i_orphan)) 
+               return 1;
+       else
+               return 0;
+}
+
 static void init_once(void * foo, kmem_cache_t * cachep, unsigned long flags)
 {
        struct ext3_inode_info *ei = (struct ext3_inode_info *) foo;
@@ -510,6 +518,7 @@
 static struct super_operations ext3_sops = {
        .alloc_inode    = ext3_alloc_inode,
        .destroy_inode  = ext3_destroy_inode,
+       .inode_busy     = ext3_inode_busy,
        .read_inode     = ext3_read_inode,
        .write_inode    = ext3_write_inode,
        .dirty_inode    = ext3_dirty_inode,
diff -ruN linux-2.6.2/fs/inode.c linux-2.6.2+ext3fix/fs/inode.c
--- linux-2.6.2/fs/inode.c      2004-02-19 12:46:35.000000000 +0100
+++ linux-2.6.2+ext3fix/fs/inode.c      2004-01-28 09:02:25.000000000 +0100
@@ -391,6 +391,9 @@
                return 0;
        if (inode->i_data.nrpages)
                return 0;
+       if (inode->i_sb->s_op->inode_busy
+           && inode->i_sb->s_op->inode_busy(inode))
+               return 0;
        return 1;
 }
 
@@ -424,7 +427,9 @@
 
                inode = list_entry(inode_unused.prev, struct inode, i_list);
 
-               if (inode->i_state || atomic_read(&inode->i_count)) {
+               if (inode->i_state || atomic_read(&inode->i_count)
+                   || (inode->i_sb->s_op->inode_busy 
+                        && (inode->i_sb->s_op->inode_busy(inode)))) {
                        list_move(&inode->i_list, &inode_unused);
                        continue;
                }
diff -ruN linux-2.6.2/include/linux/fs.h 
linux-2.6.2+ext3fix/include/linux/fs.h
--- linux-2.6.2/include/linux/fs.h      2004-02-19 12:46:35.000000000 +0100
+++ linux-2.6.2+ext3fix/include/linux/fs.h      2004-02-18 18:45:15.000000000 
+0100
@@ -866,6 +866,7 @@
 struct super_operations {
        struct inode *(*alloc_inode)(struct super_block *sb);
        void (*destroy_inode)(struct inode *);
+       int  (*inode_busy)(struct inode *);
 
        void (*read_inode) (struct inode *);
   

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-02-19 12:21 [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure Carsten Otte
@ 2004-02-19 16:53 ` Linus Torvalds
  2004-02-19 17:39   ` Stephen C. Tweedie
  2004-02-19 20:19   ` Carsten Otte
       [not found] ` <20040220164325.659c4e45.akpm@osdl.org>
  1 sibling, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2004-02-19 16:53 UTC (permalink / raw)
  To: Carsten Otte
  Cc: linux-fsdevel, Andrew Morton, Stephen C. Tweedie, schwidefsky,
	cotte



Thanks Carsten,

On Thu, 19 Feb 2004, Carsten Otte wrote:
>
> ext3_unlink [FILE] decrements i_nlink and adds the inode to the s_orphan list 
> before returning.
> After sys_unlink() has completed, the inode is still referenced by ext3 while 
> i_count has the same value like before, which triggers the problem in case 
> prune_icache would now choose this inode to be freed.

Ouch. ext3 should definitely keep the inode count elevated while it is on 
the s_orphan list, and do the final iput() on it when removing it from the 
orphan list.

> Possible ways to fix above problem is change reference counting for inodes or 
> make the prune_icache function aware of the internal reference to the inode 
> (preferably without knowing about the internal data structures of the 
> filesystem which would be a layering violation). The patch below does take 
> the 2nd approach:
> - adds additional super_operation s_op->inode_busy() allowing VFS to query if 
> an inode is still internally referenced by the fs.
> - adds ext3_inode_busy() to ext3 that checks if he inode is still internally 
> referenced
> - changes prune_icache to query inode_busy() in case this s_op is implemented 
> by the individual fs

I really think this is wrong. The bug is clearly in ext3 reference count 
handling, and I'd much rather just have ext3 fix its reference count than 
add a totally new interface for ext3 breakage.

Andrew, Stephen? Comments?

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-02-19 16:53 ` Linus Torvalds
@ 2004-02-19 17:39   ` Stephen C. Tweedie
  2004-02-19 18:49     ` Andrew Morton
  2004-02-19 20:14     ` Carsten Otte
  2004-02-19 20:19   ` Carsten Otte
  1 sibling, 2 replies; 47+ messages in thread
From: Stephen C. Tweedie @ 2004-02-19 17:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Carsten Otte, linux-fsdevel, Andrew Morton, schwidefsky, cotte,
	Stephen Tweedie

Hi,

On Thu, 2004-02-19 at 16:53, Linus Torvalds wrote:

> > ext3_unlink [FILE] decrements i_nlink and adds the inode to the s_orphan list 
> > before returning.
> > After sys_unlink() has completed, the inode is still referenced by ext3 while 
> > i_count has the same value like before, which triggers the problem in case 
> > prune_icache would now choose this inode to be freed.
> 
> Ouch. ext3 should definitely keep the inode count elevated while it is on 
> the s_orphan list, and do the final iput() on it when removing it from the 
> orphan list.

An orphan is either a file being truncated (if the truncate spans
multiple transactions), or a file that's deleted but still open.  

In the latter case, we need to preserve the inode on the on-disk orphan
list until the final user closes it.  The final remove-from-orphan-list
really has to happen in ext3_inode_delete, which happens after the last
iput(); so if we held an extra refcount ourselves, the inode would never
get deleted at all.

At the end of ext3_truncate(), we do:

	if (inode->i_nlink)
		ext3_orphan_del(handle, inode);

so the only time we'll be left with the inode on the orphan list after
the truncate is on a true final-unlink, not on a [f]truncate().  If it's
unlink, we rely on the delete_inode() processing to call us and remove
the orphan reference before the inode is destroyed.

I can't immediately see why this should be wrong --- delete_inode() is
supposed to give us this chance to cleanup before the struct inode
actually gets nuked.  The orphan list entry should persist only as long
as inode->i_count is raised, and if it's raised, there should be no risk
of the struct getting reused.

I may be missing some background from previous emails --- is this just a
hypothetical race, or have we actually seen this mechanism breaking down
in practice?

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
@ 2004-02-19 18:00 Martin Schwidefsky
  0 siblings, 0 replies; 47+ messages in thread
From: Martin Schwidefsky @ 2004-02-19 18:00 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel,
	Stephen Tweedie, Linus Torvalds






Hi Stephen,

> I may be missing some background from previous emails --- is this just a
> hypothetical race, or have we actually seen this mechanism breaking down
> in practice?

I have seen this in practice on a s390 machine with 6 processor, 256 MB main
storage and compiling the glibc. I limited the storage to 256 MB on purpose
to get the machine to do heavy swapping. In the middle of the glibc build
it generates the locales and this uses a lot of storage with -j6. The crash
itself is anywhere and it took me a while to find out that a list_add in
ext3_orphan_add accessed the list_head in an already freed inode.

For easier recreation I added a BUG_ON in ext3_destroy_inode:

diff -ruN linux-2.6.2/fs/ext3/super.c
linux-2.6.2+bug_statement/fs/ext3/super.c
--- linux-2.6.2/fs/ext3/super.c 2004-02-19 12:52:01.000000000 +0100
+++ linux-2.6.2+bug_statement/fs/ext3/super.c   2004-02-19 12:51:35.000000000
+0100
@@ -449,6 +449,7 @@+       BUG_ON (!list_empty(&EXT3_I(inode)->i_orphan));
I can easily trigger the BUG_ON on my machine with a simple rpmbuild.

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-02-19 17:39   ` Stephen C. Tweedie
@ 2004-02-19 18:49     ` Andrew Morton
  2004-02-19 20:28       ` Carsten Otte
  2004-02-19 20:14     ` Carsten Otte
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2004-02-19 18:49 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: torvalds, cotte, linux-fsdevel, schwidefsky, cotte, sct

"Stephen C. Tweedie" <sct@redhat.com> wrote:
>
> At the end of ext3_truncate(), we do:
> 
>  	if (inode->i_nlink)
>  		ext3_orphan_del(handle, inode);

This bug seems to only manifest on s390, which makes me suspect it is a
race.  Because s390's I think tend to have wider race windows at times. 
(But no reports from NUMAQ, which also has wide windows...)

I wonder about the i_nlink locking.  Perhaps we lack some, and we are
accidentally leaving indoes on the orphan list.

There are a lot of places where we don't actually hold inode->i_sem while
diddling inode->i_nlink: ext3_unlink, ext3_rename, others.

hmm, I wonder if that could be it?  If the inode is hardlinked into a
different directory then the parent dir's i_sem gives us no i_nlink
protection.

Carsten, are you aware of a particular workload which triggers this?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-02-19 17:39   ` Stephen C. Tweedie
  2004-02-19 18:49     ` Andrew Morton
@ 2004-02-19 20:14     ` Carsten Otte
  2004-02-20  3:41       ` Andrew Morton
  1 sibling, 1 reply; 47+ messages in thread
From: Carsten Otte @ 2004-02-19 20:14 UTC (permalink / raw)
  To: Stephen C. Tweedie, Linus Torvalds
  Cc: linux-fsdevel, Andrew Morton, schwidefsky, cotte, Stephen Tweedie

Stephen wrote:
> The orphan list entry should persist only as long
> as inode->i_count is raised, and if it's raised, there should be no risk
> of the struct getting reused.
I tried the following *dirty* debugging patch that triggers in iput_final() 
exactly when i_count is lowered to zero while the orphan list entry does 
still persist.
I have not been able to startup my system, panic output shows the 
sys_unlink() path  -like fully described in the initial mail starting this 
thread- causes running into the panic()
diff -ruN linux-2.6.3/fs/inode.c linux-2.6.3+debug/fs/inode.c
--- linux-2.6.3/fs/inode.c	2004-01-28 09:02:25.000000000 +0100
+++ linux-2.6.3+debug/fs/inode.c	2004-02-19 21:03:26.000000000 +0100
@@ -1058,6 +1058,96 @@
 		generic_forget_inode(inode);
 }
 
+struct ext3_inode_info {
+        __u32   i_data[15];
+        __u32   i_flags;
+#ifdef EXT3_FRAGMENTS
+        __u32   i_faddr;
+        __u8    i_frag_no;
+        __u8    i_frag_size;
+#endif
+        __u32   i_file_acl;
+        __u32   i_dir_acl;
+        __u32   i_dtime;
+
+        /*
+         * i_block_group is the number of the block group which contains
+         * this file's inode.  Constant across the lifetime of the inode,
+         * it is ued for making block allocation decisions - we try to
+         * place a file's data blocks near its inode block, and new inodes
+         * near to their parent directory's inode.
+         */
+        __u32   i_block_group;
+        __u32   i_state;                /* Dynamic state flags for ext3 */
+
+        /*
+         * i_next_alloc_block is the logical (file-relative) number of the
+         * most-recently-allocated block in this file.  Yes, it is misnamed.
+         * We use this for detecting linearly ascending allocation requests.
+         */
+        __u32   i_next_alloc_block;
+
+        /*
+         * i_next_alloc_goal is the *physical* companion to 
i_next_alloc_block.
+         * it the the physical block number of the block which was 
most-recently
+         * allocated to this file.  This give us the goal (target) for the 
next
+         * allocation when we detect linearly ascending requests.
+         */
+        __u32   i_next_alloc_goal;
+#ifdef EXT3_PREALLOCATE
+        __u32   i_prealloc_block;
+        __u32   i_prealloc_count;
+#endif
+        __u32   i_dir_start_lookup;
+#ifdef CONFIG_EXT3_FS_XATTR
+        /*
+         * Extended attributes can be read independently of the main file
+         * data. Taking i_sem even when reading would cause contention
+         * between readers of EAs and writers of regular file data, so
+         * instead we synchronize on xattr_sem when reading or changing
+         * EAs.
+         */
+        struct rw_semaphore xattr_sem;
+#endif
+#ifdef CONFIG_EXT3_FS_POSIX_ACL
+        struct posix_acl        *i_acl;
+        struct posix_acl        *i_default_acl;
+#endif
+
+        struct list_head i_orphan;      /* unlinked but open inodes */
+
+        /*
+         * i_disksize keeps track of what the inode size is ON DISK, not
+         * in memory.  During truncate, i_size is set to the new size by
+         * the VFS prior to calling ext3_truncate(), but the filesystem won't
+         * set i_disksize to 0 until the truncate is actually under way.
+         *
+         * The intent is that i_disksize always represents the blocks which
+         * are used by this file.  This allows recovery to restart truncate
+         * on orphans if we crash during truncate.  We actually write 
i_disksize
+         * into the on-disk inode when writing inodes out, instead of i_size.
+         *
+         * The only time when i_disksize and i_size may be different is when
+         * a truncate is in progress.  The only things which change 
i_disksize
+         * are ext3_get_block (growth) and ext3_truncate (shrinkth).
+         */
+        loff_t  i_disksize;
+
+        /*
+         * truncate_sem is for serialising ext3_truncate() against
+         * ext3_getblock().  In the 2.4 ext2 design, great chunks of inode's
+         * data tree are chopped off during truncate. We can't do that in
+         * ext3 because whenever we perform intermediate commits during
+         * truncate, the inode and all the metadata blocks *must* be in a
+         * consistent state which allows truncation of the orphans to restart
+         * during recovery.  Hence we must fix the get_block-vs-truncate race
+         * by other means, so we have truncate_sem.
+         */
+        struct semaphore truncate_sem;
+        struct inode vfs_inode;
+};
+
+
 /*
  * Called when we're dropping the last reference
  * to an inode. 
@@ -1073,6 +1163,15 @@
 {
 	struct super_operations *op = inode->i_sb->s_op;
 	void (*drop)(struct inode *) = generic_drop_inode;
+	struct ext3_inode_info* ext3_i;
+
+	if (!strcmp (inode->i_sb->s_type->name, "ext3")) {
+		// this is an ext3 inode. check if it is still orphan
+		ext3_i = container_of (inode, struct ext3_inode_info, vfs_inode);
+		if (!list_empty (&ext3_i->i_orphan)) {
+			panic ("iput_final: found inode in ext3's orphan list, am asked to drop it
\n");
+		}
+	}
 
 	if (op && op->drop_inode)
 		drop = op->drop_inode;



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-02-19 16:53 ` Linus Torvalds
  2004-02-19 17:39   ` Stephen C. Tweedie
@ 2004-02-19 20:19   ` Carsten Otte
  1 sibling, 0 replies; 47+ messages in thread
From: Carsten Otte @ 2004-02-19 20:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Andrew Morton, Stephen C. Tweedie, schwidefsky,
	cotte

Am Donnerstag 19 Februar 2004 17:53 schrieb Linus Torvalds:
> I really think this is wrong. The bug is clearly in ext3 reference count
> handling, and I'd much rather just have ext3 fix its reference count than
> add a totally new interface for ext3 breakage.
I do agree that the final patch fixing the problem should not add a new 
interface. This one clearly should not go into vanilla, but consider adding 
the BUG_ON statement at least for the time being until there is a real fix.
diff -ruN linux-2.6.2/fs/ext3/super.c 
linux-2.6.2+bug_statement/fs/ext3/super.c
--- linux-2.6.2/fs/ext3/super.c 2004-02-19 12:52:01.000000000 +0100
+++ linux-2.6.2+bug_statement/fs/ext3/super.c   2004-02-19 12:51:35.000000000 
+0100
@@ -449,6 +449,7 @@
 
 static void ext3_destroy_inode(struct inode *inode)
 {
+       BUG_ON (!list_empty(&EXT3_I(inode)->i_orphan));
        kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
 }
 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-02-19 20:28       ` Carsten Otte
@ 2004-02-19 20:26         ` viro
  2004-02-19 20:35           ` Carsten Otte
  0 siblings, 1 reply; 47+ messages in thread
From: viro @ 2004-02-19 20:26 UTC (permalink / raw)
  To: Carsten Otte
  Cc: Andrew Morton, Stephen C. Tweedie, torvalds, linux-fsdevel,
	schwidefsky, cotte

On Thu, Feb 19, 2004 at 09:28:06PM +0100, Carsten Otte wrote:
> Am Donnerstag 19 Februar 2004 19:49 schrieb Andrew Morton:
> > There are a lot of places where we don't actually hold inode->i_sem while
> > diddling inode->i_nlink: ext3_unlink, ext3_rename, others.
> >
> > hmm, I wonder if that could be it?  If the inode is hardlinked into a
> > different directory then the parent dir's i_sem gives us no i_nlink
> > protection.
> >
> > Carsten, are you aware of a particular workload which triggers this?
> So far not, all I know is the symptom and a probable cause may be the 
> sys_unlink path. I will try to dig that direction with another debugging 
> patch tomorrow.

We hold ->i_sem *on* *the* *inode*.  And yes, unlink does grab it both
on parent and child.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-02-19 18:49     ` Andrew Morton
@ 2004-02-19 20:28       ` Carsten Otte
  2004-02-19 20:26         ` viro
  0 siblings, 1 reply; 47+ messages in thread
From: Carsten Otte @ 2004-02-19 20:28 UTC (permalink / raw)
  To: Andrew Morton, Stephen C. Tweedie
  Cc: torvalds, linux-fsdevel, schwidefsky, cotte, sct

Am Donnerstag 19 Februar 2004 19:49 schrieb Andrew Morton:
> There are a lot of places where we don't actually hold inode->i_sem while
> diddling inode->i_nlink: ext3_unlink, ext3_rename, others.
>
> hmm, I wonder if that could be it?  If the inode is hardlinked into a
> different directory then the parent dir's i_sem gives us no i_nlink
> protection.
>
> Carsten, are you aware of a particular workload which triggers this?
So far not, all I know is the symptom and a probable cause may be the 
sys_unlink path. I will try to dig that direction with another debugging 
patch tomorrow.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-02-19 20:26         ` viro
@ 2004-02-19 20:35           ` Carsten Otte
  0 siblings, 0 replies; 47+ messages in thread
From: Carsten Otte @ 2004-02-19 20:35 UTC (permalink / raw)
  To: viro
  Cc: Andrew Morton, Stephen C. Tweedie, torvalds, linux-fsdevel,
	schwidefsky, cotte

Am Donnerstag 19 Februar 2004 21:26 schrieb 
viro@parcelfarce.linux.theplanet.co.uk:
> We hold ->i_sem *on* *the* *inode*.  And yes, unlink does grab it both
> on parent and child.
True. The problem is, that both are released before the inode is been removed 
from the s_orphan list.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-02-19 20:14     ` Carsten Otte
@ 2004-02-20  3:41       ` Andrew Morton
  0 siblings, 0 replies; 47+ messages in thread
From: Andrew Morton @ 2004-02-20  3:41 UTC (permalink / raw)
  To: Carsten Otte; +Cc: sct, torvalds, linux-fsdevel, schwidefsky, cotte

Carsten Otte <cotte@freenet.de> wrote:
>
>  I tried the following *dirty* debugging patch that triggers in iput_final() 
>  exactly when i_count is lowered to zero while the orphan list entry does 
>  still persist.

That patch isn't right.  Note how we go on to generic_delete_inode() which
then does the orphan_del().

The below patch should detect us adding orphan inodes to inode_unused. 
Could you see if this triggers?   It doesn't for me, of course :(

Also, I have gone through and verified that all modifications of
inode->i_nlink are happening under inode->i_sem, so scrub that theory.

I'm a bit suspicious of the code in __sync_single_inode(), where it decides
to dump the inode on inode_unused.  Maybe it's not checking all the right
things.  Still, that doesn't explain why the inode gets through the
prune_icache checks as well.


Both patches are here:

diff -puN fs/inode.c~ext3-dirty-debug-patch fs/inode.c
--- 25/fs/inode.c~ext3-dirty-debug-patch	2004-02-19 18:32:44.000000000 -0800
+++ 25-akpm/fs/inode.c	2004-02-19 18:50:23.000000000 -0800
@@ -20,6 +20,7 @@
 #include <linux/security.h>
 #include <linux/pagemap.h>
 #include <linux/cdev.h>
+#include <linux/ext3_fs_i.h>
 
 /*
  * This is needed for the following functions:
@@ -1019,6 +1020,14 @@ static void generic_forget_inode(struct 
 
 	if (!hlist_unhashed(&inode->i_hash)) {
 		if (!(inode->i_state & (I_DIRTY|I_LOCK))) {
+			if (!strcmp(inode->i_sb->s_type->name, "ext3")) {
+				struct ext3_inode_info *ext3_i;
+				ext3_i = container_of(inode,
+						struct ext3_inode_info,
+						vfs_inode);
+				if (!list_empty(&ext3_i->i_orphan))
+					WARN_ON(1);
+			}
 			list_del(&inode->i_list);
 			list_add(&inode->i_list, &inode_unused);
 		}
@@ -1066,7 +1075,7 @@ static void generic_drop_inode(struct in
  * held, and the drop function is supposed to release
  * the lock!
  */
-static inline void iput_final(struct inode *inode)
+static void iput_final(struct inode *inode)
 {
 	struct super_operations *op = inode->i_sb->s_op;
 	void (*drop)(struct inode *) = generic_drop_inode;
diff -puN fs/fs-writeback.c~ext3-dirty-debug-patch fs/fs-writeback.c
--- 25/fs/fs-writeback.c~ext3-dirty-debug-patch	2004-02-19 19:33:20.000000000 -0800
+++ 25-akpm/fs/fs-writeback.c	2004-02-19 19:33:23.000000000 -0800
@@ -22,6 +22,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/ext3_fs_i.h>
 
 extern struct super_block *blockdev_superblock;
 
@@ -187,6 +188,14 @@ __sync_single_inode(struct inode *inode,
 			list_move(&inode->i_list, &inode_in_use);
 		} else {
 			mapping->dirtied_when = 0;
+			if (!strcmp(inode->i_sb->s_type->name, "ext3")) {
+				struct ext3_inode_info *ext3_i;
+				ext3_i = container_of(inode,
+						struct ext3_inode_info,
+						vfs_inode);
+				if (!list_empty(&ext3_i->i_orphan))
+					WARN_ON(1);
+			}
 			list_move(&inode->i_list, &inode_unused);
 		}
 	}

_




 fs/ext3/ialloc.c |    0 
 fs/ext3/inode.c  |    0 
 fs/ext3/namei.c  |   49 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext3/super.c  |    0 
 4 files changed, 47 insertions(+), 2 deletions(-)

diff -puN fs/ext3/ialloc.c~ext3-i_nlink-check fs/ext3/ialloc.c
diff -puN fs/ext3/inode.c~ext3-i_nlink-check fs/ext3/inode.c
diff -puN fs/ext3/namei.c~ext3-i_nlink-check fs/ext3/namei.c
--- 25/fs/ext3/namei.c~ext3-i_nlink-check	2004-02-19 19:14:12.000000000 -0800
+++ 25-akpm/fs/ext3/namei.c	2004-02-19 19:20:00.000000000 -0800
@@ -1591,13 +1591,23 @@ static int ext3_delete_entry (handle_t *
  * do not perform it in these functions.  We perform it at the call site,
  * if it is needed.
  */
-static inline void ext3_inc_count(handle_t *handle, struct inode *inode)
+static void ext3_inc_count(handle_t *handle, struct inode *inode)
 {
+	if (!down_trylock(&inode->i_sem)) {
+		printk("%s: i_sem not held\n", __FUNCTION__);
+		dump_stack();
+		up(&inode->i_sem);
+	}
 	inode->i_nlink++;
 }
 
-static inline void ext3_dec_count(handle_t *handle, struct inode *inode)
+static void ext3_dec_count(handle_t *handle, struct inode *inode)
 {
+	if (!down_trylock(&inode->i_sem)) {
+		printk("%s: i_sem not held\n", __FUNCTION__);
+		dump_stack();
+		up(&inode->i_sem);
+	}
 	inode->i_nlink--;
 }
 
@@ -1712,6 +1722,11 @@ static int ext3_mkdir(struct inode * dir
 	inode->i_size = EXT3_I(inode)->i_disksize = inode->i_sb->s_blocksize;
 	dir_block = ext3_bread (handle, inode, 0, 1, &err);
 	if (!dir_block) {
+		if (!down_trylock(&inode->i_sem)) {
+			printk("%s: i_sem not held\n", __FUNCTION__);
+			dump_stack();
+			up(&inode->i_sem);
+		}
 		inode->i_nlink--; /* is this nlink == 0? */
 		ext3_mark_inode_dirty(handle, inode);
 		iput (inode);
@@ -2014,6 +2029,11 @@ static int ext3_rmdir (struct inode * di
 	ext3_orphan_add(handle, inode);
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	ext3_mark_inode_dirty(handle, inode);
+	if (!down_trylock(&dir->i_sem)) {
+		printk("%s: i_sem not held\n", __FUNCTION__);
+		dump_stack();
+		up(&dir->i_sem);
+	}
 	dir->i_nlink--;
 	ext3_update_dx_flag(dir);
 	ext3_mark_inode_dirty(handle, dir);
@@ -2064,6 +2084,11 @@ static int ext3_unlink(struct inode * di
 	dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	ext3_update_dx_flag(dir);
 	ext3_mark_inode_dirty(handle, dir);
+	if (!down_trylock(&inode->i_sem)) {
+		printk("%s: i_sem not held\n", __FUNCTION__);
+		dump_stack();
+		up(&inode->i_sem);
+	}
 	inode->i_nlink--;
 	if (!inode->i_nlink)
 		ext3_orphan_add(handle, inode);
@@ -2272,6 +2297,11 @@ static int ext3_rename (struct inode * o
 	}
 
 	if (new_inode) {
+		if (!down_trylock(&new_inode->i_sem)) {
+			printk("%s: i_sem not held\n", __FUNCTION__);
+			dump_stack();
+			up(&new_inode->i_sem);
+		}
 		new_inode->i_nlink--;
 		new_inode->i_ctime = CURRENT_TIME;
 	}
@@ -2283,10 +2313,25 @@ static int ext3_rename (struct inode * o
 		PARENT_INO(dir_bh->b_data) = le32_to_cpu(new_dir->i_ino);
 		BUFFER_TRACE(dir_bh, "call ext3_journal_dirty_metadata");
 		ext3_journal_dirty_metadata(handle, dir_bh);
+		if (!down_trylock(&old_dir->i_sem)) {
+			printk("%s: i_sem not held\n", __FUNCTION__);
+			dump_stack();
+			up(&old_dir->i_sem);
+		}
 		old_dir->i_nlink--;
 		if (new_inode) {
+			if (!down_trylock(&new_inode->i_sem)) {
+				printk("%s: i_sem not held\n", __FUNCTION__);
+				dump_stack();
+				up(&new_inode->i_sem);
+			}
 			new_inode->i_nlink--;
 		} else {
+			if (!down_trylock(&new_dir->i_sem)) {
+				printk("%s: i_sem not held\n", __FUNCTION__);
+				dump_stack();
+				up(&new_dir->i_sem);
+			}
 			new_dir->i_nlink++;
 			ext3_update_dx_flag(new_dir);
 			ext3_mark_inode_dirty(handle, new_dir);
diff -puN fs/ext3/super.c~ext3-i_nlink-check fs/ext3/super.c

_


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
       [not found]   ` <200402241338.57855.cotte@freenet.de>
@ 2004-02-24 22:55     ` Andrew Morton
  0 siblings, 0 replies; 47+ messages in thread
From: Andrew Morton @ 2004-02-24 22:55 UTC (permalink / raw)
  To: Carsten Otte; +Cc: cotte, sct, torvalds, linux-fsdevel, schwidefsky, cotte

Carsten Otte <cotte@freenet.de> wrote:
>
> Am Samstag 21 Februar 2004 01:43 schrieb Andrew Morton:
> > Could you chage it so we generate a backtrace in iput() if the inode has an
> > i_count of one (the dentry) and it is still on the orphan list?
>
> I did change iput() to the following:
> void iput(struct inode *inode)
> {
>         if (inode) {
>                 struct super_operations *op = inode->i_sb->s_op;
> 
>                 if (inode->i_state == I_CLEAR)
>                         BUG();
> 
>                 if (op && op->put_inode)
>                         op->put_inode(inode);
> 
>                 if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
>                         iput_final(inode);
>                 if (atomic_read(&inode->i_count) == 1)
>                         if (!strcmp(inode->i_sb->s_type->name, "ext3")) {
>                         struct ext3_inode_info *ext3_i;
>                                ext3_i = container_of(inode,
>                                                struct ext3_inode_info,
>                                                vfs_inode);
>                                if (!list_empty(&ext3_i->i_orphan))
>                                        WARN_ON(1);
>                         }
>         }
> }

atomic_dec_and_lock() decrements i_count, so this warning will be triggered
by people calling iput() with an orphan inode which has a refcount of 2,
not of 1.   Please change the

	if (atomic_read(&inode->i_count) == 1)

to

	if (atomic_read(&inode->i_count) == 0)

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
@ 2004-03-29 19:07 Martin Schwidefsky
  2004-03-29 20:11 ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Martin Schwidefsky @ 2004-03-29 19:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Carsten Otte, Carsten Otte, linux-fsdevel, sct, torvalds





> Sigh.  This doesn't help at all really.  All we know is that we have an
> inode which is still on the orphan list when it shouldn't be.  Later, the
> dentry was reaped and the inode gets moved to inode_unused, generating
this
> trace.   Later still, the inode is reaped off inode_unused and bang.
>
> Could you chage it so we generate a backtrace in iput() if the inode has
an
> i_count of one (the dentry) and it is still on the orphan list?

Finally I was able to recreate the crash with my inode-trace patch applied.
The machine is in a bad mood again, the first time I am happy to see it crash..

The inode-trace patch logs every change to i_count of every inode in vfs and
ext3. I can now present the life of a broken inode:

cpu0 0.000000 sec: <alloc_inode> <new_inode> <ext3_new_inode> <ext3_create>
cpu0 0.001145 sec: <sys_unlink> <system_call>
cpu0 0.001153 sec: <ext3_orphan_add> <ext3_unlink> <vfs_unlink> <sys_unlink>
cpu0 0.001164 sec: <iput> <sys_unlink> <system_call>
cpu5 0.001174 sec: <ext3_link> <vfs_link> <sys_link> <system_call>
cpu5 0.001186 sec: <iput> <dput> <path_release> <sys_link>
cpu4 1.060962 sec: <__iget> <sync_sb_inodes> <writeback_inodes> <background_writeout>
cpu4 1.060970 sec: <iput> <sync_sb_inodes> <writeback_inodes> <background_writeout>
cpu3 1.273330 sec: <iput> <prune_dcache> <shrink_dcache_memory> <shrink_slab>
cpu3 1.273331 sec: <iput_final> <prune_dcache> <shrink_dcache_memory> <shrink_slab>
cpu3 1.280405 sec: <destroy_inode> <prune_dcache> <prune_icache> <shrink_icache_memory>
cpu3 1.280405 sec: <ext3_destroy_inode> <destroy_inode> <dispose_list> <prune_icache>

What I find suspicious is the sys_unlink followed by ext3_orphan_add
on cpu 0 vs. ext3_link on cpu 5.

I hope this helps. If not what else can I trace?

blue skies,
   Martin

Debug-Patch:

Index: arch/s390/boot/kerntypes.c
===================================================================
RCS file: /home/cvs/linux-2.5/arch/s390/boot/kerntypes.c,v
retrieving revision 1.1
diff -u -r1.1 kerntypes.c
--- arch/s390/boot/kerntypes.c      17 Mar 2003 16:45:34 -0000    1.1
+++ arch/s390/boot/kerntypes.c      26 Mar 2004 12:50:23 -0000
@@ -51,3 +51,4 @@
  *    - struct runqueue
  */
 #include "kernel/sched.c"
+#include "arch/s390/kernel/debug.c"
Index: arch/s390/kernel/debug.c
===================================================================
RCS file: /home/cvs/linux-2.5/arch/s390/kernel/debug.c,v
retrieving revision 1.12
diff -u -r1.12 debug.c
--- arch/s390/kernel/debug.c  27 May 2003 11:33:55 -0000    1.12
+++ arch/s390/kernel/debug.c  26 Mar 2004 12:50:25 -0000
@@ -1198,3 +1198,21 @@
 EXPORT_SYMBOL(debug_sprintf_view);
 EXPORT_SYMBOL(debug_sprintf_exception);
 EXPORT_SYMBOL(debug_sprintf_event);
+
+DEFINE_PER_CPU(struct inode_log *, inode_logs);
+
+void INODE_DBF_init(void)
+{
+     struct inode_log *log;
+     int i;
+
+     log = __get_cpu_var(inode_logs);
+     for_each_cpu(i) {
+           log = (void *) __get_free_pages(GFP_KERNEL, 0);
+           log->index = 0;
+           log->entries = (void *) __get_free_pages(GFP_KERNEL, 8);
+           memset(log->entries, 0, 1024*1024);
+           per_cpu(inode_logs, i) = log;
+           printk("cpu %i log %p\n", i, log);
+     }
+}
Index: fs/inode.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/inode.c,v
retrieving revision 1.42
diff -u -r1.42 inode.c
--- fs/inode.c    11 Mar 2004 09:59:47 -0000    1.42
+++ fs/inode.c    26 Mar 2004 12:50:33 -0000
@@ -151,6 +151,7 @@
                  mapping->backing_dev_info = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
            memset(&inode->u, 0, sizeof(inode->u));
            inode->i_mapping = mapping;
+           INODE_DBF(inode, "allc");
      }
      return inode;
 }
@@ -159,6 +160,7 @@
 {
      if (inode_has_buffers(inode))
            BUG();
+     INODE_DBF(inode, "free");
      security_inode_free(inode);
      if (inode->i_sb->s_op->destroy_inode)
            inode->i_sb->s_op->destroy_inode(inode);
@@ -211,6 +213,8 @@
  */
 void __iget(struct inode * inode)
 {
+     INODE_DBF(inode,"iget");
+
      if (atomic_read(&inode->i_count)) {
            atomic_inc(&inode->i_count);
            return;
@@ -1074,6 +1073,8 @@
      struct super_operations *op = inode->i_sb->s_op;
      void (*drop)(struct inode *) = generic_drop_inode;

+     INODE_DBF(inode,"drop");
+
      if (op && op->drop_inode)
            drop = op->drop_inode;
      drop(inode);
@@ -1091,6 +1092,8 @@
      if (inode) {
            struct super_operations *op = inode->i_sb->s_op;

+           INODE_DBF(inode,"iput");
+
            if (inode->i_state == I_CLEAR)
                  BUG();

Index: fs/libfs.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/libfs.c,v
retrieving revision 1.25
diff -u -r1.25 libfs.c
--- fs/libfs.c    18 Feb 2004 17:44:57 -0000    1.25
+++ fs/libfs.c    26 Mar 2004 12:50:33 -0000
@@ -229,6 +229,7 @@

      inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
      inode->i_nlink++;
+     INODE_DBF(inode,"slnk");
      atomic_inc(&inode->i_count);
      dget(dentry);
      d_instantiate(dentry, inode);
Index: fs/namei.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/namei.c,v
retrieving revision 1.33
diff -u -r1.33 namei.c
--- fs/namei.c    11 Mar 2004 09:59:48 -0000    1.33
+++ fs/namei.c    26 Mar 2004 12:50:34 -0000
@@ -1726,8 +1726,10 @@
            if (nd.last.name[nd.last.len])
                  goto slashes;
            inode = dentry->d_inode;
-           if (inode)
+           if (inode) {
                  atomic_inc(&inode->i_count);
+                 INODE_DBF(inode,"ulnk");
+           }
            error = vfs_unlink(nd.dentry->d_inode, dentry);
      exit2:
            dput(dentry);
Index: fs/ext3/ialloc.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/ext3/ialloc.c,v
retrieving revision 1.22
diff -u -r1.22 ialloc.c
--- fs/ext3/ialloc.c    11 Mar 2004 09:59:50 -0000    1.22
+++ fs/ext3/ialloc.c    26 Mar 2004 12:50:35 -0000
@@ -117,6 +117,7 @@

      ino = inode->i_ino;
      ext3_debug ("freeing inode %lu\n", ino);
+     INODE_DBF(inode, "e3fr");

      /*
       * Note: we must free any quota before locking the superblock,
@@ -687,6 +688,9 @@
      }
 out:
      brelse(bitmap_bh);
+
+     INODE_DBF(inode,"oget");
+
      return inode;
 }

Index: fs/ext3/namei.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/ext3/namei.c,v
retrieving revision 1.25
diff -u -r1.25 namei.c
--- fs/ext3/namei.c     18 Feb 2004 17:44:58 -0000    1.25
+++ fs/ext3/namei.c     26 Mar 2004 12:50:35 -0000
@@ -1877,8 +1877,10 @@
       *
       * This is safe: on error we're going to ignore the orphan list
       * anyway on the next recovery. */
-     if (!err)
+     if (!err) {
            list_add(&EXT3_I(inode)->i_orphan, &EXT3_SB(sb)->s_orphan);
+           INODE_DBF(inode,"oadd");
+     }

      jbd_debug(4, "superblock will point to %ld\n", inode->i_ino);
      jbd_debug(4, "orphan inode %ld will point to %d\n",
@@ -1916,6 +1918,8 @@

      list_del_init(&ei->i_orphan);

+     INODE_DBF(inode,"odel");
+
      /* If we're on an error path, we may not have a valid
       * transaction handle with which to update the orphan list on
       * disk, but we still need to remove the inode from the linked
@@ -2146,7 +2150,10 @@

      inode->i_ctime = CURRENT_TIME;
      ext3_inc_count(handle, inode);
+
      atomic_inc(&inode->i_count);
+
+     INODE_DBF(inode,"link");

      err = ext3_add_nondir(handle, dentry, inode);
      ext3_journal_stop(handle);
Index: fs/ext3/super.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/ext3/super.c,v
retrieving revision 1.41
diff -u -r1.41 super.c
--- fs/ext3/super.c     11 Mar 2004 09:59:51 -0000    1.41
+++ fs/ext3/super.c     26 Mar 2004 12:50:35 -0000
@@ -449,6 +449,10 @@

 static void ext3_destroy_inode(struct inode *inode)
 {
+     if (!list_empty(&EXT3_I(inode)->i_orphan)) {
+           INODE_DBF(inode,"BUG!");
+           cpcmd("IPL 5E0A", NULL, NULL);
+     }
      kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
 }

Index: include/linux/fs.h
===================================================================
RCS file: /home/cvs/linux-2.5/include/linux/fs.h,v
retrieving revision 1.56
diff -u -r1.56 fs.h
--- include/linux/fs.h  11 Mar 2004 10:00:24 -0000    1.56
+++ include/linux/fs.h  26 Mar 2004 12:50:39 -0000
@@ -866,7 +866,6 @@
 struct super_operations {
      struct inode *(*alloc_inode)(struct super_block *sb);
      void (*destroy_inode)(struct inode *);
-     int  (*inode_busy)(struct inode *);

      void (*read_inode) (struct inode *);

@@ -1440,4 +1439,58 @@
 #endif     /* CONFIG_SECURITY */

 #endif /* __KERNEL__ */
+
+#include <asm/debug.h>
+
+struct inode_log_entry {
+     unsigned long long clock;
+     unsigned long event;
+     unsigned long inode;
+     unsigned long bc1;
+     unsigned long bc2;
+     unsigned long bc3;
+     unsigned long bc4;
+};
+
+struct inode_log {
+     struct inode_log_entry *entries;
+     unsigned int index;
+};
+
+DECLARE_PER_CPU(struct inode_log *, inode_logs);
+
+void __INODE_DBF(struct inode *inode, char *tag);
+
+static inline void INODE_DBF(struct inode *inode, char *tag)
+{
+     struct inode_log *log;
+     struct inode_log_entry *entry;
+     unsigned long bc;
+
+     log = get_cpu_var(inode_logs);
+     if (log) {
+           entry = log->entries + log->index;
+           log->index = (log->index + 1) & 0x7fff;
+           memset(&entry, sizeof(struct inode_log_entry), 0);
+           entry->clock = get_clock();
+           memcpy(&entry->event, tag, 4);
+           entry->inode = (unsigned long) inode;
+           asm volatile("lr %0,14" : "=d" (bc) );
+           entry->bc1 = bc;
+           asm volatile("lr %0,15" : "=d" (bc) );
+           bc = (*(unsigned long *) bc) & PSW_ADDR_INSN;
+           if (bc) {
+                 entry->bc2 = *(unsigned long *) (bc+56);
+                 bc = (*(unsigned long *) bc) & PSW_ADDR_INSN;
+                 if (bc) {
+                       entry->bc3 = *(unsigned long *) (bc+56);
+                       bc = (*(unsigned long *) bc) & PSW_ADDR_INSN;
+                       if (bc)
+                             entry->bc4 = *(unsigned long *) (bc+56);
+                 }
+           }
+     }
+     put_cpu_var(inode_logs);
+}
+
 #endif /* _LINUX_FS_H */
Index: init/main.c
===================================================================
RCS file: /home/cvs/linux-2.5/init/main.c,v
retrieving revision 1.44
diff -u -r1.44 main.c
--- init/main.c   11 Mar 2004 10:00:30 -0000    1.44
+++ init/main.c   26 Mar 2004 12:50:40 -0000
@@ -599,6 +599,7 @@
      do_pre_smp_initcalls();

      smp_init();
+     INODE_DBF_init();
      do_basic_setup();

      prepare_namespace();

Raw-Data for inode 0x0ba08d90:
bafd8d87 c850130a 616c6c63 0ba08d90 8008127a 80082060 800a34e8 800ab3ea : .....P..allc.......z.. `..4..... cpu0
bafd8d87 c897ae40 756c6e6b 0ba08d90 80077400 8001fc92 8001fc92 800495f8 : .......@ulnk......t............. cpu0
bafd8d87 c8982880 6f616464 0ba08d90 800abc88 800ac3b6 8007726e 800774d2 : ......(.oadd..............rn..t. cpu0
bafd8d87 c898d880 69707574 0ba08d90 0c2a1ec0 8007752c 8001fc92 80082bb8 : ........iput.....*....u,......+. cpu0
bafd8d87 c89982c8 6c696e6b 0ba08d90 800ac60a 800777ee 800779a6 8001fc92 : ........link..........w...y..... cpu5
bafd8d87 c89a3ac8 69707574 0ba08d90 0cb41de0 8007ef26 8007464a 80077930 : ......:.iput...........&..FJ..y0 cpu5
bafd8d88 cb563f45 69676574 0ba08d90 80089e32 80089e32 80089fc2 800495f8 : .....V?Eiget.......2...2........ cpu4
bafd8d88 cb56b405 69707574 0ba08d90 0f541ba0 80089e66 80089fc2 800495f8 : .....V..iput.....T.....f........ cpu4
bafd8d88 ff2f4280 69707574 0ba08d90 0f53fc08 8007f61e 8007fbf8 8004f5c4 : ...../B.iput.....S.............. cpu3
bafd8d88 ff2f44c0 64726f70 0ba08d90 80082cca 8007f61e 8007fbf8 8004f5c4 : ...../D.drop......,............. cpu3
bafd8d89 00e96907 66726565 0ba08d90 8008141a 80081930 80081d66 80081f50 : ......i.free...........0...f...P cpu3
bafd8d89 00e96a07 42554721 0ba08d90 0f53fb20 800814b8 80081930 80081d66 : ......j.BUG!.....S. .......0...f cpu3


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-03-29 19:07 Martin Schwidefsky
@ 2004-03-29 20:11 ` Linus Torvalds
  2004-03-29 20:29   ` Dave Kleikamp
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2004-03-29 20:11 UTC (permalink / raw)
  To: Martin Schwidefsky, Al Viro
  Cc: Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct

On Mon, 29 Mar 2004, Martin Schwidefsky wrote:
> 
> The inode-trace patch logs every change to i_count of every inode in vfs and
> ext3. I can now present the life of a broken inode:
> 
> cpu0 0.000000 sec: <alloc_inode> <new_inode> <ext3_new_inode> <ext3_create>
> cpu0 0.001145 sec: <sys_unlink> <system_call>
> cpu0 0.001153 sec: <ext3_orphan_add> <ext3_unlink> <vfs_unlink> <sys_unlink>
> cpu0 0.001164 sec: <iput> <sys_unlink> <system_call>
> cpu5 0.001174 sec: <ext3_link> <vfs_link> <sys_link> <system_call>
> cpu5 0.001186 sec: <iput> <dput> <path_release> <sys_link>
> cpu4 1.060962 sec: <__iget> <sync_sb_inodes> <writeback_inodes> <background_writeout>
> cpu4 1.060970 sec: <iput> <sync_sb_inodes> <writeback_inodes> <background_writeout>
> cpu3 1.273330 sec: <iput> <prune_dcache> <shrink_dcache_memory> <shrink_slab>
> cpu3 1.273331 sec: <iput_final> <prune_dcache> <shrink_dcache_memory> <shrink_slab>
> cpu3 1.280405 sec: <destroy_inode> <prune_dcache> <prune_icache> <shrink_icache_memory>
> cpu3 1.280405 sec: <ext3_destroy_inode> <destroy_inode> <dispose_list> <prune_icache>
> 
> What I find suspicious is the sys_unlink followed by ext3_orphan_add
> on cpu 0 vs. ext3_link on cpu 5.

I agree, it definitely looks like some strange race on link/unlink.

Those two are protected by the directory inode semaphore, so a link/unlink
of the same path should _not_ be able to race, but I wonder if we have a
inode linked to two different directories, and do not protect some _other_
data.

Hmm. grep, grep, grep.. Both vfs_unlink() and vfs_link() also take the
object inode semaphore (in addition to the directory inode semaphore), so
I _really_ don't see how we could race on anything here..

Is this trace all from the same inode?

It really does look like the ext3_link() has happened _after_ we should no
longer have been able to look up the inode. Now, for lookup, we normally
won't even _get_ the directory semaphore, since the dcache is separate, 
but that inode semaphore thing is interesting.

Al - should we not maybe re-check the "dentry" state of the source dentry 
of "vfs_link()". What happens if something like this is going on:

	CPU 0			CPU5

	unlink("A/file");
	gets semphore on A
				looks up "A/file" in dcache
				link("A/file", "B/file");
				gets semaphore on B

	gets semaphore on "file"
	unlinks file
	drops semaphore on "file"
				gets semaphore on (deleted) "file"
	drops semaphore on "A"
				does the link of a deleted file.

Al, where did I go wrong?

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-03-29 20:11 ` Linus Torvalds
@ 2004-03-29 20:29   ` Dave Kleikamp
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Kleikamp @ 2004-03-29 20:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Schwidefsky, Al Viro, Andrew Morton, Carsten Otte,
	Carsten Otte, fsdevel, sct

On Mon, 2004-03-29 at 14:11, Linus Torvalds wrote:
> What happens if something like this is going on:
> 
> 	CPU 0			CPU5
> 
> 	unlink("A/file");
> 	gets semphore on A
> 				looks up "A/file" in dcache
> 				link("A/file", "B/file");
> 				gets semaphore on B
> 
> 	gets semaphore on "file"
> 	unlinks file
> 	drops semaphore on "file"
> 				gets semaphore on (deleted) "file"
> 	drops semaphore on "A"
> 				does the link of a deleted file.

I saw this problem a while back and fixed it in jfs_link by not allowing
a link to succeed if i_link was zero.  I didn't realize at the time it
was a problem in ext3.  The jfs fix:
http://jfs.bkbits.net:8080/linux-2.5/cset@1.1380.5.2
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
@ 2004-03-30 11:57 Martin Schwidefsky
  2004-03-30 13:39 ` David Woodhouse
  2004-03-30 15:07 ` Linus Torvalds
  0 siblings, 2 replies; 47+ messages in thread
From: Martin Schwidefsky @ 2004-03-30 11:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Al Viro, Dave Kleikamp





Hi Linus,

> Is this trace all from the same inode?

Yes, its all for the same inode.

> I saw this problem a while back and fixed it in jfs_link by not allowing
> a link to succeed if i_link was zero.  I didn't realize at the time it
> was a problem in ext3.  The jfs fix:
> http://jfs.bkbits.net:8080/linux-2.5/cset@1.1380.5.2

I'm just testing the analog ext3 patch:

Index: namei.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/ext3/namei.c,v
retrieving revision 1.25
diff -u -r1.25 namei.c
--- namei.c 18 Feb 2004 17:44:58 -0000    1.25
+++ namei.c 30 Mar 2004 11:48:15 -0000
@@ -2136,6 +2136,9 @@
      if (inode->i_nlink >= EXT3_LINK_MAX)
            return -EMLINK;

+     if (inode->i_nlink == 0)
+           return -ENOENT;
+
      handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
                              EXT3_INDEX_EXTRA_TRANS_BLOCKS);
      if (IS_ERR(handle))

So far it works but the race is quite hard to trigger. I'll have to stress
the system for a few hours/days.

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-03-30 11:57 Martin Schwidefsky
@ 2004-03-30 13:39 ` David Woodhouse
  2004-03-30 14:16   ` Matthew Wilcox
  2004-03-30 15:51   ` Linus Torvalds
  2004-03-30 15:07 ` Linus Torvalds
  1 sibling, 2 replies; 47+ messages in thread
From: David Woodhouse @ 2004-03-30 13:39 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Linus Torvalds, Andrew Morton, Carsten Otte, Carsten Otte,
	linux-fsdevel, sct, Al Viro, Dave Kleikamp

On Tue, 2004-03-30 at 13:57 +0200, Martin Schwidefsky wrote:
> I'm just testing the analog ext3 patch:

You almost had me doing precisely the same thing in JFFS2 too... but I'd
rather do it like this instead.

If we ignore the anecdotes about reinstating a libc which has been
unlinked but is still in use -- is there any reason we're want any file
system to allow link() to inodes with i_nlink == 0?

===== fs/namei.c 1.93 vs edited =====
--- 1.93/fs/namei.c	Thu Mar 18 23:44:20 2004
+++ edited/fs/namei.c	Tue Mar 30 14:33:02 2004
@@ -1834,8 +1834,13 @@
 		return error;
 
 	down(&old_dentry->d_inode->i_sem);
-	DQUOT_INIT(dir);
-	error = dir->i_op->link(old_dentry, dir, new_dentry);
+	if (old_dentry->d_inode->i_nlink) {
+		DQUOT_INIT(dir);
+		error = dir->i_op->link(old_dentry, dir, new_dentry);
+	} else {
+		/* Someone deleted it while we were looking the other way */
+		error = -ENOENT;
+	}
 	up(&old_dentry->d_inode->i_sem);
 	if (!error) {
 		inode_dir_notify(dir, DN_CREATE);




-- 
dwmw2


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-03-30 13:39 ` David Woodhouse
@ 2004-03-30 14:16   ` Matthew Wilcox
  2004-03-30 15:51   ` Linus Torvalds
  1 sibling, 0 replies; 47+ messages in thread
From: Matthew Wilcox @ 2004-03-30 14:16 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, Carsten Otte,
	Carsten Otte, linux-fsdevel, sct, Al Viro, Dave Kleikamp

On Tue, Mar 30, 2004 at 02:39:30PM +0100, David Woodhouse wrote:
> If we ignore the anecdotes about reinstating a libc which has been
> unlinked but is still in use -- is there any reason we're want any file
> system to allow link() to inodes with i_nlink == 0?

To preserve the anecdotes, we could check i_count as well, but this starts
to get hairy.  If iput_final() races with vfs_link() we'd have problems.
Should iput_final() be taking i_sem anyway?  Seems to me we have the race:

CPU0					CPU1

iput()
atomic_dec_and_lock(&inode->i_count, &inode_lock)
					sys_unlink()
					atomic_inc(&inode->i_count);
					vfs_unlink()
					inode->i_nlink-- (reaches 0)
iput_final()
generic_drop_inode()
					
if (!inode->i_nlink)
        generic_delete_inode(inode);
else
        generic_forget_inode(inode);
					iput()
					atomic_dec_and_lock()
					generic_delete_inode()

Calling generic_delete_inode() twice gives us a BUG() in clear_inode()
(if we even get that far).  Have I overlooked something here?	

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-03-30 11:57 Martin Schwidefsky
  2004-03-30 13:39 ` David Woodhouse
@ 2004-03-30 15:07 ` Linus Torvalds
  2004-04-02 16:14   ` viro
  1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2004-03-30 15:07 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Al Viro, Dave Kleikamp



On Tue, 30 Mar 2004, Martin Schwidefsky wrote:
> 
> I'm just testing the analog ext3 patch:

I think this should be fixable in the VFS layer. In particular, the VFS 
layer could check that the dentry is still unhashed after having gotten 
the inode semaphore, or something like that. Al?

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
@ 2004-03-30 15:13 Martin Schwidefsky
  0 siblings, 0 replies; 47+ messages in thread
From: Martin Schwidefsky @ 2004-03-30 15:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Al Viro, Dave Kleikamp, David Woodhouse





> I'm just testing the analog ext3 patch:
>
> Index: namei.c
> ===================================================================
> RCS file: /home/cvs/linux-2.5/fs/ext3/namei.c,v
> retrieving revision 1.25
> diff -u -r1.25 namei.c
> --- namei.c     18 Feb 2004 17:44:58 -0000    1.25
> +++ namei.c     30 Mar 2004 11:48:15 -0000
> @@ -2136,6 +2136,9 @@
>     if (inode->i_nlink >= EXT3_LINK_MAX)
>           return -EMLINK;
>
> +   if (inode->i_nlink == 0)
> +         return -ENOENT;
> +
>     handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
>                             EXT3_INDEX_EXTRA_TRANS_BLOCKS);
>     if (IS_ERR(handle))
>
> So far it works but the race is quite hard to trigger. I'll have to stress
> the system for a few hours/days.

Ok, I compiled a lot of glibcs today which is the workload that I use to
trigger the problem. With the patch the problem disappears. Just to make
sure I removed the patch again and sure enough the machine crashed after
1/2 hour. This patch solves the problem for me. Davids patch for fs/namei.c
seems to me the generic solution for all filesystems.

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-03-30 13:39 ` David Woodhouse
  2004-03-30 14:16   ` Matthew Wilcox
@ 2004-03-30 15:51   ` Linus Torvalds
  2004-04-02 16:12     ` viro
  1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2004-03-30 15:51 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Martin Schwidefsky, Andrew Morton, Carsten Otte, Carsten Otte,
	linux-fsdevel, sct, Al Viro, Dave Kleikamp

On Tue, 30 Mar 2004, David Woodhouse wrote:
> 
> If we ignore the anecdotes about reinstating a libc which has been
> unlinked but is still in use -- is there any reason we're want any file
> system to allow link() to inodes with i_nlink == 0?

Other (non-UNIX) filesystems might not care about i_nlink, so we generally 
shouldn't touch it in the VFS layer. 

However, another thing strikes me: the reason the inode is gone may not be 
due to a unlink(), it might be _another_ inode being renamed on top of the 
old source inode.

In which case returning ENOENT is _wrong_, since "rename()" is supposed to 
act atomically (ie when we do the link(), we should get the old _or_ the 
new inode, but at no point should we get ENOENT).

This cannot be sanely fixed at a low-level filesystem layer. It's a VFS 
locking deficiency. Possibly solved by having a loop (ie "retry the source 
path lookup if the source dentry is unhashed by the time we have the inode 
lock"). Being very careful indeed about /proc.

Or we should just get the source directory lock, like in rename(). 

Al, we need your bigger brain!

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-03-30 15:51   ` Linus Torvalds
@ 2004-04-02 16:12     ` viro
  2004-04-02 18:01       ` viro
  0 siblings, 1 reply; 47+ messages in thread
From: viro @ 2004-04-02 16:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Martin Schwidefsky, Andrew Morton, Carsten Otte,
	Carsten Otte, linux-fsdevel, sct, Dave Kleikamp

On Tue, Mar 30, 2004 at 07:51:28AM -0800, Linus Torvalds wrote:
> > If we ignore the anecdotes about reinstating a libc which has been
> > unlinked but is still in use -- is there any reason we're want any file
> > system to allow link() to inodes with i_nlink == 0?
> 
> Other (non-UNIX) filesystems might not care about i_nlink, so we generally 
> shouldn't touch it in the VFS layer. 
> 
> However, another thing strikes me: the reason the inode is gone may not be 
> due to a unlink(), it might be _another_ inode being renamed on top of the 
> old source inode.
> 
> In which case returning ENOENT is _wrong_, since "rename()" is supposed to 
> act atomically (ie when we do the link(), we should get the old _or_ the 
> new inode, but at no point should we get ENOENT).
> 
> This cannot be sanely fixed at a low-level filesystem layer. It's a VFS 
> locking deficiency. Possibly solved by having a loop (ie "retry the source 
> path lookup if the source dentry is unhashed by the time we have the inode 
> lock"). Being very careful indeed about /proc.
> 
> Or we should just get the source directory lock, like in rename(). 
> 
> Al, we need your bigger brain!

We should get source directory lock to deal with that.  I'll take a look
at that (we need to deal carefully with ordering there); expect a patch
in an hour or so...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-03-30 15:07 ` Linus Torvalds
@ 2004-04-02 16:14   ` viro
  0 siblings, 0 replies; 47+ messages in thread
From: viro @ 2004-04-02 16:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Schwidefsky, Andrew Morton, Carsten Otte, Carsten Otte,
	linux-fsdevel, sct, Dave Kleikamp

On Tue, Mar 30, 2004 at 07:07:22AM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 30 Mar 2004, Martin Schwidefsky wrote:
> > 
> > I'm just testing the analog ext3 patch:
> 
> I think this should be fixable in the VFS layer. In particular, the VFS 
> layer could check that the dentry is still unhashed after having gotten 
> the inode semaphore, or something like that. Al?

We could check ->i_nlink there after getting said semaphore; the question
is what to do if it's zero.  IOW, see your own argument re rename().

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 16:12     ` viro
@ 2004-04-02 18:01       ` viro
  2004-04-02 18:52         ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: viro @ 2004-04-02 18:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Martin Schwidefsky, Andrew Morton, Carsten Otte,
	Carsten Otte, linux-fsdevel, sct, Dave Kleikamp

On Fri, Apr 02, 2004 at 05:12:23PM +0100, viro@parcelfarce.linux.theplanet.co.uk wrote:

> We should get source directory lock to deal with that.  I'll take a look
> at that (we need to deal carefully with ordering there); expect a patch
> in an hour or so...

Argh...  It's nastier than I thought - the corner case that can bite us is

mount --bind path/foo bar
ln bar baz

What should happen here?  Current code would make baz a link to path/foo.
Which is fine and should cause no problems, except that we have no idea
what the parent is.  Ignoring the mountpoint and linking to underlying
bar is clearly bogus.  Failing with -EXDEV might be more or less OK,
though...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 18:01       ` viro
@ 2004-04-02 18:52         ` Linus Torvalds
  2004-04-02 19:02           ` Linus Torvalds
                             ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Linus Torvalds @ 2004-04-02 18:52 UTC (permalink / raw)
  To: viro
  Cc: David Woodhouse, Martin Schwidefsky, Andrew Morton, Carsten Otte,
	Carsten Otte, linux-fsdevel, sct, Dave Kleikamp



On Fri, 2 Apr 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> 
> Argh...  It's nastier than I thought - the corner case that can bite us
> is
> 
> mount --bind path/foo bar ln bar baz
> 
> What should happen here?  Current code would make baz a link to
> path/foo. Which is fine and should cause no problems, except that we
> have no idea what the parent is.  Ignoring the mountpoint and linking to
> underlying bar is clearly bogus.  Failing with -EXDEV might be more or
> less OK, though...

I think that -EXDEV is not only OK, it's _required_. It may be the same 
physical filesystem, but it's a different mount and thus a different 
filesystem.

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 18:52         ` Linus Torvalds
@ 2004-04-02 19:02           ` Linus Torvalds
  2004-04-02 19:10             ` viro
  2004-04-02 19:07           ` viro
  2004-04-02 19:17           ` Jamie Lokier
  2 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2004-04-02 19:02 UTC (permalink / raw)
  To: viro
  Cc: David Woodhouse, Martin Schwidefsky, Andrew Morton, Carsten Otte,
	Carsten Otte, linux-fsdevel, sct, Dave Kleikamp



On Fri, 2 Apr 2004, Linus Torvalds wrote:
> 
> I think that -EXDEV is not only OK, it's _required_. It may be the same 
> physical filesystem, but it's a different mount and thus a different 
> filesystem.

Actually, we already do this in "sys_link()" and "do_rename()":

sys_link:

        error = -EXDEV;
        if (old_nd.mnt != nd.mnt)
                goto out_release;
        new_dentry = lookup_create(&nd, 0);

and do_rename:

	error = -EXDEV;
        if (oldnd.mnt != newnd.mnt)
                goto exit2;


so this case should already be ok, no?

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 18:52         ` Linus Torvalds
  2004-04-02 19:02           ` Linus Torvalds
@ 2004-04-02 19:07           ` viro
  2004-04-02 20:23             ` viro
  2004-04-02 19:17           ` Jamie Lokier
  2 siblings, 1 reply; 47+ messages in thread
From: viro @ 2004-04-02 19:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Martin Schwidefsky, Andrew Morton, Carsten Otte,
	Carsten Otte, linux-fsdevel, sct, Dave Kleikamp

On Fri, Apr 02, 2004 at 10:52:09AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Apr 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> > 
> > Argh...  It's nastier than I thought - the corner case that can bite us
> > is
> > 
> > mount --bind path/foo bar ln bar baz
> > 
> > What should happen here?  Current code would make baz a link to
> > path/foo. Which is fine and should cause no problems, except that we
> > have no idea what the parent is.  Ignoring the mountpoint and linking to
> > underlying bar is clearly bogus.  Failing with -EXDEV might be more or
> > less OK, though...
> 
> I think that -EXDEV is not only OK, it's _required_. It may be the same 
> physical filesystem, but it's a different mount and thus a different 
> filesystem.

Yes - as long as we don't allow links to directories.  If we do, that's
much, much worse mess anyway.

Error values for link(2) are interesting, to put it mildly.  E.g. according
to Linux manpages a dangling symlink as a target must give -ENOENT; however,
link(2) does _NOT_ follow symlinks.  Since current tree doesn't agree with
that idiocy (we simply create a link to dangling symlink), I'd say "screw
the manpages".  Especially since SuS for once does not mandate that idiocy...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 19:02           ` Linus Torvalds
@ 2004-04-02 19:10             ` viro
  0 siblings, 0 replies; 47+ messages in thread
From: viro @ 2004-04-02 19:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Martin Schwidefsky, Andrew Morton, Carsten Otte,
	Carsten Otte, linux-fsdevel, sct, Dave Kleikamp

On Fri, Apr 02, 2004 at 11:02:24AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Apr 2004, Linus Torvalds wrote:
> > 
> > I think that -EXDEV is not only OK, it's _required_. It may be the same 
> > physical filesystem, but it's a different mount and thus a different 
> > filesystem.
> 
> Actually, we already do this in "sys_link()" and "do_rename()":
> 
> sys_link:
> 
>         error = -EXDEV;
>         if (old_nd.mnt != nd.mnt)
>                 goto out_release;
>         new_dentry = lookup_create(&nd, 0);
 
> so this case should already be ok, no?

This case is OK now, but we'll need to do explicit lookup_mnt() since we are
switching to LOOKUP_PARENT for old_mnt.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 18:52         ` Linus Torvalds
  2004-04-02 19:02           ` Linus Torvalds
  2004-04-02 19:07           ` viro
@ 2004-04-02 19:17           ` Jamie Lokier
  2004-04-02 19:25             ` viro
  2004-04-02 19:32             ` Linus Torvalds
  2 siblings, 2 replies; 47+ messages in thread
From: Jamie Lokier @ 2004-04-02 19:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: viro, David Woodhouse, Martin Schwidefsky, Andrew Morton,
	Carsten Otte, Carsten Otte, linux-fsdevel, sct, Dave Kleikamp

Linus Torvalds wrote:
> I think that -EXDEV is not only OK, it's _required_. It may be the same 
> physical filesystem, but it's a different mount and thus a different 
> filesystem.

That reminds me.  How is userspace supposed to determine when two
names are equivalent on different bind-mounted filesystems?  stat()
returns the same value in st_dev for both names.

The usual strategy for detecting equivalent files is to check
st_nlink, and if that's not 1 compare (st_dev, st_ino).  It doesn't
work for bind mounts, which is a very non-POSIX property.

I think the useful thing to do is return a different st_dev value for
each bind mount.

-- Jamie

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 19:17           ` Jamie Lokier
@ 2004-04-02 19:25             ` viro
  2004-04-02 19:32             ` Linus Torvalds
  1 sibling, 0 replies; 47+ messages in thread
From: viro @ 2004-04-02 19:25 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Dave Kleikamp

On Fri, Apr 02, 2004 at 08:17:52PM +0100, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > I think that -EXDEV is not only OK, it's _required_. It may be the same 
> > physical filesystem, but it's a different mount and thus a different 
> > filesystem.
> 
> That reminds me.  How is userspace supposed to determine when two
> names are equivalent on different bind-mounted filesystems?  stat()
> returns the same value in st_dev for both names.
> 
> The usual strategy for detecting equivalent files is to check
> st_nlink, and if that's not 1 compare (st_dev, st_ino).  It doesn't
> work for bind mounts, which is a very non-POSIX property.
> 
> I think the useful thing to do is return a different st_dev value for
> each bind mount.

Huh???  You know, st_dev has a meaning - it's a device number of underlying
device.  Making it random is an interesting idea...

Besides, files are equivalent - try to e.g. open one of them, write there
and then take a look at another.  What exactly are you trying to check?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 19:17           ` Jamie Lokier
  2004-04-02 19:25             ` viro
@ 2004-04-02 19:32             ` Linus Torvalds
  2004-04-02 19:37               ` viro
  2004-04-02 20:40               ` Jamie Lokier
  1 sibling, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2004-04-02 19:32 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: viro, David Woodhouse, Martin Schwidefsky, Andrew Morton,
	Carsten Otte, Carsten Otte, linux-fsdevel, sct, Dave Kleikamp

On Fri, 2 Apr 2004, Jamie Lokier wrote:
> 
> I think the useful thing to do is return a different st_dev value for
> each bind mount.

Naah. You _want_ user space to see that they are the same file, and then 
the algorithm should be: "open+open+fstat+fstat+cmp st_dev/st_ino".

But I agree that it might be good to also have a way to enquire about
mount information. One logical place for that might be "fstatfs()".  
We've got a few spare bytes there, so it wouldn't be impossible to do.

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 19:32             ` Linus Torvalds
@ 2004-04-02 19:37               ` viro
  2004-04-02 19:45                 ` Linus Torvalds
  2004-04-02 20:40               ` Jamie Lokier
  1 sibling, 1 reply; 47+ messages in thread
From: viro @ 2004-04-02 19:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, David Woodhouse, Martin Schwidefsky, Andrew Morton,
	Carsten Otte, Carsten Otte, linux-fsdevel, sct, Dave Kleikamp

On Fri, Apr 02, 2004 at 11:32:26AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Apr 2004, Jamie Lokier wrote:
> > 
> > I think the useful thing to do is return a different st_dev value for
> > each bind mount.
> 
> Naah. You _want_ user space to see that they are the same file, and then 
> the algorithm should be: "open+open+fstat+fstat+cmp st_dev/st_ino".
> 
> But I agree that it might be good to also have a way to enquire about
> mount information. One logical place for that might be "fstatfs()".  
> We've got a few spare bytes there, so it wouldn't be impossible to do.

I'm not sure what we want to expose there, though.  Address of vfsmount
would certainly be unique, but...   Generation number would be safer,
but we will need to be careful about uniqueness of that animal.  How much
free space do we have there?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 19:37               ` viro
@ 2004-04-02 19:45                 ` Linus Torvalds
  2004-04-02 20:08                   ` viro
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2004-04-02 19:45 UTC (permalink / raw)
  To: viro
  Cc: Jamie Lokier, David Woodhouse, Martin Schwidefsky, Andrew Morton,
	Carsten Otte, Carsten Otte, linux-fsdevel, sct, Dave Kleikamp

On Fri, 2 Apr 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> > 
> > But I agree that it might be good to also have a way to enquire about
> > mount information. One logical place for that might be "fstatfs()".  
> > We've got a few spare bytes there, so it wouldn't be impossible to do.
> 
> I'm not sure what we want to expose there, though.  Address of vfsmount
> would certainly be unique, but...   Generation number would be safer,
> but we will need to be careful about uniqueness of that animal.  How much
> free space do we have there?

The fields are architecture-specific, since we try to be compatible with 
other unixes. On x86 I think we have something like five 32-bit words free 
for future expansion. Whether this is important enough to merit taking one 
of them as a sequence number, I don't know.

I don't really know who would care. I could imagine somebody wanting to
know before-hand if a rename or hardlink will fail, but on the other hand
they could just try it, and react to the failure appropriately. A backup 
application (or file manager) should only care that two files are the 
same, which they can already see. Dunno.

So it's a question of a piece of real information that we don't export to 
user space, but whether that information is worth it, especially as it 
would be a Linux-only thing, I don't know.

Who else does bind mounts, and do they have any interfaces to show that 
information?

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 19:45                 ` Linus Torvalds
@ 2004-04-02 20:08                   ` viro
  0 siblings, 0 replies; 47+ messages in thread
From: viro @ 2004-04-02 20:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, David Woodhouse, Martin Schwidefsky, Andrew Morton,
	Carsten Otte, Carsten Otte, linux-fsdevel, sct, Dave Kleikamp

On Fri, Apr 02, 2004 at 11:45:01AM -0800, Linus Torvalds wrote:
> So it's a question of a piece of real information that we don't export to 
> user space, but whether that information is worth it, especially as it 
> would be a Linux-only thing, I don't know.
> 
> Who else does bind mounts, and do they have any interfaces to show that 
> information?

AFAIK there's no direct analogs of that in Plan 9.  You can tell if two files
are on different filesystems, you can tell if two files are different and
you can tell if <pathname> is a mountpoint.  That's it, though.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 19:07           ` viro
@ 2004-04-02 20:23             ` viro
  2004-04-02 22:40               ` Trond Myklebust
  0 siblings, 1 reply; 47+ messages in thread
From: viro @ 2004-04-02 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Martin Schwidefsky, Andrew Morton, Carsten Otte,
	Carsten Otte, linux-fsdevel, sct, Dave Kleikamp, Neil Brown,
	Trond Myklebust

	OK...  I've got the part that deals with sys_link().  However,
we have another user of vfs_link() and it can be nastier.

	Unlike rename(), NFS link() passes fhandle of only one parent - the
object we are linking to is passed as fhandle.  So figuring out what should
be locked is going to be fun here.  Can we just say "it's stale" if link
target has zero ->i_nlink there?  If not for NFSv4 I'd just go ahead and do
that (cross-client atomicity warranties of rename() on NFS - not funny)...

	Any NFS people here?  Trond, Neil?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 19:32             ` Linus Torvalds
  2004-04-02 19:37               ` viro
@ 2004-04-02 20:40               ` Jamie Lokier
  2004-04-02 20:59                 ` Christoph Hellwig
  2004-04-02 21:08                 ` viro
  1 sibling, 2 replies; 47+ messages in thread
From: Jamie Lokier @ 2004-04-02 20:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: viro, David Woodhouse, Martin Schwidefsky, Andrew Morton,
	Carsten Otte, Carsten Otte, linux-fsdevel, sct, Dave Kleikamp

Linus Torvalds wrote:
> Naah. You _want_ user space to see that they are the same file, and then 
> the algorithm should be: "open+open+fstat+fstat+cmp st_dev/st_ino".

The trouble with that is the many programs which assume that if
st_nlink == 1, then the file has only one path.  This is a critical
optimisation for any program which looks at a lot of files checking
for equivalent files, and is widely assumed.  I've always thought it a
reliable basic unix assumption.

For example: rsync -H, cp -a, and Emacs backup-by-copying-when-linked.
You may argue that using "rsync -H" or "cp -a" on a tree that contains
bind mounts is broken by design.

Then there are occasions when you want to traverse a tree, but not
cross mounts.  Programs do that by checking whether the st_dev field
returned by stat() or fstat() on a directory is different from its parent.

For example: find -xdev, tar --one-file-system, cp -x.

You may argue that bind mounts shouldn't count for the purpose of
--one-file-system, but there should be some reasonable a way for
programs to recognise the bind mount topology, to offer the behaviour
of --one-file-system-not-crossing-bind-mounts.

Regular files can be bind-mounted.  So even if bind mounts did change
st_dev, programs which check st_dev when entering a directory wouldn't
recognise bind mounted files.

Then there are programs such as optimised Make and cacheing systems
and servers which use dnotify.  dnotify is reliable for single-linked
files, and when there are multiple links, it's still reliable if you
discover all the st_nlink paths to a file.  However, it isn't reliable
if there are any bind mounts, because you don't know whether you have
all paths to a file (without grubbing inside /proc/mounts, and that
has race conditions anyway).

The obvious strategy for such programs is to ignore bind mounts and
give incorrect results if there are any.  However it would be much
better if they could detect when a file has multiple paths that aren't
mentioned in st_nlink, so they wouldn't depend on dnotify in such cases.

> But I agree that it might be good to also have a way to enquire about
> mount information. One logical place for that might be "fstatfs()".  
> We've got a few spare bytes there, so it wouldn't be impossible to do.

Sounds like a good idea.  I appreciate Al's point about st_dev having
an actual real meaning.  I was thinking that with 64-bit dev_t, it
might be ok to reserve some bits of that to distinguish bind mounts,
so that a program can still get the underlying device id if it wants that.

There's a precedent for different views of a filesystem having
different st_dev values.  Think about loopback NFS: possibly multiple
NFS filesystems, and a real one, all referring to the same set of
files, and _all_ of them have different st_dev values.  Semantically a
bind mount does not seem so different from that.

-- Jamie

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 20:40               ` Jamie Lokier
@ 2004-04-02 20:59                 ` Christoph Hellwig
  2004-04-02 21:09                   ` viro
  2004-04-02 23:42                   ` Jamie Lokier
  2004-04-02 21:08                 ` viro
  1 sibling, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2004-04-02 20:59 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, viro, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Dave Kleikamp

On Fri, Apr 02, 2004 at 09:40:57PM +0100, Jamie Lokier wrote:
> Then there are programs such as optimised Make and cacheing systems
> and servers which use dnotify.  dnotify is reliable for single-linked
> files, and when there are multiple links, it's still reliable if you
> discover all the st_nlink paths to a file.  However, it isn't reliable
> if there are any bind mounts, because you don't know whether you have
> all paths to a file (without grubbing inside /proc/mounts, and that
> has race conditions anyway).

bind mounts aside how are these programs supposed to work with multiple
mounts of the same filesystem at different mountpoints, possibly in
totally different namespaces?  In short everything using dnotify is buggy
anyway..


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 20:40               ` Jamie Lokier
  2004-04-02 20:59                 ` Christoph Hellwig
@ 2004-04-02 21:08                 ` viro
  2004-04-03  0:39                   ` Jamie Lokier
  2004-04-05 14:07                   ` Stephen C. Tweedie
  1 sibling, 2 replies; 47+ messages in thread
From: viro @ 2004-04-02 21:08 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Dave Kleikamp

On Fri, Apr 02, 2004 at 09:40:57PM +0100, Jamie Lokier wrote:
> The trouble with that is the many programs which assume that if
> st_nlink == 1, then the file has only one path.  This is a critical
> optimisation for any program which looks at a lot of files checking
> for equivalent files, and is widely assumed.  I've always thought it a
> reliable basic unix assumption.

Even aside of (in)accuracy of ->st_nlink, that assumption is obviously
racy.

> For example: rsync -H, cp -a, and Emacs backup-by-copying-when-linked.
> You may argue that using "rsync -H" or "cp -a" on a tree that contains
> bind mounts is broken by design.

What would you expect from cp -a in that case?  To create a link?  That
would be a bug, plain and simple...

> Then there are programs such as optimised Make and cacheing systems
> and servers which use dnotify.  dnotify is reliable for single-linked
> files, and when there are multiple links, it's still reliable if you
> discover all the st_nlink paths to a file.  However, it isn't reliable
> if there are any bind mounts, because you don't know whether you have
> all paths to a file (without grubbing inside /proc/mounts, and that
> has race conditions anyway).

dnotify is not reliable, period.  We do NOT generate any events on the
parent of target; only on parent of new link.  So "discover all the st_nlink
paths" is both racy and unreliable - you won't get notified when links are
created.

> The obvious strategy for such programs is to ignore bind mounts and
> give incorrect results if there are any.  However it would be much
> better if they could detect when a file has multiple paths that aren't
> mentioned in st_nlink, so they wouldn't depend on dnotify in such cases.

How would they detect when such paths appear in the future?  For that matter,
who said that you can see all paths in question?

> There's a precedent for different views of a filesystem having
> different st_dev values.  Think about loopback NFS: possibly multiple
> NFS filesystems, and a real one, all referring to the same set of
> files, and _all_ of them have different st_dev values.  Semantically a
> bind mount does not seem so different from that.

Except that bind mount is not different from "normal" one - same situation
as with hard links.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 20:59                 ` Christoph Hellwig
@ 2004-04-02 21:09                   ` viro
  2004-04-02 23:42                   ` Jamie Lokier
  1 sibling, 0 replies; 47+ messages in thread
From: viro @ 2004-04-02 21:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jamie Lokier, Linus Torvalds, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Dave Kleikamp

On Fri, Apr 02, 2004 at 09:59:35PM +0100, Christoph Hellwig wrote:
> On Fri, Apr 02, 2004 at 09:40:57PM +0100, Jamie Lokier wrote:
> > Then there are programs such as optimised Make and cacheing systems
> > and servers which use dnotify.  dnotify is reliable for single-linked
> > files, and when there are multiple links, it's still reliable if you
> > discover all the st_nlink paths to a file.  However, it isn't reliable
> > if there are any bind mounts, because you don't know whether you have
> > all paths to a file (without grubbing inside /proc/mounts, and that
> > has race conditions anyway).
> 
> bind mounts aside how are these programs supposed to work with multiple
> mounts of the same filesystem at different mountpoints, possibly in
> totally different namespaces?

Or from a chroot jail, for that matter - with neither bindings nor namespaces
involved...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 20:23             ` viro
@ 2004-04-02 22:40               ` Trond Myklebust
  2004-04-02 23:06                 ` viro
  2004-04-02 23:19                 ` Trond Myklebust
  0 siblings, 2 replies; 47+ messages in thread
From: Trond Myklebust @ 2004-04-02 22:40 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Dave Kleikamp, Neil Brown

On Fri, 2004-04-02 at 15:23, viro@parcelfarce.linux.theplanet.co.uk
wrote:
> 	OK...  I've got the part that deals with sys_link().  However,
> we have another user of vfs_link() and it can be nastier.
> 
> 	Unlike rename(), NFS link() passes fhandle of only one parent - the
> object we are linking to is passed as fhandle.  So figuring out what should
> be locked is going to be fun here.  Can we just say "it's stale" if link
> target has zero ->i_nlink there?  If not for NFSv4 I'd just go ahead and do
> that (cross-client atomicity warranties of rename() on NFS - not funny)...

Wait a moment... I missed the beginning of this thread.

Like all attributes, i_nlink is *not* a well defined quantity in NFS, so
anything that depends upon its value had better be pretty damn sure that
what it is doing will have no fatal consequences for the user. What you
are proposing sounds extremely race-prone. Exactly why is it necessary?

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 22:40               ` Trond Myklebust
@ 2004-04-02 23:06                 ` viro
  2004-04-02 23:23                   ` Trond Myklebust
  2004-04-02 23:19                 ` Trond Myklebust
  1 sibling, 1 reply; 47+ messages in thread
From: viro @ 2004-04-02 23:06 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Linus Torvalds, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Dave Kleikamp, Neil Brown

On Fri, Apr 02, 2004 at 05:40:14PM -0500, Trond Myklebust wrote:
> On Fri, 2004-04-02 at 15:23, viro@parcelfarce.linux.theplanet.co.uk
> wrote:
> > 	OK...  I've got the part that deals with sys_link().  However,
> > we have another user of vfs_link() and it can be nastier.
> > 
> > 	Unlike rename(), NFS link() passes fhandle of only one parent - the
> > object we are linking to is passed as fhandle.  So figuring out what should
> > be locked is going to be fun here.  Can we just say "it's stale" if link
> > target has zero ->i_nlink there?  If not for NFSv4 I'd just go ahead and do
> > that (cross-client atomicity warranties of rename() on NFS - not funny)...
> 
> Wait a moment... I missed the beginning of this thread.
> 
> Like all attributes, i_nlink is *not* a well defined quantity in NFS, so
> anything that depends upon its value had better be pretty damn sure that
> what it is doing will have no fatal consequences for the user. What you
> are proposing sounds extremely race-prone. Exactly why is it necessary?

Umm...   We are talking about server, not client.  The race in question is
link(foo, bar) vs. rename(splat, foo); - we can get lookup in link(2) complete
and yielding old foo just before rename(2) kills foo().  As it is, link(2)
will in effect try to resurrect the sucker.  Which doesn't end up well for
some filesystems.

Again, we are talking about server - function in question is nfsd_link().

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 22:40               ` Trond Myklebust
  2004-04-02 23:06                 ` viro
@ 2004-04-02 23:19                 ` Trond Myklebust
  1 sibling, 0 replies; 47+ messages in thread
From: Trond Myklebust @ 2004-04-02 23:19 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Dave Kleikamp, Neil Brown

On Fri, 2004-04-02 at 17:40, Trond Myklebust wrote:
> Like all attributes, i_nlink is *not* a well defined quantity in NFS, so
> anything that depends upon its value had better be pretty damn sure that
> what it is doing will have no fatal consequences for the user. What you
> are proposing sounds extremely race-prone. Exactly why is it necessary?

Oh right... You are talking about the server side. Yeah: ESTALE sounds
like the correct thing to do there...

I'm not sure what you meant w.r.t. your reference to NFSv4, but I assume
you were thinking about atomicity in the case where the user does a
lookup + link in the same COMPOUND? If so, please note that only the
individual ops are guaranteed to be atomic, but the COMPOUND that
contains them is not.
IOW: NFSv4 expects the client to do something like
putfh(fhandle_foo)+savefh()+putfh(fhandle_parent_dir)+link("bar") in a
single COMPOUND. However the fact that the putfh()s succeeded and
"validated" the filehandles does not guarantee that the link() won't
fail with an ESTALE due to either "foo" or "parent_dir" having been
deleted from underneath us.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 23:06                 ` viro
@ 2004-04-02 23:23                   ` Trond Myklebust
  2004-04-03  0:53                     ` Neil Brown
  0 siblings, 1 reply; 47+ messages in thread
From: Trond Myklebust @ 2004-04-02 23:23 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Dave Kleikamp, Neil Brown

On Fri, 2004-04-02 at 18:06, viro@parcelfarce.linux.theplanet.co.uk
wrote:
> Umm...   We are talking about server, not client.

Yeah, I noticed only after I had gone though the rest of the thread.
Neil might have some comments, but I believe that what I answered in my
other reply (that ESTALE is acceptable) is correct as far as the NFS
specs go.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 20:59                 ` Christoph Hellwig
  2004-04-02 21:09                   ` viro
@ 2004-04-02 23:42                   ` Jamie Lokier
  1 sibling, 0 replies; 47+ messages in thread
From: Jamie Lokier @ 2004-04-02 23:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, viro, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Dave Kleikamp

Christoph Hellwig wrote:
> On Fri, Apr 02, 2004 at 09:40:57PM +0100, Jamie Lokier wrote:
> > Then there are programs such as optimised Make and cacheing systems
> > and servers which use dnotify.  dnotify is reliable for single-linked
> > files, and when there are multiple links, it's still reliable if you
> > discover all the st_nlink paths to a file.  However, it isn't reliable
> > if there are any bind mounts, because you don't know whether you have
> > all paths to a file (without grubbing inside /proc/mounts, and that
> > has race conditions anyway).
> 
> bind mounts aside how are these programs supposed to work with multiple
> mounts of the same filesystem at different mountpoints, possibly in
> totally different namespaces?  In short everything using dnotify is buggy
> anyway..

dnotify listeners are keyed on parent inode of dentry, not (dentry,
mnt).  So they work fine with multiple mounts at different mountpoints
and in different namespaces.  The sole exception is the mountpoint
itself, which fails even with single mounts.  dnotify just doesn't
handle mountpoints.

In principle anyway.  I've never tried this.

Your point has got me thinking, though (thanks).  dnotify is fine with
bind mounts, for the same reason, again with the sole exception of the
mountpoint.

It would be good to get notification of mount changes anyway, and that
would solve the exceptions.

There is still the problem where you can find st_nlink paths and
incorrectly assume that's all the names of a file.  That's a problem
with all the examples I gave ("cp -a" etc.); nothing specifically to
do with dnotify, and it happens with any multiple mounts not just bind
mounts.  If there's a way to detect when a path walk crosses a mount
point even though st_dev doesn't change, that problem would be solved too.
O_NOMOUNT would be one method.

-- Jamie

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 21:08                 ` viro
@ 2004-04-03  0:39                   ` Jamie Lokier
  2004-04-05 14:07                   ` Stephen C. Tweedie
  1 sibling, 0 replies; 47+ messages in thread
From: Jamie Lokier @ 2004-04-03  0:39 UTC (permalink / raw)
  To: viro
  Cc: Linus Torvalds, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Dave Kleikamp

viro@parcelfarce.linux.theplanet.co.uk wrote:
> > The trouble with that is the many programs which assume that if
> > st_nlink == 1, then the file has only one path.  This is a critical
> > optimisation for any program which looks at a lot of files checking
> > for equivalent files, and is widely assumed.  I've always thought it a
> > reliable basic unix assumption.
> 
> Even aside of (in)accuracy of ->st_nlink, that assumption is obviously
> racy.

Yes it is, it's nevertheless a common assumption, and it used to be
accurate when that area of the filesystem was known to be not
changing.  It's still accurate when that area is known to have no bind
mounts.

(Fwiw, I'll admit that I have 1 user who has a bind mount in their
home directory and doesn't know it...)

> > For example: rsync -H, cp -a, and Emacs backup-by-copying-when-linked.
> > You may argue that using "rsync -H" or "cp -a" on a tree that contains
> > bind mounts is broken by design.
> 
> What would you expect from cp -a in that case?  To create a link?  That
> would be a bug, plain and simple...

In fact, cp -a, rsync -H and tar all do this: if the file has st_nlink
== 1, each mounted occurence of the file is copied.  If the file has
st_nlink != 1, each mounted occurence is hard linked together in the
destination.

I'm surprised none of them implement the optimisation where after
seeing st_nlink links, it can forget about that (dev, inode) pair.
Perhaps it's not reliable to assume that on other OSes either.

> > Then there are programs such as optimised Make and cacheing systems
> > and servers which use dnotify.  dnotify is reliable for single-linked
> > files, and when there are multiple links, it's still reliable if you
> > discover all the st_nlink paths to a file.  However, it isn't reliable
> > if there are any bind mounts, because you don't know whether you have
> > all paths to a file (without grubbing inside /proc/mounts, and that
> > has race conditions anyway).
> 
> dnotify is not reliable, period.

Everyone keeps saying how dnotify is unreliable, crap etc. despite the
multitude of good performance enhancing uses for it.  The principle is
not _intrinsically_ crap even though the implementation could do with
much improvement.

> We do NOT generate any events on the
> parent of target; only on parent of new link.

I'm stunned.  An attribute of the file is changed (nlink), by an
operation which follows the target path, and no DN_ATTRIB event is
generated.  I consider that a dnotify implementation disaster.

> So "discover all the st_nlink paths" is both racy and unreliable -
> you won't get notified when links are created.

I agree, but you _should_ be notified.  Not least because without this
dnotify is useless, and with it dnotify is reliable (provided you only
use it to watch single-linked files that aren't mount points).

> > The obvious strategy for such programs is to ignore bind mounts and
> > give incorrect results if there are any.  However it would be much
> > better if they could detect when a file has multiple paths that aren't
> > mentioned in st_nlink, so they wouldn't depend on dnotify in such cases.
> 
> How would they detect when such paths appear in the future?  For that matter,
> who said that you can see all paths in question?

First question: I didn't realise DN_ATTRIB events weren't sent when
nlink is incremented by link().  Obviously they should be, it's an
extremely important attribute for dnotify users.

Second question: it doesn't matter.  There are lots of situations
where you can't use dnotify, and you'll have to call stat() or
whatever each time you want to check if a file has changed.  That's
one of them.

>From my point of view, the useful aspect of dnotify is to avoid large
numbers of stat() calls per complex cached operation which reads from
a filesystem.  Hence mutterings of optimised Make etc.  It's not about
saving syscalls, it's an algorithmic scalability thing: O(n) becomes
O(1).  Sure, I could stop using files and store everything in a
database, but that's a huge loss of convenience.  Filesystems are
nice.  Have you ever tried to work with source code in databases? :)

> > There's a precedent for different views of a filesystem having
> > different st_dev values.  Think about loopback NFS: possibly multiple
> > NFS filesystems, and a real one, all referring to the same set of
> > files, and _all_ of them have different st_dev values.  Semantically a
> > bind mount does not seem so different from that.
> 
> Except that bind mount is not different from "normal" one - same situation
> as with hard links.

I agree, I misunderstood something: bind mounts are not much different
from multiple mounts of the same filesystem.  I was comparing with
traditional unix and old linux, where a filesystem could be mounted once.

Multiple/bind mounts have two significant differences from traditional
single mounts: 1. st_nlink doesn't reflect the number of different
paths visible to a program; 2. st_dev doesn't change when you cross a
mount point.

st_nlink affects programs like "cp -a", and we don't know what sane
semantics of "cp -a" should be anyway, so it's not really a problem.

The Emacs backup-by-copying-when-linked problem is tricker: if a file
has multiple names, we really do want backup by copying, not renaming.
But there is no way for it to know that a file has another name, so
Emacs will rename a file to make a backup, which usually destroys the
intent of the bind mount.  (It'll fail to rename a mount point, but
renaming the target of a mount succeeds, and that can be a file).

st_dev is not a problem as such.  The problem is that the mount point
isn't detectable.  In many cases, an O_NOMOUNT flag would sort that
out: the flag meaning to fail a path walk that tries to cross mounts.

-- Jamie

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 23:23                   ` Trond Myklebust
@ 2004-04-03  0:53                     ` Neil Brown
  0 siblings, 0 replies; 47+ messages in thread
From: Neil Brown @ 2004-04-03  0:53 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Alexander Viro, Linus Torvalds, David Woodhouse,
	Martin Schwidefsky, Andrew Morton, Carsten Otte, Carsten Otte,
	linux-fsdevel, sct, Dave Kleikamp

On Friday April 2, trond.myklebust@fys.uio.no wrote:
> On Fri, 2004-04-02 at 18:06, viro@parcelfarce.linux.theplanet.co.uk
> wrote:
> > Umm...   We are talking about server, not client.
> 
> Yeah, I noticed only after I had gone though the rest of the thread.
> Neil might have some comments, but I believe that what I answered in my
> other reply (that ESTALE is acceptable) is correct as far as the NFS
> specs go.
> 

My only comment is :  I agree with Trond.

NeilBrown

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
  2004-04-02 21:08                 ` viro
  2004-04-03  0:39                   ` Jamie Lokier
@ 2004-04-05 14:07                   ` Stephen C. Tweedie
  1 sibling, 0 replies; 47+ messages in thread
From: Stephen C. Tweedie @ 2004-04-05 14:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Jamie Lokier, Linus Torvalds, David Woodhouse, Martin Schwidefsky,
	Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel,
	Dave Kleikamp, Stephen Tweedie

Hi,

On Fri, 2004-04-02 at 22:08, viro@parcelfarce.linux.theplanet.co.uk
wrote:

> Even aside of (in)accuracy of ->st_nlink, that assumption is obviously
> racy.
> 
> > For example: rsync -H, cp -a, and Emacs backup-by-copying-when-linked.
> > You may argue that using "rsync -H" or "cp -a" on a tree that contains
> > bind mounts is broken by design.
> 
> What would you expect from cp -a in that case?  To create a link?  That
> would be a bug, plain and simple...

I don't see why it's a bug.  The only two reasonable behaviours "cp -a"
has available to it that I can think of are (1) copy the file twice, (2)
hard-link it.  The hard-link seems less broken than copy-twice in this
case: it preserves the "one inode, two paths" situation that was present
on the input.  Unless you think "cp -a" should be copying mount points. 
Ugh.

--Stephen

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2004-04-05 14:12 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-02-19 12:21 [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure Carsten Otte
2004-02-19 16:53 ` Linus Torvalds
2004-02-19 17:39   ` Stephen C. Tweedie
2004-02-19 18:49     ` Andrew Morton
2004-02-19 20:28       ` Carsten Otte
2004-02-19 20:26         ` viro
2004-02-19 20:35           ` Carsten Otte
2004-02-19 20:14     ` Carsten Otte
2004-02-20  3:41       ` Andrew Morton
2004-02-19 20:19   ` Carsten Otte
     [not found] ` <20040220164325.659c4e45.akpm@osdl.org>
     [not found]   ` <200402241338.57855.cotte@freenet.de>
2004-02-24 22:55     ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2004-02-19 18:00 Martin Schwidefsky
2004-03-29 19:07 Martin Schwidefsky
2004-03-29 20:11 ` Linus Torvalds
2004-03-29 20:29   ` Dave Kleikamp
2004-03-30 11:57 Martin Schwidefsky
2004-03-30 13:39 ` David Woodhouse
2004-03-30 14:16   ` Matthew Wilcox
2004-03-30 15:51   ` Linus Torvalds
2004-04-02 16:12     ` viro
2004-04-02 18:01       ` viro
2004-04-02 18:52         ` Linus Torvalds
2004-04-02 19:02           ` Linus Torvalds
2004-04-02 19:10             ` viro
2004-04-02 19:07           ` viro
2004-04-02 20:23             ` viro
2004-04-02 22:40               ` Trond Myklebust
2004-04-02 23:06                 ` viro
2004-04-02 23:23                   ` Trond Myklebust
2004-04-03  0:53                     ` Neil Brown
2004-04-02 23:19                 ` Trond Myklebust
2004-04-02 19:17           ` Jamie Lokier
2004-04-02 19:25             ` viro
2004-04-02 19:32             ` Linus Torvalds
2004-04-02 19:37               ` viro
2004-04-02 19:45                 ` Linus Torvalds
2004-04-02 20:08                   ` viro
2004-04-02 20:40               ` Jamie Lokier
2004-04-02 20:59                 ` Christoph Hellwig
2004-04-02 21:09                   ` viro
2004-04-02 23:42                   ` Jamie Lokier
2004-04-02 21:08                 ` viro
2004-04-03  0:39                   ` Jamie Lokier
2004-04-05 14:07                   ` Stephen C. Tweedie
2004-03-30 15:07 ` Linus Torvalds
2004-04-02 16:14   ` viro
2004-03-30 15:13 Martin Schwidefsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox