public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
@ 2004-03-30 11:57 Martin Schwidefsky
  2004-03-30 13:39 ` David Woodhouse
  2004-03-30 15:07 ` Linus Torvalds
  0 siblings, 2 replies; 47+ messages in thread
From: Martin Schwidefsky @ 2004-03-30 11:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Al Viro, Dave Kleikamp





Hi Linus,

> Is this trace all from the same inode?

Yes, its all for the same inode.

> I saw this problem a while back and fixed it in jfs_link by not allowing
> a link to succeed if i_link was zero.  I didn't realize at the time it
> was a problem in ext3.  The jfs fix:
> http://jfs.bkbits.net:8080/linux-2.5/cset@1.1380.5.2

I'm just testing the analog ext3 patch:

Index: namei.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/ext3/namei.c,v
retrieving revision 1.25
diff -u -r1.25 namei.c
--- namei.c 18 Feb 2004 17:44:58 -0000    1.25
+++ namei.c 30 Mar 2004 11:48:15 -0000
@@ -2136,6 +2136,9 @@
      if (inode->i_nlink >= EXT3_LINK_MAX)
            return -EMLINK;

+     if (inode->i_nlink == 0)
+           return -ENOENT;
+
      handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
                              EXT3_INDEX_EXTRA_TRANS_BLOCKS);
      if (IS_ERR(handle))

So far it works but the race is quite hard to trigger. I'll have to stress
the system for a few hours/days.

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
@ 2004-03-30 15:13 Martin Schwidefsky
  0 siblings, 0 replies; 47+ messages in thread
From: Martin Schwidefsky @ 2004-03-30 15:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel, sct,
	Al Viro, Dave Kleikamp, David Woodhouse





> I'm just testing the analog ext3 patch:
>
> Index: namei.c
> ===================================================================
> RCS file: /home/cvs/linux-2.5/fs/ext3/namei.c,v
> retrieving revision 1.25
> diff -u -r1.25 namei.c
> --- namei.c     18 Feb 2004 17:44:58 -0000    1.25
> +++ namei.c     30 Mar 2004 11:48:15 -0000
> @@ -2136,6 +2136,9 @@
>     if (inode->i_nlink >= EXT3_LINK_MAX)
>           return -EMLINK;
>
> +   if (inode->i_nlink == 0)
> +         return -ENOENT;
> +
>     handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
>                             EXT3_INDEX_EXTRA_TRANS_BLOCKS);
>     if (IS_ERR(handle))
>
> So far it works but the race is quite hard to trigger. I'll have to stress
> the system for a few hours/days.

Ok, I compiled a lot of glibcs today which is the workload that I use to
trigger the problem. With the patch the problem disappears. Just to make
sure I removed the patch again and sure enough the machine crashed after
1/2 hour. This patch solves the problem for me. Davids patch for fs/namei.c
seems to me the generic solution for all filesystems.

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
@ 2004-03-29 19:07 Martin Schwidefsky
  2004-03-29 20:11 ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Martin Schwidefsky @ 2004-03-29 19:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Carsten Otte, Carsten Otte, linux-fsdevel, sct, torvalds





> Sigh.  This doesn't help at all really.  All we know is that we have an
> inode which is still on the orphan list when it shouldn't be.  Later, the
> dentry was reaped and the inode gets moved to inode_unused, generating
this
> trace.   Later still, the inode is reaped off inode_unused and bang.
>
> Could you chage it so we generate a backtrace in iput() if the inode has
an
> i_count of one (the dentry) and it is still on the orphan list?

Finally I was able to recreate the crash with my inode-trace patch applied.
The machine is in a bad mood again, the first time I am happy to see it crash..

The inode-trace patch logs every change to i_count of every inode in vfs and
ext3. I can now present the life of a broken inode:

cpu0 0.000000 sec: <alloc_inode> <new_inode> <ext3_new_inode> <ext3_create>
cpu0 0.001145 sec: <sys_unlink> <system_call>
cpu0 0.001153 sec: <ext3_orphan_add> <ext3_unlink> <vfs_unlink> <sys_unlink>
cpu0 0.001164 sec: <iput> <sys_unlink> <system_call>
cpu5 0.001174 sec: <ext3_link> <vfs_link> <sys_link> <system_call>
cpu5 0.001186 sec: <iput> <dput> <path_release> <sys_link>
cpu4 1.060962 sec: <__iget> <sync_sb_inodes> <writeback_inodes> <background_writeout>
cpu4 1.060970 sec: <iput> <sync_sb_inodes> <writeback_inodes> <background_writeout>
cpu3 1.273330 sec: <iput> <prune_dcache> <shrink_dcache_memory> <shrink_slab>
cpu3 1.273331 sec: <iput_final> <prune_dcache> <shrink_dcache_memory> <shrink_slab>
cpu3 1.280405 sec: <destroy_inode> <prune_dcache> <prune_icache> <shrink_icache_memory>
cpu3 1.280405 sec: <ext3_destroy_inode> <destroy_inode> <dispose_list> <prune_icache>

What I find suspicious is the sys_unlink followed by ext3_orphan_add
on cpu 0 vs. ext3_link on cpu 5.

I hope this helps. If not what else can I trace?

blue skies,
   Martin

Debug-Patch:

Index: arch/s390/boot/kerntypes.c
===================================================================
RCS file: /home/cvs/linux-2.5/arch/s390/boot/kerntypes.c,v
retrieving revision 1.1
diff -u -r1.1 kerntypes.c
--- arch/s390/boot/kerntypes.c      17 Mar 2003 16:45:34 -0000    1.1
+++ arch/s390/boot/kerntypes.c      26 Mar 2004 12:50:23 -0000
@@ -51,3 +51,4 @@
  *    - struct runqueue
  */
 #include "kernel/sched.c"
+#include "arch/s390/kernel/debug.c"
Index: arch/s390/kernel/debug.c
===================================================================
RCS file: /home/cvs/linux-2.5/arch/s390/kernel/debug.c,v
retrieving revision 1.12
diff -u -r1.12 debug.c
--- arch/s390/kernel/debug.c  27 May 2003 11:33:55 -0000    1.12
+++ arch/s390/kernel/debug.c  26 Mar 2004 12:50:25 -0000
@@ -1198,3 +1198,21 @@
 EXPORT_SYMBOL(debug_sprintf_view);
 EXPORT_SYMBOL(debug_sprintf_exception);
 EXPORT_SYMBOL(debug_sprintf_event);
+
+DEFINE_PER_CPU(struct inode_log *, inode_logs);
+
+void INODE_DBF_init(void)
+{
+     struct inode_log *log;
+     int i;
+
+     log = __get_cpu_var(inode_logs);
+     for_each_cpu(i) {
+           log = (void *) __get_free_pages(GFP_KERNEL, 0);
+           log->index = 0;
+           log->entries = (void *) __get_free_pages(GFP_KERNEL, 8);
+           memset(log->entries, 0, 1024*1024);
+           per_cpu(inode_logs, i) = log;
+           printk("cpu %i log %p\n", i, log);
+     }
+}
Index: fs/inode.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/inode.c,v
retrieving revision 1.42
diff -u -r1.42 inode.c
--- fs/inode.c    11 Mar 2004 09:59:47 -0000    1.42
+++ fs/inode.c    26 Mar 2004 12:50:33 -0000
@@ -151,6 +151,7 @@
                  mapping->backing_dev_info = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
            memset(&inode->u, 0, sizeof(inode->u));
            inode->i_mapping = mapping;
+           INODE_DBF(inode, "allc");
      }
      return inode;
 }
@@ -159,6 +160,7 @@
 {
      if (inode_has_buffers(inode))
            BUG();
+     INODE_DBF(inode, "free");
      security_inode_free(inode);
      if (inode->i_sb->s_op->destroy_inode)
            inode->i_sb->s_op->destroy_inode(inode);
@@ -211,6 +213,8 @@
  */
 void __iget(struct inode * inode)
 {
+     INODE_DBF(inode,"iget");
+
      if (atomic_read(&inode->i_count)) {
            atomic_inc(&inode->i_count);
            return;
@@ -1074,6 +1073,8 @@
      struct super_operations *op = inode->i_sb->s_op;
      void (*drop)(struct inode *) = generic_drop_inode;

+     INODE_DBF(inode,"drop");
+
      if (op && op->drop_inode)
            drop = op->drop_inode;
      drop(inode);
@@ -1091,6 +1092,8 @@
      if (inode) {
            struct super_operations *op = inode->i_sb->s_op;

+           INODE_DBF(inode,"iput");
+
            if (inode->i_state == I_CLEAR)
                  BUG();

Index: fs/libfs.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/libfs.c,v
retrieving revision 1.25
diff -u -r1.25 libfs.c
--- fs/libfs.c    18 Feb 2004 17:44:57 -0000    1.25
+++ fs/libfs.c    26 Mar 2004 12:50:33 -0000
@@ -229,6 +229,7 @@

      inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
      inode->i_nlink++;
+     INODE_DBF(inode,"slnk");
      atomic_inc(&inode->i_count);
      dget(dentry);
      d_instantiate(dentry, inode);
Index: fs/namei.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/namei.c,v
retrieving revision 1.33
diff -u -r1.33 namei.c
--- fs/namei.c    11 Mar 2004 09:59:48 -0000    1.33
+++ fs/namei.c    26 Mar 2004 12:50:34 -0000
@@ -1726,8 +1726,10 @@
            if (nd.last.name[nd.last.len])
                  goto slashes;
            inode = dentry->d_inode;
-           if (inode)
+           if (inode) {
                  atomic_inc(&inode->i_count);
+                 INODE_DBF(inode,"ulnk");
+           }
            error = vfs_unlink(nd.dentry->d_inode, dentry);
      exit2:
            dput(dentry);
Index: fs/ext3/ialloc.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/ext3/ialloc.c,v
retrieving revision 1.22
diff -u -r1.22 ialloc.c
--- fs/ext3/ialloc.c    11 Mar 2004 09:59:50 -0000    1.22
+++ fs/ext3/ialloc.c    26 Mar 2004 12:50:35 -0000
@@ -117,6 +117,7 @@

      ino = inode->i_ino;
      ext3_debug ("freeing inode %lu\n", ino);
+     INODE_DBF(inode, "e3fr");

      /*
       * Note: we must free any quota before locking the superblock,
@@ -687,6 +688,9 @@
      }
 out:
      brelse(bitmap_bh);
+
+     INODE_DBF(inode,"oget");
+
      return inode;
 }

Index: fs/ext3/namei.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/ext3/namei.c,v
retrieving revision 1.25
diff -u -r1.25 namei.c
--- fs/ext3/namei.c     18 Feb 2004 17:44:58 -0000    1.25
+++ fs/ext3/namei.c     26 Mar 2004 12:50:35 -0000
@@ -1877,8 +1877,10 @@
       *
       * This is safe: on error we're going to ignore the orphan list
       * anyway on the next recovery. */
-     if (!err)
+     if (!err) {
            list_add(&EXT3_I(inode)->i_orphan, &EXT3_SB(sb)->s_orphan);
+           INODE_DBF(inode,"oadd");
+     }

      jbd_debug(4, "superblock will point to %ld\n", inode->i_ino);
      jbd_debug(4, "orphan inode %ld will point to %d\n",
@@ -1916,6 +1918,8 @@

      list_del_init(&ei->i_orphan);

+     INODE_DBF(inode,"odel");
+
      /* If we're on an error path, we may not have a valid
       * transaction handle with which to update the orphan list on
       * disk, but we still need to remove the inode from the linked
@@ -2146,7 +2150,10 @@

      inode->i_ctime = CURRENT_TIME;
      ext3_inc_count(handle, inode);
+
      atomic_inc(&inode->i_count);
+
+     INODE_DBF(inode,"link");

      err = ext3_add_nondir(handle, dentry, inode);
      ext3_journal_stop(handle);
Index: fs/ext3/super.c
===================================================================
RCS file: /home/cvs/linux-2.5/fs/ext3/super.c,v
retrieving revision 1.41
diff -u -r1.41 super.c
--- fs/ext3/super.c     11 Mar 2004 09:59:51 -0000    1.41
+++ fs/ext3/super.c     26 Mar 2004 12:50:35 -0000
@@ -449,6 +449,10 @@

 static void ext3_destroy_inode(struct inode *inode)
 {
+     if (!list_empty(&EXT3_I(inode)->i_orphan)) {
+           INODE_DBF(inode,"BUG!");
+           cpcmd("IPL 5E0A", NULL, NULL);
+     }
      kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
 }

Index: include/linux/fs.h
===================================================================
RCS file: /home/cvs/linux-2.5/include/linux/fs.h,v
retrieving revision 1.56
diff -u -r1.56 fs.h
--- include/linux/fs.h  11 Mar 2004 10:00:24 -0000    1.56
+++ include/linux/fs.h  26 Mar 2004 12:50:39 -0000
@@ -866,7 +866,6 @@
 struct super_operations {
      struct inode *(*alloc_inode)(struct super_block *sb);
      void (*destroy_inode)(struct inode *);
-     int  (*inode_busy)(struct inode *);

      void (*read_inode) (struct inode *);

@@ -1440,4 +1439,58 @@
 #endif     /* CONFIG_SECURITY */

 #endif /* __KERNEL__ */
+
+#include <asm/debug.h>
+
+struct inode_log_entry {
+     unsigned long long clock;
+     unsigned long event;
+     unsigned long inode;
+     unsigned long bc1;
+     unsigned long bc2;
+     unsigned long bc3;
+     unsigned long bc4;
+};
+
+struct inode_log {
+     struct inode_log_entry *entries;
+     unsigned int index;
+};
+
+DECLARE_PER_CPU(struct inode_log *, inode_logs);
+
+void __INODE_DBF(struct inode *inode, char *tag);
+
+static inline void INODE_DBF(struct inode *inode, char *tag)
+{
+     struct inode_log *log;
+     struct inode_log_entry *entry;
+     unsigned long bc;
+
+     log = get_cpu_var(inode_logs);
+     if (log) {
+           entry = log->entries + log->index;
+           log->index = (log->index + 1) & 0x7fff;
+           memset(&entry, sizeof(struct inode_log_entry), 0);
+           entry->clock = get_clock();
+           memcpy(&entry->event, tag, 4);
+           entry->inode = (unsigned long) inode;
+           asm volatile("lr %0,14" : "=d" (bc) );
+           entry->bc1 = bc;
+           asm volatile("lr %0,15" : "=d" (bc) );
+           bc = (*(unsigned long *) bc) & PSW_ADDR_INSN;
+           if (bc) {
+                 entry->bc2 = *(unsigned long *) (bc+56);
+                 bc = (*(unsigned long *) bc) & PSW_ADDR_INSN;
+                 if (bc) {
+                       entry->bc3 = *(unsigned long *) (bc+56);
+                       bc = (*(unsigned long *) bc) & PSW_ADDR_INSN;
+                       if (bc)
+                             entry->bc4 = *(unsigned long *) (bc+56);
+                 }
+           }
+     }
+     put_cpu_var(inode_logs);
+}
+
 #endif /* _LINUX_FS_H */
Index: init/main.c
===================================================================
RCS file: /home/cvs/linux-2.5/init/main.c,v
retrieving revision 1.44
diff -u -r1.44 main.c
--- init/main.c   11 Mar 2004 10:00:30 -0000    1.44
+++ init/main.c   26 Mar 2004 12:50:40 -0000
@@ -599,6 +599,7 @@
      do_pre_smp_initcalls();

      smp_init();
+     INODE_DBF_init();
      do_basic_setup();

      prepare_namespace();

Raw-Data for inode 0x0ba08d90:
bafd8d87 c850130a 616c6c63 0ba08d90 8008127a 80082060 800a34e8 800ab3ea : .....P..allc.......z.. `..4..... cpu0
bafd8d87 c897ae40 756c6e6b 0ba08d90 80077400 8001fc92 8001fc92 800495f8 : .......@ulnk......t............. cpu0
bafd8d87 c8982880 6f616464 0ba08d90 800abc88 800ac3b6 8007726e 800774d2 : ......(.oadd..............rn..t. cpu0
bafd8d87 c898d880 69707574 0ba08d90 0c2a1ec0 8007752c 8001fc92 80082bb8 : ........iput.....*....u,......+. cpu0
bafd8d87 c89982c8 6c696e6b 0ba08d90 800ac60a 800777ee 800779a6 8001fc92 : ........link..........w...y..... cpu5
bafd8d87 c89a3ac8 69707574 0ba08d90 0cb41de0 8007ef26 8007464a 80077930 : ......:.iput...........&..FJ..y0 cpu5
bafd8d88 cb563f45 69676574 0ba08d90 80089e32 80089e32 80089fc2 800495f8 : .....V?Eiget.......2...2........ cpu4
bafd8d88 cb56b405 69707574 0ba08d90 0f541ba0 80089e66 80089fc2 800495f8 : .....V..iput.....T.....f........ cpu4
bafd8d88 ff2f4280 69707574 0ba08d90 0f53fc08 8007f61e 8007fbf8 8004f5c4 : ...../B.iput.....S.............. cpu3
bafd8d88 ff2f44c0 64726f70 0ba08d90 80082cca 8007f61e 8007fbf8 8004f5c4 : ...../D.drop......,............. cpu3
bafd8d89 00e96907 66726565 0ba08d90 8008141a 80081930 80081d66 80081f50 : ......i.free...........0...f...P cpu3
bafd8d89 00e96a07 42554721 0ba08d90 0f53fb20 800814b8 80081930 80081d66 : ......j.BUG!.....S. .......0...f cpu3


^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
@ 2004-02-19 18:00 Martin Schwidefsky
  0 siblings, 0 replies; 47+ messages in thread
From: Martin Schwidefsky @ 2004-02-19 18:00 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrew Morton, Carsten Otte, Carsten Otte, linux-fsdevel,
	Stephen Tweedie, Linus Torvalds






Hi Stephen,

> I may be missing some background from previous emails --- is this just a
> hypothetical race, or have we actually seen this mechanism breaking down
> in practice?

I have seen this in practice on a s390 machine with 6 processor, 256 MB main
storage and compiling the glibc. I limited the storage to 256 MB on purpose
to get the machine to do heavy swapping. In the middle of the glibc build
it generates the locales and this uses a lot of storage with -j6. The crash
itself is anywhere and it took me a while to find out that a list_add in
ext3_orphan_add accessed the list_head in an already freed inode.

For easier recreation I added a BUG_ON in ext3_destroy_inode:

diff -ruN linux-2.6.2/fs/ext3/super.c
linux-2.6.2+bug_statement/fs/ext3/super.c
--- linux-2.6.2/fs/ext3/super.c 2004-02-19 12:52:01.000000000 +0100
+++ linux-2.6.2+bug_statement/fs/ext3/super.c   2004-02-19 12:51:35.000000000
+0100
@@ -449,6 +449,7 @@+       BUG_ON (!list_empty(&EXT3_I(inode)->i_orphan));
I can easily trigger the BUG_ON on my machine with a simple rpmbuild.

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread
* [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure
@ 2004-02-19 12:21 Carsten Otte
  2004-02-19 16:53 ` Linus Torvalds
       [not found] ` <20040220164325.659c4e45.akpm@osdl.org>
  0 siblings, 2 replies; 47+ messages in thread
From: Carsten Otte @ 2004-02-19 12:21 UTC (permalink / raw)
  To: linux-fsdevel, akpm, torvalds; +Cc: schwidefsky, cotte

Hi all,

recently we ran into a problem found during our 2.6. test activities on s390 
architecture. The problem occured running a glibc build on Linux 2.6.2. 
running in a z/VM virtual machine with 6 processors on a z990 Server 
(6Processor SMP, 64-bit, big endian).
We were able to identify ext3 as the cause of the problem with the following 
debugging patch:
diff -ruN linux-2.6.2/fs/ext3/super.c 
linux-2.6.2+bug_statement/fs/ext3/super.c
--- linux-2.6.2/fs/ext3/super.c 2004-02-19 12:52:01.000000000 +0100
+++ linux-2.6.2+bug_statement/fs/ext3/super.c   2004-02-19 12:51:35.000000000 
+0100
@@ -449,6 +449,7 @@
 
 static void ext3_destroy_inode(struct inode *inode)
 {
+       BUG_ON (!list_empty(&EXT3_I(inode)->i_orphan));
        kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
 }
 

The output of the BUG_ON statement shows, that prune_icache [fs/inode.c] picks 
an inode with i_count == 0 and calls (via dispose_list) clear_inode() and 
destroy_inode(). Because the inode is still in use by ext3 internally, ext3 
later on reads from or writes to freed memory causing random behavior of the 
system. I think, this BUG_ON should go into the vanilla kernel to prevent 
data corruption.

Above was so far only reproducable in rare conditions when the system was 
under heavy memory pressure. Therefore I continued to debug this further:
The problem seemed to be related to reference counting of inodes. Therefore I 
added a debugging patch to iput_final() [fs/inode.c], that checks if an inode 
that is being dropped is still in the ext3 orphan list. This situation is 
quite easy to reproduce, and has shown one example path were this happens:
A user process calls sys_unlink() [fs/namei.c], which increments i_count, 
calls vfs_unlink [fs/namei.c] and afterwards calls iput().
vfs_unlink [fs/namei.c] works with the dentry, calls i_op->unlink() (in this 
case ext3_unlink) and returns.
ext3_unlink [FILE] decrements i_nlink and adds the inode to the s_orphan list 
before returning.
After sys_unlink() has completed, the inode is still referenced by ext3 while 
i_count has the same value like before, which triggers the problem in case 
prune_icache would now choose this inode to be freed.

Possible ways to fix above problem is change reference counting for inodes or 
make the prune_icache function aware of the internal reference to the inode 
(preferably without knowing about the internal data structures of the 
filesystem which would be a layering violation). The patch below does take 
the 2nd approach:
- adds additional super_operation s_op->inode_busy() allowing VFS to query if 
an inode is still internally referenced by the fs.
- adds ext3_inode_busy() to ext3 that checks if he inode is still internally 
referenced
- changes prune_icache to query inode_busy() in case this s_op is implemented 
by the individual fs

Patch fixing the problem:
diff -ruN linux-2.6.2/fs/ext3/super.c linux-2.6.2+ext3fix/fs/ext3/super.c
--- linux-2.6.2/fs/ext3/super.c 2004-02-19 12:46:35.000000000 +0100
+++ linux-2.6.2+ext3fix/fs/ext3/super.c 2004-02-04 17:50:03.000000000 +0100
@@ -453,6 +453,14 @@
        kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
 }
 
+static int ext3_inode_busy(struct inode *inode)
+{
+       if (!list_empty(&EXT3_I(inode)->i_orphan)) 
+               return 1;
+       else
+               return 0;
+}
+
 static void init_once(void * foo, kmem_cache_t * cachep, unsigned long flags)
 {
        struct ext3_inode_info *ei = (struct ext3_inode_info *) foo;
@@ -510,6 +518,7 @@
 static struct super_operations ext3_sops = {
        .alloc_inode    = ext3_alloc_inode,
        .destroy_inode  = ext3_destroy_inode,
+       .inode_busy     = ext3_inode_busy,
        .read_inode     = ext3_read_inode,
        .write_inode    = ext3_write_inode,
        .dirty_inode    = ext3_dirty_inode,
diff -ruN linux-2.6.2/fs/inode.c linux-2.6.2+ext3fix/fs/inode.c
--- linux-2.6.2/fs/inode.c      2004-02-19 12:46:35.000000000 +0100
+++ linux-2.6.2+ext3fix/fs/inode.c      2004-01-28 09:02:25.000000000 +0100
@@ -391,6 +391,9 @@
                return 0;
        if (inode->i_data.nrpages)
                return 0;
+       if (inode->i_sb->s_op->inode_busy
+           && inode->i_sb->s_op->inode_busy(inode))
+               return 0;
        return 1;
 }
 
@@ -424,7 +427,9 @@
 
                inode = list_entry(inode_unused.prev, struct inode, i_list);
 
-               if (inode->i_state || atomic_read(&inode->i_count)) {
+               if (inode->i_state || atomic_read(&inode->i_count)
+                   || (inode->i_sb->s_op->inode_busy 
+                        && (inode->i_sb->s_op->inode_busy(inode)))) {
                        list_move(&inode->i_list, &inode_unused);
                        continue;
                }
diff -ruN linux-2.6.2/include/linux/fs.h 
linux-2.6.2+ext3fix/include/linux/fs.h
--- linux-2.6.2/include/linux/fs.h      2004-02-19 12:46:35.000000000 +0100
+++ linux-2.6.2+ext3fix/include/linux/fs.h      2004-02-18 18:45:15.000000000 
+0100
@@ -866,6 +866,7 @@
 struct super_operations {
        struct inode *(*alloc_inode)(struct super_block *sb);
        void (*destroy_inode)(struct inode *);
+       int  (*inode_busy)(struct inode *);
 
        void (*read_inode) (struct inode *);
   

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2004-04-05 14:12 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-03-30 11:57 [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure Martin Schwidefsky
2004-03-30 13:39 ` David Woodhouse
2004-03-30 14:16   ` Matthew Wilcox
2004-03-30 15:51   ` Linus Torvalds
2004-04-02 16:12     ` viro
2004-04-02 18:01       ` viro
2004-04-02 18:52         ` Linus Torvalds
2004-04-02 19:02           ` Linus Torvalds
2004-04-02 19:10             ` viro
2004-04-02 19:07           ` viro
2004-04-02 20:23             ` viro
2004-04-02 22:40               ` Trond Myklebust
2004-04-02 23:06                 ` viro
2004-04-02 23:23                   ` Trond Myklebust
2004-04-03  0:53                     ` Neil Brown
2004-04-02 23:19                 ` Trond Myklebust
2004-04-02 19:17           ` Jamie Lokier
2004-04-02 19:25             ` viro
2004-04-02 19:32             ` Linus Torvalds
2004-04-02 19:37               ` viro
2004-04-02 19:45                 ` Linus Torvalds
2004-04-02 20:08                   ` viro
2004-04-02 20:40               ` Jamie Lokier
2004-04-02 20:59                 ` Christoph Hellwig
2004-04-02 21:09                   ` viro
2004-04-02 23:42                   ` Jamie Lokier
2004-04-02 21:08                 ` viro
2004-04-03  0:39                   ` Jamie Lokier
2004-04-05 14:07                   ` Stephen C. Tweedie
2004-03-30 15:07 ` Linus Torvalds
2004-04-02 16:14   ` viro
  -- strict thread matches above, loose matches on Subject: below --
2004-03-30 15:13 Martin Schwidefsky
2004-03-29 19:07 Martin Schwidefsky
2004-03-29 20:11 ` Linus Torvalds
2004-03-29 20:29   ` Dave Kleikamp
2004-02-19 18:00 Martin Schwidefsky
2004-02-19 12:21 Carsten Otte
2004-02-19 16:53 ` Linus Torvalds
2004-02-19 17:39   ` Stephen C. Tweedie
2004-02-19 18:49     ` Andrew Morton
2004-02-19 20:28       ` Carsten Otte
2004-02-19 20:26         ` viro
2004-02-19 20:35           ` Carsten Otte
2004-02-19 20:14     ` Carsten Otte
2004-02-20  3:41       ` Andrew Morton
2004-02-19 20:19   ` Carsten Otte
     [not found] ` <20040220164325.659c4e45.akpm@osdl.org>
     [not found]   ` <200402241338.57855.cotte@freenet.de>
2004-02-24 22:55     ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox