From mboxrd@z Thu Jan 1 00:00:00 1970 From: Carsten Otte Subject: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure Date: Thu, 19 Feb 2004 13:21:39 +0100 Sender: linux-fsdevel-owner@vger.kernel.org Message-ID: <200402191321.39592.cotte@freenet.de> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: schwidefsky@de.ibm.com, cotte@de.ibm.com Return-path: Received: from mout2.freenet.de ([194.97.50.155]:60631 "EHLO mout2.freenet.de") by vger.kernel.org with ESMTP id S267215AbUBSMVu convert rfc822-to-8bit (ORCPT ); Thu, 19 Feb 2004 07:21:50 -0500 To: linux-fsdevel@vger.kernel.org, akpm@digeo.com, torvalds@osdl.org Content-Disposition: inline List-Id: linux-fsdevel.vger.kernel.org Hi all, recently we ran into a problem found during our 2.6. test activities on= s390=20 architecture. The problem occured running a glibc build on Linux 2.6.2.= =20 running in a z/VM virtual machine with 6 processors on a z990 Server=20 (6Processor SMP, 64-bit, big endian). We were able to identify ext3 as the cause of the problem with the foll= owing=20 debugging patch: diff -ruN linux-2.6.2/fs/ext3/super.c=20 linux-2.6.2+bug_statement/fs/ext3/super.c --- linux-2.6.2/fs/ext3/super.c=A02004-02-19 12:52:01.000000000 +0100 +++ linux-2.6.2+bug_statement/fs/ext3/super.c=A0=A0=A02004-02-19 12:51:= 35.000000000=20 +0100 @@ -449,6 +449,7 @@ =A0 =A0static void ext3_destroy_inode(struct inode *inode) =A0{ +=A0=A0=A0=A0=A0=A0=A0BUG_ON (!list_empty(&EXT3_I(inode)->i_orphan)); =A0=A0=A0=A0=A0=A0=A0=A0kmem_cache_free(ext3_inode_cachep, EXT3_I(inode= )); =A0} =A0 The output of the BUG_ON statement shows, that prune_icache [fs/inode.c= ] picks=20 an inode with i_count =3D=3D 0 and calls (via dispose_list) clear_inode= () and=20 destroy_inode(). Because the inode is still in use by ext3 internally, = ext3=20 later on reads from or writes to freed memory causing random behavior o= f the=20 system. I think, this BUG_ON should go into the vanilla kernel to preve= nt=20 data corruption. Above was so far only reproducable in rare conditions when the system w= as=20 under heavy memory pressure. Therefore I continued to debug this furthe= r: The problem seemed to be related to reference counting of inodes. There= fore I=20 added a debugging patch to iput_final() [fs/inode.c], that checks if an= inode=20 that is being dropped is still in the ext3 orphan list. This situation = is=20 quite easy to reproduce, and has shown one example path were this happe= ns: A user process calls sys_unlink() [fs/namei.c], which increments i_coun= t,=20 calls vfs_unlink [fs/namei.c] and afterwards calls iput(). vfs_unlink [fs/namei.c] works with the dentry, calls i_op->unlink() (in= this=20 case ext3_unlink) and returns. ext3_unlink [FILE] decrements i_nlink and adds the inode to the s_orpha= n list=20 before returning. After sys_unlink() has completed, the inode is still referenced by ext3= while=20 i_count has the same value like before, which triggers the problem in c= ase=20 prune_icache would now choose this inode to be freed. Possible ways to fix above problem is change reference counting for ino= des or=20 make the prune_icache function aware of the internal reference to the i= node=20 (preferably without knowing about the internal data structures of the=20 filesystem which would be a layering violation). The patch below does t= ake=20 the 2nd approach: - adds additional super_operation s_op->inode_busy() allowing VFS to qu= ery if=20 an inode is still internally referenced by the fs. - adds ext3_inode_busy() to ext3 that checks if he inode is still inter= nally=20 referenced - changes prune_icache to query inode_busy() in case this s_op is imple= mented=20 by the individual fs Patch fixing the problem: diff -ruN linux-2.6.2/fs/ext3/super.c linux-2.6.2+ext3fix/fs/ext3/super= =2Ec --- linux-2.6.2/fs/ext3/super.c=A02004-02-19 12:46:35.000000000 +0100 +++ linux-2.6.2+ext3fix/fs/ext3/super.c=A02004-02-04 17:50:03.000000000= +0100 @@ -453,6 +453,14 @@ =A0=A0=A0=A0=A0=A0=A0=A0kmem_cache_free(ext3_inode_cachep, EXT3_I(inode= )); =A0} =A0 +static int ext3_inode_busy(struct inode *inode) +{ +=A0=A0=A0=A0=A0=A0=A0if (!list_empty(&EXT3_I(inode)->i_orphan))=20 +=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0return 1; +=A0=A0=A0=A0=A0=A0=A0else +=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0return 0; +} + =A0static void init_once(void * foo, kmem_cache_t * cachep, unsigned lo= ng flags) =A0{ =A0=A0=A0=A0=A0=A0=A0=A0struct ext3_inode_info *ei =3D (struct ext3_ino= de_info *) foo; @@ -510,6 +518,7 @@ =A0static struct super_operations ext3_sops =3D { =A0=A0=A0=A0=A0=A0=A0=A0.alloc_inode=A0=A0=A0=A0=3D ext3_alloc_inode, =A0=A0=A0=A0=A0=A0=A0=A0.destroy_inode=A0=A0=3D ext3_destroy_inode, +=A0=A0=A0=A0=A0=A0=A0.inode_busy=A0=A0=A0=A0=A0=3D ext3_inode_busy, =A0=A0=A0=A0=A0=A0=A0=A0.read_inode=A0=A0=A0=A0=A0=3D ext3_read_inode, =A0=A0=A0=A0=A0=A0=A0=A0.write_inode=A0=A0=A0=A0=3D ext3_write_inode, =A0=A0=A0=A0=A0=A0=A0=A0.dirty_inode=A0=A0=A0=A0=3D ext3_dirty_inode, diff -ruN linux-2.6.2/fs/inode.c linux-2.6.2+ext3fix/fs/inode.c --- linux-2.6.2/fs/inode.c=A0=A0=A0=A0=A0=A02004-02-19 12:46:35.0000000= 00 +0100 +++ linux-2.6.2+ext3fix/fs/inode.c=A0=A0=A0=A0=A0=A02004-01-28 09:02:25= =2E000000000 +0100 @@ -391,6 +391,9 @@ =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0return 0; =A0=A0=A0=A0=A0=A0=A0=A0if (inode->i_data.nrpages) =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0return 0; +=A0=A0=A0=A0=A0=A0=A0if (inode->i_sb->s_op->inode_busy +=A0=A0=A0=A0=A0=A0=A0 =A0 =A0&& inode->i_sb->s_op->inode_busy(inode)) +=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0return 0; =A0=A0=A0=A0=A0=A0=A0=A0return 1; =A0} =A0 @@ -424,7 +427,9 @@ =A0 =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0inode =3D list_entry(in= ode_unused.prev, struct inode, i_list); =A0 -=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0if (inode->i_state || ato= mic_read(&inode->i_count)) { +=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0if (inode->i_state || ato= mic_read(&inode->i_count) +=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 =A0 =A0|| (inode->i_sb->= s_op->inode_busy=20 +=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 = && (inode->i_sb->s_op->inode_busy(inode)))) { =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= list_move(&inode->i_list, &inode_unused); =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= continue; =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0} diff -ruN linux-2.6.2/include/linux/fs.h=20 linux-2.6.2+ext3fix/include/linux/fs.h --- linux-2.6.2/include/linux/fs.h=A0=A0=A0=A0=A0=A02004-02-19 12:46:35= =2E000000000 +0100 +++ linux-2.6.2+ext3fix/include/linux/fs.h=A0=A0=A0=A0=A0=A02004-02-18 = 18:45:15.000000000=20 +0100 @@ -866,6 +866,7 @@ =A0struct super_operations { =A0 =A0 =A0=A0=A0=A0struct inode *(*alloc_inode)(struct super_block *sb= ); =A0=A0=A0=A0=A0=A0=A0=A0void (*destroy_inode)(struct inode *); +=A0=A0=A0=A0=A0=A0=A0int =A0(*inode_busy)(struct inode *); =A0 =A0=A0=A0=A0=A0=A0=A0=A0void (*read_inode) (struct inode *); =A0 =A0 - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel= " in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html