From mboxrd@z Thu Jan 1 00:00:00 1970 From: Carsten Otte Subject: Re: [PATCH] ext3 [linux-2.6.2.]: accessing already freed inodes when under memory pressure Date: Thu, 19 Feb 2004 21:14:50 +0100 Sender: linux-fsdevel-owner@vger.kernel.org Message-ID: <200402192114.55214.cotte@freenet.de> References: <200402191321.39592.cotte@freenet.de> <1077212393.2070.571.camel@sisko.scot.redhat.com> Mime-Version: 1.0 Content-Type: Text/Plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Cc: linux-fsdevel@vger.kernel.org, Andrew Morton , schwidefsky@de.ibm.com, cotte@de.ibm.com, Stephen Tweedie Return-path: Received: from mout0.freenet.de ([194.97.50.131]:58849 "EHLO mout0.freenet.de") by vger.kernel.org with ESMTP id S267533AbUBSULW convert rfc822-to-8bit (ORCPT ); Thu, 19 Feb 2004 15:11:22 -0500 To: "Stephen C. Tweedie" , Linus Torvalds In-Reply-To: <1077212393.2070.571.camel@sisko.scot.redhat.com> Content-Description: clearsigned data Content-Disposition: inline List-Id: linux-fsdevel.vger.kernel.org Stephen wrote: > The orphan list entry should persist only as long > as inode->i_count is raised, and if it's raised, there should be no risk > of the struct getting reused. I tried the following *dirty* debugging patch that triggers in iput_final() exactly when i_count is lowered to zero while the orphan list entry does still persist. I have not been able to startup my system, panic output shows the sys_unlink() path -like fully described in the initial mail starting this thread- causes running into the panic() diff -ruN linux-2.6.3/fs/inode.c linux-2.6.3+debug/fs/inode.c --- linux-2.6.3/fs/inode.c 2004-01-28 09:02:25.000000000 +0100 +++ linux-2.6.3+debug/fs/inode.c 2004-02-19 21:03:26.000000000 +0100 @@ -1058,6 +1058,96 @@ generic_forget_inode(inode); } +struct ext3_inode_info { + __u32 i_data[15]; + __u32 i_flags; +#ifdef EXT3_FRAGMENTS + __u32 i_faddr; + __u8 i_frag_no; + __u8 i_frag_size; +#endif + __u32 i_file_acl; + __u32 i_dir_acl; + __u32 i_dtime; + + /* + * i_block_group is the number of the block group which contains + * this file's inode. Constant across the lifetime of the inode, + * it is ued for making block allocation decisions - we try to + * place a file's data blocks near its inode block, and new inodes + * near to their parent directory's inode. + */ + __u32 i_block_group; + __u32 i_state; /* Dynamic state flags for ext3 */ + + /* + * i_next_alloc_block is the logical (file-relative) number of the + * most-recently-allocated block in this file. Yes, it is misnamed. + * We use this for detecting linearly ascending allocation requests. + */ + __u32 i_next_alloc_block; + + /* + * i_next_alloc_goal is the *physical* companion to i_next_alloc_block. + * it the the physical block number of the block which was most-recently + * allocated to this file. This give us the goal (target) for the next + * allocation when we detect linearly ascending requests. + */ + __u32 i_next_alloc_goal; +#ifdef EXT3_PREALLOCATE + __u32 i_prealloc_block; + __u32 i_prealloc_count; +#endif + __u32 i_dir_start_lookup; +#ifdef CONFIG_EXT3_FS_XATTR + /* + * Extended attributes can be read independently of the main file + * data. Taking i_sem even when reading would cause contention + * between readers of EAs and writers of regular file data, so + * instead we synchronize on xattr_sem when reading or changing + * EAs. + */ + struct rw_semaphore xattr_sem; +#endif +#ifdef CONFIG_EXT3_FS_POSIX_ACL + struct posix_acl *i_acl; + struct posix_acl *i_default_acl; +#endif + + struct list_head i_orphan; /* unlinked but open inodes */ + + /* + * i_disksize keeps track of what the inode size is ON DISK, not + * in memory. During truncate, i_size is set to the new size by + * the VFS prior to calling ext3_truncate(), but the filesystem won't + * set i_disksize to 0 until the truncate is actually under way. + * + * The intent is that i_disksize always represents the blocks which + * are used by this file. This allows recovery to restart truncate + * on orphans if we crash during truncate. We actually write i_disksize + * into the on-disk inode when writing inodes out, instead of i_size. + * + * The only time when i_disksize and i_size may be different is when + * a truncate is in progress. The only things which change i_disksize + * are ext3_get_block (growth) and ext3_truncate (shrinkth). + */ + loff_t i_disksize; + + /* + * truncate_sem is for serialising ext3_truncate() against + * ext3_getblock(). In the 2.4 ext2 design, great chunks of inode's + * data tree are chopped off during truncate. We can't do that in + * ext3 because whenever we perform intermediate commits during + * truncate, the inode and all the metadata blocks *must* be in a + * consistent state which allows truncation of the orphans to restart + * during recovery. Hence we must fix the get_block-vs-truncate race + * by other means, so we have truncate_sem. + */ + struct semaphore truncate_sem; + struct inode vfs_inode; +}; + + /* * Called when we're dropping the last reference * to an inode. @@ -1073,6 +1163,15 @@ { struct super_operations *op = inode->i_sb->s_op; void (*drop)(struct inode *) = generic_drop_inode; + struct ext3_inode_info* ext3_i; + + if (!strcmp (inode->i_sb->s_type->name, "ext3")) { + // this is an ext3 inode. check if it is still orphan + ext3_i = container_of (inode, struct ext3_inode_info, vfs_inode); + if (!list_empty (&ext3_i->i_orphan)) { + panic ("iput_final: found inode in ext3's orphan list, am asked to drop it \n"); + } + } if (op && op->drop_inode) drop = op->drop_inode;