* [patch 12/16] fix race between writeback and unlink
@ 2002-06-01 8:43 Andrew Morton
2002-06-01 16:42 ` Linus Torvalds
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2002-06-01 8:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: lkml
Fixes a race between unlink and writeback: on the sys_sync() and
pdflush paths the caller does not have a reference against the inode.
So run __iget prior to dropping inode_lock.
Oleg Drokin reported this and seems to believe that it fixes the
crashes he was observing. But I was never able to reproduce them..
=====================================
--- 2.5.19/fs/fs-writeback.c~sync-race Sat Jun 1 01:18:12 2002
+++ 2.5.19-akpm/fs/fs-writeback.c Sat Jun 1 01:18:12 2002
@@ -245,17 +245,19 @@ static void sync_sb_inodes(struct super_
if ((sync_mode == WB_SYNC_LAST) && (head->prev == head))
really_sync = 1;
+ BUG_ON(inode->i_state & I_FREEING);
+ __iget(inode);
__writeback_single_inode(inode, really_sync, nr_to_write);
-
if (sync_mode == WB_SYNC_HOLD) {
mapping->dirtied_when = jiffies;
list_del(&inode->i_list);
list_add(&inode->i_list, &inode->i_sb->s_dirty);
}
-
if (current_is_pdflush())
writeback_release(bdi);
-
+ spin_unlock(&inode_lock);
+ iput(inode);
+ spin_lock(&inode_lock);
if (nr_to_write && *nr_to_write <= 0)
break;
}
-
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [patch 12/16] fix race between writeback and unlink 2002-06-01 8:43 [patch 12/16] fix race between writeback and unlink Andrew Morton @ 2002-06-01 16:42 ` Linus Torvalds 2002-06-01 19:19 ` Andrew Morton 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2002-06-01 16:42 UTC (permalink / raw) To: Andrew Morton; +Cc: lkml On Sat, 1 Jun 2002, Andrew Morton wrote: > > So run __iget prior to dropping inode_lock. This part looks horrible: + spin_unlock(&inode_lock); + iput(inode); + spin_lock(&inode_lock); Why not just split up the code inside iput(), and then just do if (atomic_dec(&inode->i_count)) final_iput(inode); where final_iput() _wants_ the spinlock held already? That's basically what "iput()" will end up doing, except for that "put_inode()" thing, which is just a horrible hack anyway. So get rid of "put_inode()", and replace it with a new one that takes the place of the if (!inode->i_nlink) { ... delete .. } else { .. free .. } and makes that one be a "i_op->drop_inode" thing that defaults to the current "delete if i_nlink is zero, free it if i_nlink is not zero and nobody uses it". The general VFS layer really shouldn't have assigned that strogn a meaning to "i_nlink" anyway, it's not for the VFS layer to decide (and it only causes problems for any non-UNIX-on-a-disk filesystems). Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-01 16:42 ` Linus Torvalds @ 2002-06-01 19:19 ` Andrew Morton 2002-06-01 20:04 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Andrew Morton @ 2002-06-01 19:19 UTC (permalink / raw) To: Linus Torvalds; +Cc: lkml Linus Torvalds wrote: > > On Sat, 1 Jun 2002, Andrew Morton wrote: > > > > So run __iget prior to dropping inode_lock. > > This part looks horrible: > > + spin_unlock(&inode_lock); > + iput(inode); > + spin_lock(&inode_lock); Yup. The inode refcounting APIs are really awkward. Note how I recently added dopey code in ext2_put_inode() to only drop the prealloc window on the "final" iput(). > Why not just split up the code inside iput(), and then just do > > if (atomic_dec(&inode->i_count)) > final_iput(inode); > > where final_iput() _wants_ the spinlock held already? > > That's basically what "iput()" will end up doing, except for that > "put_inode()" thing, which is just a horrible hack anyway. > > So get rid of "put_inode()", and replace it with a new one that takes the > place of the > > if (!inode->i_nlink) { > ... delete .. > } else { > .. free .. > } > > and makes that one be a "i_op->drop_inode" thing that defaults to the > current "delete if i_nlink is zero, free it if i_nlink is not zero and > nobody uses it". > > The general VFS layer really shouldn't have assigned that strogn a meaning > to "i_nlink" anyway, it's not for the VFS layer to decide (and it only > causes problems for any non-UNIX-on-a-disk filesystems). > Yes, I suspect all the inode refcounting, locking, I_FREEING, I_LOCK, etc could do with a spring clean. Make it a bit more conventional. I'll discuss with Al when he resurfaces. - ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-01 19:19 ` Andrew Morton @ 2002-06-01 20:04 ` William Lee Irwin III 2002-06-01 22:25 ` Andrew Morton 2002-06-03 4:27 ` [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) Linus Torvalds 2002-06-03 22:10 ` [patch 12/16] fix race between writeback and unlink Chris Mason 2 siblings, 1 reply; 27+ messages in thread From: William Lee Irwin III @ 2002-06-01 20:04 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, lkml Linus Torvalds wrote: >> The general VFS layer really shouldn't have assigned that strogn a meaning >> to "i_nlink" anyway, it's not for the VFS layer to decide (and it only >> causes problems for any non-UNIX-on-a-disk filesystems). On Sat, Jun 01, 2002 at 12:19:36PM -0700, Andrew Morton wrote: > Yes, I suspect all the inode refcounting, locking, I_FREEING, I_LOCK, etc > could do with a spring clean. Make it a bit more conventional. I'll > discuss with Al when he resurfaces. I'm somewhat concerned about the protection of ->i_size, since that appears to be accessed in generic_file_read() without any protection against writers to the field. From a quick glance at current 2.5 (it looks like 2.4 has this too) it looks like it's written to by vmtruncate() through notify_change() with the ->i_sem and BKL held at the moment, but generic_file_read() doesn't take either before reading it, and there may be still other writers. I also don't see the anything like read_barrier_depends() for lockless algorithms or any atomic reads. Even on machines with extremely strong memory consistency models like i386, as loff_t is long long, it would seem possible to catch a partial update and see an entirely bogus ->i_size value. It also appears ->i_size is used to provide some protection against reads of truncated pages, which may be unreliable without some protection of ->i_size. If these issues are not what I believe them to be I would be more than happy to have these impressions corrected. Cheers, Bill ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-01 20:04 ` William Lee Irwin III @ 2002-06-01 22:25 ` Andrew Morton 0 siblings, 0 replies; 27+ messages in thread From: Andrew Morton @ 2002-06-01 22:25 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linus Torvalds, lkml William Lee Irwin III wrote: > > Linus Torvalds wrote: > >> The general VFS layer really shouldn't have assigned that strogn a meaning > >> to "i_nlink" anyway, it's not for the VFS layer to decide (and it only > >> causes problems for any non-UNIX-on-a-disk filesystems). > > On Sat, Jun 01, 2002 at 12:19:36PM -0700, Andrew Morton wrote: > > Yes, I suspect all the inode refcounting, locking, I_FREEING, I_LOCK, etc > > could do with a spring clean. Make it a bit more conventional. I'll > > discuss with Al when he resurfaces. > > I'm somewhat concerned about the protection of ->i_size, since that > appears to be accessed in generic_file_read() without any protection > against writers to the field. From a quick glance at current 2.5 (it > looks like 2.4 has this too) it looks like it's written to by > vmtruncate() through notify_change() with the ->i_sem and BKL held at > the moment, but generic_file_read() doesn't take either before reading > it, and there may be still other writers. truncate and write change i_size, under i_sem. The i_size test on the read path doesn't really need to be there, I suspect. It handles the window where i_size has been decreased by truncate but the filesystem hasn't finished truncating the blocks yet. It also optimises reads outside the end of file - no point in calling into the filesystem to try to map blocks which aren't there. > I also don't see the anything > like read_barrier_depends() for lockless algorithms or any atomic reads. > Even on machines with extremely strong memory consistency models like > i386, as loff_t is long long, it would seem possible to catch a partial > update and see an entirely bogus ->i_size value. That's true. sys_stat() also could see a confusing intermediate value. A while back Ingo and Linus were tossing around possible solutions to this based on x86 compare-and-exchange operations, but nothing conclusive came out of it. It's a "known bug". - ^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) 2002-06-01 19:19 ` Andrew Morton 2002-06-01 20:04 ` William Lee Irwin III @ 2002-06-03 4:27 ` Linus Torvalds 2002-06-03 16:26 ` Andreas Dilger 2002-06-03 19:09 ` Chris Mason 2002-06-03 22:10 ` [patch 12/16] fix race between writeback and unlink Chris Mason 2 siblings, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2002-06-03 4:27 UTC (permalink / raw) To: Andrew Morton, Alexander Viro; +Cc: lkml On Sat, 1 Jun 2002, Andrew Morton wrote: > > > Why not just split up the code inside iput(), and then just do > > > > if (atomic_dec(&inode->i_count)) > > final_iput(inode); > > Yes, I suspect all the inode refcounting, locking, I_FREEING, I_LOCK, etc > could do with a spring clean. Make it a bit more conventional. I'll > discuss with Al when he resurfaces. This is a first cut at cleaning up "iput()" and getting rid of some of the magic VFS-level behaviour of the i_nlink field which many filesystems do not actually want - as shown by the number of "force_delete" users out there. It does not change any real behaviour, but it splits up the "iput()" behaviour into several functions ("common_delete_inode()", "common_forget_inode()" and "common_drop_inode()"), and adds a place for a low-level filesystem to hook into the behaviour at inode drop time, through the "drop_inode" superblock operation. This part of the diff probably explains it best + drop_inode: called when the last access to the inode is dropped, + with the inode_lock spinlock held. + + This method should be either NULL (normal unix filesystem + semantics) or "common_delete_inode" (for filesystems that do not + want to cache inodes - causing "delete_inode" to always be + called regardless of the value of i_nlink) + + The "common_delete_inode()" behaviour is equivalent to the + old practice of using "force_delete" in the put_inode() case, + but does not have the races that the "force_delete()" approach + had. This removes _most_ of the "put_inode()" users, although filesystems can (and do) still use it for other things - notably to drop pre-allocations etc. Comments? Linus ---- ===== Documentation/filesystems/Locking 1.29 vs edited ===== --- 1.29/Documentation/filesystems/Locking Fri May 31 18:18:14 2002 +++ edited/Documentation/filesystems/Locking Sun Jun 2 20:58:05 2002 @@ -88,6 +88,7 @@ void (*read_inode) (struct inode *); void (*write_inode) (struct inode *, int); void (*put_inode) (struct inode *); + void (*drop_inode) (struct inode *); void (*delete_inode) (struct inode *); void (*put_super) (struct super_block *); void (*write_super) (struct super_block *); @@ -102,6 +103,7 @@ read_inode: yes (see below) write_inode: no put_inode: no +drop_inode: no !!!inode_lock!!! delete_inode: no clear_inode: no put_super: yes yes maybe (see below) ===== Documentation/filesystems/vfs.txt 1.1 vs edited ===== --- 1.1/Documentation/filesystems/vfs.txt Tue Feb 5 09:40:36 2002 +++ edited/Documentation/filesystems/vfs.txt Sun Jun 2 21:17:23 2002 @@ -178,6 +178,7 @@ void (*read_inode) (struct inode *); void (*write_inode) (struct inode *, int); void (*put_inode) (struct inode *); + void (*drop_inode) (struct inode *); void (*delete_inode) (struct inode *); int (*notify_change) (struct dentry *, struct iattr *); void (*put_super) (struct super_block *); @@ -203,6 +204,19 @@ put_inode: called when the VFS inode is removed from the inode cache. This method is optional + + drop_inode: called when the last access to the inode is dropped, + with the inode_lock spinlock held. + + This method should be either NULL (normal unix filesystem + semantics) or "common_delete_inode" (for filesystems that do not + want to cache inodes - causing "delete_inode" to always be + called regardless of the value of i_nlink) + + The "common_delete_inode()" behaviour is equivalent to the + old practice of using "force_delete" in the put_inode() case, + but does not have the races that the "force_delete()" approach + had. delete_inode: called when the VFS wants to delete an inode ===== drivers/hotplug/pci_hotplug_core.c 1.14 vs edited ===== --- 1.14/drivers/hotplug/pci_hotplug_core.c Thu May 9 13:44:41 2002 +++ edited/drivers/hotplug/pci_hotplug_core.c Sun Jun 2 20:12:36 2002 @@ -290,7 +290,7 @@ static struct super_operations pcihpfs_ops = { statfs: simple_statfs, - put_inode: force_delete, + drop_inode: common_delete_inode, }; static int pcihpfs_fill_super(struct super_block *sb, void *data, int silent) ===== drivers/usb/core/inode.c 1.31 vs edited ===== --- 1.31/drivers/usb/core/inode.c Wed May 22 09:25:48 2002 +++ edited/drivers/usb/core/inode.c Sun Jun 2 20:13:58 2002 @@ -298,7 +298,7 @@ static struct super_operations usbfs_ops = { statfs: simple_statfs, - put_inode: force_delete, + drop_inode: common_delete_inode, }; static int usbfs_fill_super(struct super_block *sb, void *data, int silent) ===== fs/binfmt_misc.c 1.13 vs edited ===== --- 1.13/fs/binfmt_misc.c Thu May 23 09:08:34 2002 +++ edited/fs/binfmt_misc.c Sun Jun 2 19:48:46 2002 @@ -621,7 +621,7 @@ static struct super_operations s_ops = { statfs: simple_statfs, - put_inode: force_delete, + drop_inode: common_delete_inode, clear_inode: bm_clear_inode, }; ===== fs/inode.c 1.60 vs edited ===== --- 1.60/fs/inode.c Fri May 31 18:18:09 2002 +++ edited/fs/inode.c Sun Jun 2 20:11:02 2002 @@ -782,6 +782,97 @@ spin_unlock(&inode_lock); } +void common_delete_inode(struct inode *inode) +{ + struct super_operations *op = inode->i_sb->s_op; + + list_del(&inode->i_hash); + INIT_LIST_HEAD(&inode->i_hash); + list_del(&inode->i_list); + INIT_LIST_HEAD(&inode->i_list); + inode->i_state|=I_FREEING; + inodes_stat.nr_inodes--; + spin_unlock(&inode_lock); + + if (inode->i_data.nrpages) + truncate_inode_pages(&inode->i_data, 0); + + if (op && op->delete_inode) { + void (*delete)(struct inode *) = op->delete_inode; + if (!is_bad_inode(inode)) + DQUOT_INIT(inode); + /* s_op->delete_inode internally recalls clear_inode() */ + delete(inode); + } else + clear_inode(inode); + if (inode->i_state != I_CLEAR) + BUG(); + destroy_inode(inode); +} +EXPORT_SYMBOL(common_delete_inode); + +static void common_forget_inode(struct inode *inode) +{ + struct super_block *sb = inode->i_sb; + + if (!list_empty(&inode->i_hash)) { + if (!(inode->i_state & (I_DIRTY|I_LOCK))) { + list_del(&inode->i_list); + list_add(&inode->i_list, &inode_unused); + } + inodes_stat.nr_unused++; + spin_unlock(&inode_lock); + if (!sb || (sb->s_flags & MS_ACTIVE)) + return; + write_inode_now(inode, 1); + spin_lock(&inode_lock); + inodes_stat.nr_unused--; + list_del_init(&inode->i_hash); + } + list_del_init(&inode->i_list); + inode->i_state|=I_FREEING; + inodes_stat.nr_inodes--; + spin_unlock(&inode_lock); + if (inode->i_data.nrpages) + truncate_inode_pages(&inode->i_data, 0); + clear_inode(inode); + destroy_inode(inode); +} + +/* + * Normal UNIX filesystem behaviour: delete the + * inode when the usage count drops to zero, and + * i_nlink is zero. + */ +static void common_drop_inode(struct inode *inode) +{ + if (!inode->i_nlink) + common_delete_inode(inode); + else + common_forget_inode(inode); +} + +/* + * Called when we're dropping the last reference + * to an inode. + * + * Call the FS "drop()" function, defaulting to + * the legacy UNIX filesystem behaviour.. + * + * NOTE! NOTE! NOTE! We're called with the inode lock + * held, and the drop function is supposed to release + * the lock! + */ +static inline void iput_final(struct inode *inode) +{ + struct super_operations *op = inode->i_sb->s_op; + void (*drop)(struct inode *) = common_drop_inode; + + if (op && op->drop_inode) + drop = op->drop_inode; + drop(inode); +} + /** * iput - put an inode * @inode: inode to put @@ -793,77 +884,17 @@ void iput(struct inode *inode) { if (inode) { - struct super_block *sb = inode->i_sb; - struct super_operations *op = NULL; + struct super_operations *op = inode->i_sb->s_op; if (inode->i_state == I_CLEAR) BUG(); - if (sb && sb->s_op) - op = sb->s_op; if (op && op->put_inode) op->put_inode(inode); - if (!atomic_dec_and_lock(&inode->i_count, &inode_lock)) - return; - - if (!inode->i_nlink) { - list_del(&inode->i_hash); - INIT_LIST_HEAD(&inode->i_hash); - list_del(&inode->i_list); - INIT_LIST_HEAD(&inode->i_list); - inode->i_state|=I_FREEING; - inodes_stat.nr_inodes--; - spin_unlock(&inode_lock); - - if (inode->i_data.nrpages) - truncate_inode_pages(&inode->i_data, 0); - - if (op && op->delete_inode) { - void (*delete)(struct inode *) = op->delete_inode; - if (!is_bad_inode(inode)) - DQUOT_INIT(inode); - /* s_op->delete_inode internally recalls clear_inode() */ - delete(inode); - } else - clear_inode(inode); - if (inode->i_state != I_CLEAR) - BUG(); - } else { - if (!list_empty(&inode->i_hash)) { - if (!(inode->i_state & (I_DIRTY|I_LOCK))) { - list_del(&inode->i_list); - list_add(&inode->i_list, &inode_unused); - } - inodes_stat.nr_unused++; - spin_unlock(&inode_lock); - if (!sb || (sb->s_flags & MS_ACTIVE)) - return; - write_inode_now(inode, 1); - spin_lock(&inode_lock); - inodes_stat.nr_unused--; - list_del_init(&inode->i_hash); - } - list_del_init(&inode->i_list); - inode->i_state|=I_FREEING; - inodes_stat.nr_inodes--; - spin_unlock(&inode_lock); - if (inode->i_data.nrpages) - truncate_inode_pages(&inode->i_data, 0); - clear_inode(inode); - } - destroy_inode(inode); + if (atomic_dec_and_lock(&inode->i_count, &inode_lock)) + iput_final(inode); } -} - -void force_delete(struct inode *inode) -{ - /* - * Kill off unused inodes ... iput() will unhash and - * delete the inode if we set i_nlink to zero. - */ - if (atomic_read(&inode->i_count) == 1) - inode->i_nlink = 0; } /** ===== fs/devfs/base.c 1.39 vs edited ===== --- 1.39/fs/devfs/base.c Tue May 14 09:27:48 2002 +++ edited/fs/devfs/base.c Sun Jun 2 20:15:00 2002 @@ -2576,7 +2576,7 @@ static struct super_operations devfs_sops = { - put_inode: force_delete, + drop_inode: common_delete_inode, clear_inode: devfs_clear_inode, statfs: simple_statfs, }; ===== fs/driverfs/inode.c 1.22 vs edited ===== --- 1.22/fs/driverfs/inode.c Fri May 31 18:18:10 2002 +++ edited/fs/driverfs/inode.c Sun Jun 2 19:49:16 2002 @@ -442,7 +442,7 @@ static struct super_operations driverfs_ops = { statfs: simple_statfs, - put_inode: force_delete, + drop_inode: common_delete_inode, }; static int driverfs_fill_super(struct super_block *sb, void *data, int silent) ===== fs/ncpfs/inode.c 1.19 vs edited ===== --- 1.19/fs/ncpfs/inode.c Thu May 23 09:08:34 2002 +++ edited/fs/ncpfs/inode.c Sun Jun 2 20:14:34 2002 @@ -84,7 +84,7 @@ { alloc_inode: ncp_alloc_inode, destroy_inode: ncp_destroy_inode, - put_inode: force_delete, + drop_inode: common_delete_inode, delete_inode: ncp_delete_inode, put_super: ncp_put_super, statfs: ncp_statfs, ===== fs/proc/inode.c 1.13 vs edited ===== --- 1.13/fs/proc/inode.c Mon May 20 06:38:28 2002 +++ edited/fs/proc/inode.c Sun Jun 2 19:59:25 2002 @@ -121,7 +121,7 @@ alloc_inode: proc_alloc_inode, destroy_inode: proc_destroy_inode, read_inode: proc_read_inode, - put_inode: force_delete, + drop_inode: common_delete_inode, delete_inode: proc_delete_inode, statfs: simple_statfs, }; ===== fs/ramfs/inode.c 1.19 vs edited ===== --- 1.19/fs/ramfs/inode.c Fri May 31 18:18:10 2002 +++ edited/fs/ramfs/inode.c Sun Jun 2 19:59:56 2002 @@ -277,7 +277,7 @@ static struct super_operations ramfs_ops = { statfs: simple_statfs, - put_inode: force_delete, + drop_inode: common_delete_inode, }; static int ramfs_fill_super(struct super_block * sb, void * data, int silent) ===== fs/smbfs/inode.c 1.23 vs edited ===== --- 1.23/fs/smbfs/inode.c Thu May 23 09:08:34 2002 +++ edited/fs/smbfs/inode.c Sun Jun 2 20:14:19 2002 @@ -94,7 +94,7 @@ { alloc_inode: smb_alloc_inode, destroy_inode: smb_destroy_inode, - put_inode: force_delete, + drop_inode: common_delete_inode, delete_inode: smb_delete_inode, put_super: smb_put_super, statfs: smb_statfs, ===== include/linux/fs.h 1.131 vs edited ===== --- 1.131/include/linux/fs.h Fri May 31 18:18:14 2002 +++ edited/include/linux/fs.h Sun Jun 2 20:49:29 2002 @@ -800,6 +800,7 @@ void (*dirty_inode) (struct inode *); void (*write_inode) (struct inode *, int); void (*put_inode) (struct inode *); + void (*drop_inode) (struct inode *); void (*delete_inode) (struct inode *); void (*put_super) (struct super_block *); void (*write_super) (struct super_block *); @@ -1183,10 +1184,10 @@ extern void inode_init_once(struct inode *); extern void iput(struct inode *); -extern void force_delete(struct inode *); extern struct inode * igrab(struct inode *); extern ino_t iunique(struct super_block *, ino_t); extern int inode_needs_sync(struct inode *inode); +extern void common_delete_inode(struct inode *inode); extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *); extern struct inode * iget_locked(struct super_block *, unsigned long); ===== kernel/ksyms.c 1.95 vs edited ===== --- 1.95/kernel/ksyms.c Fri May 31 18:18:10 2002 +++ edited/kernel/ksyms.c Sun Jun 2 19:39:17 2002 @@ -140,7 +140,6 @@ EXPORT_SYMBOL(iunique); EXPORT_SYMBOL(iput); EXPORT_SYMBOL(inode_init_once); -EXPORT_SYMBOL(force_delete); EXPORT_SYMBOL(follow_up); EXPORT_SYMBOL(follow_down); EXPORT_SYMBOL(lookup_mnt); ===== mm/shmem.c 1.52 vs edited ===== --- 1.52/mm/shmem.c Fri May 31 18:18:13 2002 +++ edited/mm/shmem.c Sun Jun 2 19:46:02 2002 @@ -1483,7 +1483,7 @@ remount_fs: shmem_remount_fs, #endif delete_inode: shmem_delete_inode, - put_inode: force_delete, + drop_inode: common_delete_inode, put_super: shmem_put_super, }; ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) 2002-06-03 4:27 ` [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) Linus Torvalds @ 2002-06-03 16:26 ` Andreas Dilger 2002-06-03 16:47 ` Linus Torvalds 2002-06-03 19:09 ` Chris Mason 1 sibling, 1 reply; 27+ messages in thread From: Andreas Dilger @ 2002-06-03 16:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Alexander Viro, lkml On Jun 02, 2002 21:27 -0700, Linus Torvalds wrote: > This is a first cut at cleaning up "iput()" and getting rid of some of the > magic VFS-level behaviour of the i_nlink field which many filesystems do > not actually want - as shown by the number of "force_delete" users out > there. > > It does not change any real behaviour, but it splits up the "iput()" > behaviour into several functions ("common_delete_inode()", > "common_forget_inode()" and "common_drop_inode()"), and adds a place for a > low-level filesystem to hook into the behaviour at inode drop time, > through the "drop_inode" superblock operation. If I had one minor note it would be to rename "common_*()" to "generic_*()" to match the other VFS helper routines. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) 2002-06-03 16:26 ` Andreas Dilger @ 2002-06-03 16:47 ` Linus Torvalds 0 siblings, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2002-06-03 16:47 UTC (permalink / raw) To: Andreas Dilger; +Cc: Andrew Morton, Alexander Viro, lkml On Mon, 3 Jun 2002, Andreas Dilger wrote: > > If I had one minor note it would be to rename "common_*()" to "generic_*()" > to match the other VFS helper routines. Good point. Will do. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) 2002-06-03 4:27 ` [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) Linus Torvalds 2002-06-03 16:26 ` Andreas Dilger @ 2002-06-03 19:09 ` Chris Mason 2002-06-03 19:34 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Chris Mason @ 2002-06-03 19:09 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Alexander Viro, lkml On Mon, 2002-06-03 at 00:27, Linus Torvalds wrote: > > > On Sat, 1 Jun 2002, Andrew Morton wrote: > > > > > Why not just split up the code inside iput(), and then just do > > > > > > if (atomic_dec(&inode->i_count)) > > > final_iput(inode); > > > > Yes, I suspect all the inode refcounting, locking, I_FREEING, I_LOCK, etc > > could do with a spring clean. Make it a bit more conventional. I'll > > discuss with Al when he resurfaces. > > This is a first cut at cleaning up "iput()" and getting rid of some of the > magic VFS-level behaviour of the i_nlink field which many filesystems do > not actually want - as shown by the number of "force_delete" users out > there. > > It does not change any real behaviour, but it splits up the "iput()" > behaviour into several functions ("common_delete_inode()", > "common_forget_inode()" and "common_drop_inode()"), and adds a place for a > low-level filesystem to hook into the behaviour at inode drop time, > through the "drop_inode" superblock operation. Now that is kinda neat, calling it with the inode lock held lets me move some things out of reiserfs_file_release which need i_sem, and move them into a less expensive drop_inode call without grabbing the semaphore. -chris ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) 2002-06-03 19:09 ` Chris Mason @ 2002-06-03 19:34 ` Linus Torvalds 2002-06-03 19:49 ` Chris Mason 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2002-06-03 19:34 UTC (permalink / raw) To: Chris Mason; +Cc: Andrew Morton, Alexander Viro, lkml On 3 Jun 2002, Chris Mason wrote: > > Now that is kinda neat, calling it with the inode lock held lets me move > some things out of reiserfs_file_release which need i_sem, and move them > into a less expensive drop_inode call without grabbing the semaphore. CAREFUL! If you make real per-FS use of this, and aren't just using the standard ones, you need to be very very careful. In particular, you get called with the inode lock held, but you would have to drop the lock yourself after having removed the inode from the hash chains etc. I'd like people to avoid playing too many games in this area, the locking and the exact semantics of "drop_inode" are rather nasty. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) 2002-06-03 19:34 ` Linus Torvalds @ 2002-06-03 19:49 ` Chris Mason 2002-06-03 19:55 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Chris Mason @ 2002-06-03 19:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Alexander Viro, lkml On Mon, 2002-06-03 at 15:34, Linus Torvalds wrote: > > On 3 Jun 2002, Chris Mason wrote: > > > > Now that is kinda neat, calling it with the inode lock held lets me move > > some things out of reiserfs_file_release which need i_sem, and move them > > into a less expensive drop_inode call without grabbing the semaphore. > > CAREFUL! > > If you make real per-FS use of this, and aren't just using the standard > ones, you need to be very very careful. In particular, you get called with > the inode lock held, but you would have to drop the lock yourself after > having removed the inode from the hash chains etc. I'd like people to > avoid playing too many games in this area, the locking and the exact > semantics of "drop_inode" are rather nasty. Right, I don't want too much in there. There are a few things I need to do when I know nobody else is messing with the inode, and I'm using i_sem to provide that now. put_inode doesn't do what I need because knfsd might iget his way into the mess. I'm talking a very limited set of operations followed by calling the generic functions. I might not do it at all if I can't get them safe when called under the spin lock. -chris ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) 2002-06-03 19:49 ` Chris Mason @ 2002-06-03 19:55 ` Linus Torvalds 0 siblings, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2002-06-03 19:55 UTC (permalink / raw) To: Chris Mason; +Cc: Andrew Morton, Alexander Viro, lkml On 3 Jun 2002, Chris Mason wrote: > > I'm talking a very limited set of operations followed by calling the > generic functions. I might not do it at all if I can't get them safe > when called under the spin lock. Ok, that should be reasonably "portable" (ie it won't break horribly and silently in the future when something changes in inode-land). Just doing a few ops (knowing you're under the inode lock) and then calling "generic_drop_inode()" should be fine. [ Except right now only the "generic_delete_inode()" thing is exported, so you'd need to export the other generic stuff, but that was kind of my plan anyway, I just don't wan tto do it until there is some real need. ] Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-01 19:19 ` Andrew Morton 2002-06-01 20:04 ` William Lee Irwin III 2002-06-03 4:27 ` [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) Linus Torvalds @ 2002-06-03 22:10 ` Chris Mason 2002-06-03 22:19 ` Linus Torvalds 2 siblings, 1 reply; 27+ messages in thread From: Chris Mason @ 2002-06-03 22:10 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, lkml On Sat, 2002-06-01 at 15:19, Andrew Morton wrote: > Linus Torvalds wrote: > > > > On Sat, 1 Jun 2002, Andrew Morton wrote: > > > > > > So run __iget prior to dropping inode_lock. > > > > This part looks horrible: > > > > + spin_unlock(&inode_lock); > > + iput(inode); > > + spin_lock(&inode_lock); > > Yup. The inode refcounting APIs are really awkward. Note how I recently > added dopey code in ext2_put_inode() to only drop the prealloc window on > the "final" iput(). Hmmm, a quick glance makes the test in ext2_put_inode look unsafe. iput calls put_inode before decrementing i_count. So, nothing stops 5 iput callers from all deciding i_count > 2 and leaving the preallocation blocks hanging. Also, a knfsd triggered iget/iput pair should hit the same race with an put_inode call. Or am I missing something? -chris ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-03 22:10 ` [patch 12/16] fix race between writeback and unlink Chris Mason @ 2002-06-03 22:19 ` Linus Torvalds 2002-06-03 22:30 ` Andrew Morton 2002-06-03 22:36 ` Chris Mason 0 siblings, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2002-06-03 22:19 UTC (permalink / raw) To: Chris Mason; +Cc: Andrew Morton, lkml On 3 Jun 2002, Chris Mason wrote: > > Or am I missing something? No. I think that in the long run we really would want all of the writeback preallocation should happen in the "struct file", not in "struct inode". And they should be released at file close ("release()"), not at iput() time. I _think_ that right now nfsd doesn't cache file opens (only inodes), so this could be a performance issue for nfsd, but it might be possible to change how nfsd acts. And it would be a _lot_ cleaner to do it at the file level. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-03 22:19 ` Linus Torvalds @ 2002-06-03 22:30 ` Andrew Morton 2002-06-04 18:47 ` Linus Torvalds 2002-06-03 22:36 ` Chris Mason 1 sibling, 1 reply; 27+ messages in thread From: Andrew Morton @ 2002-06-03 22:30 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Mason, lkml Linus Torvalds wrote: > > On 3 Jun 2002, Chris Mason wrote: > > > > Or am I missing something? > > No. I think that in the long run we really would want all of the writeback > preallocation should happen in the "struct file", not in "struct inode". > And they should be released at file close ("release()"), not at iput() > time. That would be a lot nicer. But why does ext2_put_inode() even exist? We're already throwing away the prealloc window in ext2_release_file? I guess for shared mappings over spares files: if all file handles have closed off, we still need to make allocations against that inode, yes? - ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-03 22:30 ` Andrew Morton @ 2002-06-04 18:47 ` Linus Torvalds 2002-06-04 20:15 ` Andrew Morton 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2002-06-04 18:47 UTC (permalink / raw) To: Andrew Morton; +Cc: Chris Mason, lkml On Mon, 3 Jun 2002, Andrew Morton wrote: > > But why does ext2_put_inode() even exist? We're already throwing > away the prealloc window in ext2_release_file? I guess for > shared mappings over spares files: if all file handles have > closed off, we still need to make allocations against that > inode, yes? Shared mappings still hold the "struct file" open (you have "vma->vm_file->f_dentry->d_inode"), so you still have the file handle while the mapping is open. I assume that the reason is that _any_ block allocation will trigegr pre-alloc, which means that we have preallocation for things like directories etc too - which really do not have a "struct file" associated with them. Whether that is really worth it is unclear, but it also means that ext2 doesn't have to pass down the "struct file" to the lower levels at all, as it keeps all pre-alloc stuff in the inode. On the whole it's probably a mistake, but my point is that it's likely a mistake that is hard to fix. Which is why I didn't even try to fix ext2 to not use "put_inode" and the prealloc dropping there.. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-04 18:47 ` Linus Torvalds @ 2002-06-04 20:15 ` Andrew Morton 2002-06-04 20:23 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Andrew Morton @ 2002-06-04 20:15 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Mason, lkml Linus Torvalds wrote: > > On Mon, 3 Jun 2002, Andrew Morton wrote: > > > > But why does ext2_put_inode() even exist? We're already throwing > > away the prealloc window in ext2_release_file? I guess for > > shared mappings over spares files: if all file handles have > > closed off, we still need to make allocations against that > > inode, yes? > > Shared mappings still hold the "struct file" open (you have > "vma->vm_file->f_dentry->d_inode"), so you still have the file handle > while the mapping is open. But after the vma has been destroyed (the application did exit()), the dirty pagecache data against the sparse mapped file still hasn't been written, and still doesn't have a disk mapping. So in this case, we have an inode which still has pending block allocations which has no struct file's pointing at it. Or am I wrong? > I assume that the reason is that _any_ block allocation will trigegr > pre-alloc, which means that we have preallocation for things like > directories etc too - which really do not have a "struct file" associated > with them. > > Whether that is really worth it is unclear, but it also means that ext2 > doesn't have to pass down the "struct file" to the lower levels at all, as > it keeps all pre-alloc stuff in the inode. The current preallocation will screw up is when there's a large-and-sparse file which has blocks being allocated against it at two or more offsets. And those allocations are for a large number of blocks, and they are proceeding slowly. Something like: for (i = 0; i < 16; i++) { pwrite(fd, buf, 4096, offset1 + i * 4096); pwrite(fd, buf, 4096, offset2 + i * 4096); } this would cause preallocation thrashing, and those blocks would be laid out as ABABABAB. The same would happen if the prealloc information is per-file as well, of course. The app would have to be using different file descriptors to avoid it. So from a file-layout point of view, I don't see a lot of benefit in moving the preallocation info into struct file. (Delayed allocation plus writeback-time sorting would fix it all up. With delayed allocate, the prealloc code can visit the bitbucket). - ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-04 20:15 ` Andrew Morton @ 2002-06-04 20:23 ` Linus Torvalds 2002-06-04 20:40 ` Andrew Morton 2002-06-04 22:05 ` Craig Milo Rogers 0 siblings, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2002-06-04 20:23 UTC (permalink / raw) To: Andrew Morton; +Cc: Chris Mason, lkml On Tue, 4 Jun 2002, Andrew Morton wrote: > > But after the vma has been destroyed (the application did exit()), > the dirty pagecache data against the sparse mapped file still > hasn't been written, and still doesn't have a disk mapping. > > So in this case, we have an inode which still has pending block > allocations which has no struct file's pointing at it. Or > am I wrong? I think you're right.. > The current preallocation will screw up is when there's a > large-and-sparse file which has blocks being allocated against it > at two or more offsets. And those allocations are for a large number > of blocks, and they are proceeding slowly. Sure. However, I don't think it should come as any surprise to anybody that trying to write to two different points in the same file is a bad idea. _regardless_ of whether you do pre-allocation or not, and whether the pre-allocation is on the inode or file level. I'd still love to see a "fast and slightly stupid" allocator for both blocks and inodes, and have some infrastructure to do run-time defragging in the background. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-04 20:23 ` Linus Torvalds @ 2002-06-04 20:40 ` Andrew Morton 2002-06-04 21:37 ` Linus Torvalds 2002-06-04 22:05 ` Craig Milo Rogers 1 sibling, 1 reply; 27+ messages in thread From: Andrew Morton @ 2002-06-04 20:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Mason, lkml Linus Torvalds wrote: > > ... > I'd still love to see a "fast and slightly stupid" allocator for both > blocks and inodes, and have some infrastructure to do run-time defragging > in the background. > I think runtime defrag could yield really good benefits. In particular it would allow us to find_group_dir(), and always put directories in the same blockgroup as their parent (big speedups for the `untar-a-kernel-tree' workload). There's a patch at http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre10/ext3-reloc-page.patch which provides a simple `relocate page' ioctl for ext3 files. It relocates a page's blocks. The operation is fully journalled and pagecache-coherent. So you can turn off the power in the middle of a defrag operation and the fs will come back just fine. It doesn't make any attempt to relocate inodes. If the page relocation attempt fails then it just returns -EAGAIN and userspace gets to worry about what to do. I simply have not had the time to do anything about the userspace program which drives that ioctl. So if there's anyone out there who has a little time on their hands... - ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-04 20:40 ` Andrew Morton @ 2002-06-04 21:37 ` Linus Torvalds 2002-06-04 22:04 ` Benjamin LaHaise ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Linus Torvalds @ 2002-06-04 21:37 UTC (permalink / raw) To: Andrew Morton; +Cc: Chris Mason, lkml On Tue, 4 Jun 2002, Andrew Morton wrote: > > There's a patch at > http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre10/ext3-reloc-page.patch > which provides a simple `relocate page' ioctl for ext3 files. That's a good start, but before even egtting that far there is some need for a way to get a picture of the FS layout in a reasonably fs-independent way. Sure, bmap() actually does part of this (the "where are my blocks" part), but right now there is no way to query the FS for the "where can I put blocks" part. You can do it with direct disk access and knowledge of the FS internals, but it should not be all that hard to add some simple interface to get a "block usage byte array" kind of thing (more efficient than doing bmap on all files, _and_ can tell about blocks reserved for inodes, superblocks and other special uses), which together with a user-level interface to "preallocate" and your "relocate page" should actually make it possible to make a fairly FS-independent defragmenter. Add a nice graphical front-end, and you can make it a useful screen-saver. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-04 21:37 ` Linus Torvalds @ 2002-06-04 22:04 ` Benjamin LaHaise 2002-06-04 22:08 ` Andrew Morton 2002-07-07 20:38 ` Riley Williams 2 siblings, 0 replies; 27+ messages in thread From: Benjamin LaHaise @ 2002-06-04 22:04 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Chris Mason, lkml On Tue, Jun 04, 2002 at 02:37:26PM -0700, Linus Torvalds wrote: > That's a good start, but before even egtting that far there is some need > for a way to get a picture of the FS layout in a reasonably fs-independent > way. Al Viro actually came up with a beautiful mechanism for it: present the filesystem's metadata as a filesystem itself. Blocks within a particular inode can then be viewed with cat, and suggested replacements can be made via echo. Such a best could be called ext2metafs and work cooperatively with the existing filesystem code. It's quite an elegant design that has utility beyond the simple case of defragmenting. > Add a nice graphical front-end, and you can make it a useful screen-saver. Heh, that would be excellent. -ben -- "You will be reincarnated as a toad; and you will be much happier." ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-04 21:37 ` Linus Torvalds 2002-06-04 22:04 ` Benjamin LaHaise @ 2002-06-04 22:08 ` Andrew Morton 2002-07-07 20:38 ` Riley Williams 2 siblings, 0 replies; 27+ messages in thread From: Andrew Morton @ 2002-06-04 22:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Mason, lkml Linus Torvalds wrote: > > On Tue, 4 Jun 2002, Andrew Morton wrote: > > > > There's a patch at > > http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre10/ext3-reloc-page.patch > > which provides a simple `relocate page' ioctl for ext3 files. > > That's a good start, but before even egtting that far there is some need > for a way to get a picture of the FS layout in a reasonably fs-independent > way. > > Sure, bmap() actually does part of this (the "where are my blocks" part), > but right now there is no way to query the FS for the "where can I put > blocks" part. Jeff Garzik was working on that a while back - a separate filesystem which provides a "metadata view" of a real filesytem. So you can poke around and find all these things out. In theory, different filesystems should be able to offer the same view. > You can do it with direct disk access and knowledge of the FS internals, > but it should not be all that hard to add some simple interface to get a > "block usage byte array" kind of thing (more efficient than doing bmap on > all files, _and_ can tell about blocks reserved for inodes, superblocks > and other special uses), which together with a user-level interface to > "preallocate" and your "relocate page" should actually make it possible to > make a fairly FS-independent defragmenter. The e2fsprogs package includes a `libe2fs' library which offers APIs for accessing the fs internals. It's exactly what you say - direct disk access and knowledge of internals. So that plus the try_to_relocate_page() ioctl is a shortest-path route to a defragmenter for ext3, and only ext3. I wasn't aiming very high here ;) A totally different way of performing defrag could be to copy the entire fs from one partition to a different one, with kernel support for providing coherency while the copy is in progress. It's basically a union/translucent mount with COW. Swizzle the backing blockdev, drop the disk mappings from all incore pages, renumber the inode without breaking stuff... (OK, I've talked myself out of it ;/) It's not super efficient, and it does require the provisioning of a bounce disk, but it would use infrastructure which would be useful for other stuff and it is fs-agnostic. - ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-04 21:37 ` Linus Torvalds 2002-06-04 22:04 ` Benjamin LaHaise 2002-06-04 22:08 ` Andrew Morton @ 2002-07-07 20:38 ` Riley Williams 2 siblings, 0 replies; 27+ messages in thread From: Riley Williams @ 2002-07-07 20:38 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel Hi Linus. > You can do it with direct disk access and knowledge of the FS > internals, but it should not be all that hard to add some simple > interface to get a "block usage byte array" kind of thing (more > efficient than doing bmap on all files, _and_ can tell about blocks > reserved for inodes, superblocks and other special uses), which > together with a user-level interface to "preallocate" and your > "relocate page" should actually make it possible to make a fairly > FS-independent defragmenter. Would I be right in thinking that this would be an array with three possible values for each element... FIXED Block not movable. MOVABLE Block in use and movable. UNUSED Block not in use. ...with the defragmenter basically exchanging MOVABLE and UNUSED blocks to get the files in an unfragmented state. The FIXED value would handle things like the superblock, inode blocks and other special uses, and all other blocks would be either MOVABLE or UNUSED as appropriate. Another option might be to split MOVABLE into DIRECTORY and FILE values instead, but whether that would be useful is questionable at best. > Add a nice graphical front-end, and you can make it a useful > screen-saver. As long as it runs on systems without X-windows available rather than being limited to only run under X... My preference would be for a tool similar to `make menuconfig` for this sort of utility. Best wishes from Riley. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-04 20:23 ` Linus Torvalds 2002-06-04 20:40 ` Andrew Morton @ 2002-06-04 22:05 ` Craig Milo Rogers 2002-06-04 22:08 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Craig Milo Rogers @ 2002-06-04 22:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Chris Mason, lkml >Sure. However, I don't think it should come as any surprise to anybody >that trying to write to two different points in the same file is a bad >idea. Databases? Craig Milo Rogers ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-04 22:05 ` Craig Milo Rogers @ 2002-06-04 22:08 ` Linus Torvalds 0 siblings, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2002-06-04 22:08 UTC (permalink / raw) To: Craig Milo Rogers; +Cc: Andrew Morton, Chris Mason, lkml On Tue, 4 Jun 2002, Craig Milo Rogers wrote: > > >Sure. However, I don't think it should come as any surprise to anybody > >that trying to write to two different points in the same file is a bad > >idea. > > Databases? They uniformly (as far as I know) preallocate all the data blocks. Exactly to avoid the block layout issue - they want the data blocks as contiguous as possible. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-03 22:19 ` Linus Torvalds 2002-06-03 22:30 ` Andrew Morton @ 2002-06-03 22:36 ` Chris Mason 2002-06-03 22:47 ` Andrew Morton 1 sibling, 1 reply; 27+ messages in thread From: Chris Mason @ 2002-06-03 22:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, lkml On Mon, 2002-06-03 at 18:19, Linus Torvalds wrote: > > > On 3 Jun 2002, Chris Mason wrote: > > > > Or am I missing something? > > No. I think that in the long run we really would want all of the writeback > preallocation should happen in the "struct file", not in "struct inode". > And they should be released at file close ("release()"), not at iput() > time. reiserfs does preallocation and tail packing in release() right now, but do we really need preallocation at all once delayed allocation is stable? -chris ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 12/16] fix race between writeback and unlink 2002-06-03 22:36 ` Chris Mason @ 2002-06-03 22:47 ` Andrew Morton 0 siblings, 0 replies; 27+ messages in thread From: Andrew Morton @ 2002-06-03 22:47 UTC (permalink / raw) To: Chris Mason; +Cc: Linus Torvalds, lkml Chris Mason wrote: > > On Mon, 2002-06-03 at 18:19, Linus Torvalds wrote: > > > > > > On 3 Jun 2002, Chris Mason wrote: > > > > > > Or am I missing something? > > > > No. I think that in the long run we really would want all of the writeback > > preallocation should happen in the "struct file", not in "struct inode". > > And they should be released at file close ("release()"), not at iput() > > time. > > reiserfs does preallocation and tail packing in release() right now, but > do we really need preallocation at all once delayed allocation is > stable? > well.. Will delayed allocation even exist? Grafting reservations onto ext2 had a few awkward moments in the area of handling ENOSPC, and generally a lot of the benefits of that code (ie: avoiding buffers) have been whittled away by other means. So right now I don't see a strong reason for adding delayed allocation support into the core kernel for ext2. If another fs comes up and creates a demand for core kernel support then fine. (Thinking XFS here). Probably I should resurrect the ext2 code as something for you and Steve to poke at. Plus ext3 needs prealloc as well. Performing that within the live bitmaps like ext2 is really awkward. I have half-a-patch which performs area reservations in ext3 via a per-private-inode "start, length" tuple. These zones never get near being written to disk. That seems a reasonable approach for ext3, but it does use the inode. If that info was inside struct file, writepage() would get rather confused - it doesn't have a file*. - ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2002-07-07 20:36 UTC | newest] Thread overview: 27+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2002-06-01 8:43 [patch 12/16] fix race between writeback and unlink Andrew Morton 2002-06-01 16:42 ` Linus Torvalds 2002-06-01 19:19 ` Andrew Morton 2002-06-01 20:04 ` William Lee Irwin III 2002-06-01 22:25 ` Andrew Morton 2002-06-03 4:27 ` [RFC] iput() cleanup (was Re: [patch 12/16] fix race between writeback and unlink) Linus Torvalds 2002-06-03 16:26 ` Andreas Dilger 2002-06-03 16:47 ` Linus Torvalds 2002-06-03 19:09 ` Chris Mason 2002-06-03 19:34 ` Linus Torvalds 2002-06-03 19:49 ` Chris Mason 2002-06-03 19:55 ` Linus Torvalds 2002-06-03 22:10 ` [patch 12/16] fix race between writeback and unlink Chris Mason 2002-06-03 22:19 ` Linus Torvalds 2002-06-03 22:30 ` Andrew Morton 2002-06-04 18:47 ` Linus Torvalds 2002-06-04 20:15 ` Andrew Morton 2002-06-04 20:23 ` Linus Torvalds 2002-06-04 20:40 ` Andrew Morton 2002-06-04 21:37 ` Linus Torvalds 2002-06-04 22:04 ` Benjamin LaHaise 2002-06-04 22:08 ` Andrew Morton 2002-07-07 20:38 ` Riley Williams 2002-06-04 22:05 ` Craig Milo Rogers 2002-06-04 22:08 ` Linus Torvalds 2002-06-03 22:36 ` Chris Mason 2002-06-03 22:47 ` Andrew Morton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox