* [PATCH 0/7] vfs: notify_changes() error handling @ 2010-02-19 19:47 Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 1/7] mm: add nofail version of vmtruncate() and inode_setattr() Dmitry Monakhov 2010-02-22 10:35 ` [PATCH 0/7] vfs: notify_changes() error handling Jan Kara 0 siblings, 2 replies; 13+ messages in thread From: Dmitry Monakhov @ 2010-02-19 19:47 UTC (permalink / raw) To: linux-kernel; +Cc: linux-fsdevel, Dmitry Monakhov Current inode attr setting path looks like follows ret = inode_change_ok() if(ret) goto out; /* perform private-fs specific logic here */ if(ia_valid & ATTR_UID || ...) ret = vfs_dq_transfer() /* more private-fs specific logic for example update on_disk data structures. */ ret = inode_setattr() In fact inode_setattr() call vmtruncate() which may fail in number of reasons IS_SWAPFILE, RLIMIT_FSIZE. After this many filesystem is unable to rollback changes. And just live inode in inconsistent state. We may check IS_SWAPFILE at the very beginning(currently it is not checked), but RLIMIT_FSIZE may changed under our feet. In order make things straight. Let's divide vmtruncate() in to two parts which perform all checks, and second which can not fail. After this notify_change() perform all necessary checks inside inode_change_ok() and simply call nofail version of vmtruncate(). ##TEST_CASE_BEGIN: # dd if=/dev/zero of=/mnt/file bs=1M count=10 # mkswap /mnt/file # swapon /mnt/file # ./truncate /mnt/file 0 # sync # echo "stat file" | debugfs /dev/sda9 debugfs 1.41.9 (22-Aug-2009) Inode: 2746 Type: regular Mode: 0644 Flags: 0x80000 Generation: 571328532 Version: 0x00000000:00000001 User: 0 Group: 0 Size: 0 ##### ON_DISK_SIZE IS ZERO ^^^^^^^^^^^^ File ACL: 0 Directory ACL: 0 Links: 1 Blockcount: 20480 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x4b7ee6cc:d4bbee04 -- Fri Feb 19 22:30:20 2010 atime: 0x4b7ee6d3:91fb9b9c -- Fri Feb 19 22:30:27 2010 mtime: 0x4b7ee6cc:d4bbee04 -- Fri Feb 19 22:30:20 2010 crtime: 0x4b7eb3c4:01f8d494 -- Fri Feb 19 18:52:36 2010 dtime: 0x00000aba -- Thu Jan 1 03:45:46 1970 Size of extra inode fields: 28 EXTENTS: (0-2559): 112640-115199 ##TEST_CASE_END ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/7] mm: add nofail version of vmtruncate() and inode_setattr() 2010-02-19 19:47 [PATCH 0/7] vfs: notify_changes() error handling Dmitry Monakhov @ 2010-02-19 19:47 ` Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 2/7] vfs: inode_change_ok have to perform all necessery checks Dmitry Monakhov 2010-02-22 10:35 ` [PATCH 0/7] vfs: notify_changes() error handling Jan Kara 1 sibling, 1 reply; 13+ messages in thread From: Dmitry Monakhov @ 2010-02-19 19:47 UTC (permalink / raw) To: linux-kernel; +Cc: linux-fsdevel, Dmitry Monakhov vmtrucate may fail due to IS_SWAPFILE flag or due to RLIMIT_FSIZE. In some situations it is not acceptable behaviour. Off course we can check IS_SWAPFILE before but what about RLIMIT_FSIZE? Let's divide vmtruncate in two parts. One which perform necessery checks and second(__vmtruncate) which can not fail. In this case caller is responsible for necessery checks before calling no_fail vmtruncate version. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/attr.c | 22 ++++++++++++++++------ include/linux/fs.h | 1 + include/linux/mm.h | 1 + mm/truncate.c | 24 +++++++++++++++++------- 4 files changed, 35 insertions(+), 13 deletions(-) diff --git a/fs/attr.c b/fs/attr.c index 96d394b..b1cc391 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -105,16 +105,13 @@ out_big: } EXPORT_SYMBOL(inode_newsize_ok); -int inode_setattr(struct inode * inode, struct iattr * attr) +void __inode_setattr(struct inode * inode, struct iattr * attr) { unsigned int ia_valid = attr->ia_valid; if (ia_valid & ATTR_SIZE && - attr->ia_size != i_size_read(inode)) { - int error = vmtruncate(inode, attr->ia_size); - if (error) - return error; - } + attr->ia_size != i_size_read(inode)) + __vmtruncate(inode, attr->ia_size); if (ia_valid & ATTR_UID) inode->i_uid = attr->ia_uid; @@ -138,6 +135,19 @@ int inode_setattr(struct inode * inode, struct iattr * attr) } mark_inode_dirty(inode); +} +EXPORT_SYMBOL(__inode_setattr); + +int inode_setattr(struct inode * inode, struct iattr * attr) +{ + int error; + if (attr->ia_valid & ATTR_SIZE && + attr->ia_size != i_size_read(inode)) { + error = inode_newsize_ok(inode, attr->ia_size); + if (error) + return error; + } + __inode_setattr(inode, attr); return 0; } EXPORT_SYMBOL(inode_setattr); diff --git a/include/linux/fs.h b/include/linux/fs.h index 9147ca8..18b7be8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2373,6 +2373,7 @@ extern int buffer_migrate_page(struct address_space *, extern int inode_change_ok(const struct inode *, struct iattr *); extern int inode_newsize_ok(const struct inode *, loff_t offset); +extern void __inode_setattr(struct inode *, struct iattr *); extern int __must_check inode_setattr(struct inode *, struct iattr *); extern void file_update_time(struct file *file); diff --git a/include/linux/mm.h b/include/linux/mm.h index 2265f28..d9fd667 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -808,6 +808,7 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping, } extern void truncate_pagecache(struct inode *inode, loff_t old, loff_t new); +extern void __vmtruncate(struct inode *inode, loff_t offset); extern int vmtruncate(struct inode *inode, loff_t offset); extern int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end); diff --git a/mm/truncate.c b/mm/truncate.c index 342deee..9253a5c 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -542,7 +542,8 @@ void truncate_pagecache(struct inode *inode, loff_t old, loff_t new) EXPORT_SYMBOL(truncate_pagecache); /** - * vmtruncate - unmap mappings "freed" by truncate() syscall + * __vmtruncate - unmap mappings "freed" by truncate() syscall + * Caller must perform necessary checks before. * @inode: inode of the file used * @offset: file offset to start truncating * @@ -550,20 +551,29 @@ EXPORT_SYMBOL(truncate_pagecache); * between the file and the memory map for a potential last * incomplete page. Ugly, but necessary. */ -int vmtruncate(struct inode *inode, loff_t offset) +void __vmtruncate(struct inode *inode, loff_t offset) { loff_t oldsize; - int error; - error = inode_newsize_ok(inode, offset); - if (error) - return error; oldsize = inode->i_size; i_size_write(inode, offset); truncate_pagecache(inode, oldsize, offset); if (inode->i_op->truncate) inode->i_op->truncate(inode); +} +EXPORT_SYMBOL(__vmtruncate); - return error; +/** + * vmtruncate - unmap mappings "freed" by truncate() syscall + * @inode: inode of the file used + * @offset: file offset to start truncating + */ +int vmtruncate(struct inode *inode, loff_t offset) +{ + int error = inode_newsize_ok(inode, offset); + if (error) + return error; + __vmtruncate(inode, offset); + return 0; } EXPORT_SYMBOL(vmtruncate); -- 1.6.6 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/7] vfs: inode_change_ok have to perform all necessery checks 2010-02-19 19:47 ` [PATCH 1/7] mm: add nofail version of vmtruncate() and inode_setattr() Dmitry Monakhov @ 2010-02-19 19:47 ` Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 3/7] vfs: do not allow inode_setattr() to fail after vfs_dq_transfer() Dmitry Monakhov 0 siblings, 1 reply; 13+ messages in thread From: Dmitry Monakhov @ 2010-02-19 19:47 UTC (permalink / raw) To: linux-kernel; +Cc: linux-fsdevel, Dmitry Monakhov Usually this is the only function which called before inode attributes manipulation. In fact inode_change_ok() performs only posix checks. But it new_size check is also important. Otherwise we mail fail in very late stage. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/attr.c | 19 +++++++++++++++++-- include/linux/fs.h | 3 ++- 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/fs/attr.c b/fs/attr.c index b1cc391..cc2a801 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -18,7 +18,7 @@ /* Taken over from the old code... */ /* POSIX UID/GID verification for setting inode attributes. */ -int inode_change_ok(const struct inode *inode, struct iattr *attr) +int inode_change_posix_ok(const struct inode *inode, struct iattr *attr) { int retval = -EPERM; unsigned int ia_valid = attr->ia_valid; @@ -60,7 +60,7 @@ fine: error: return retval; } -EXPORT_SYMBOL(inode_change_ok); +EXPORT_SYMBOL(inode_change_posix_ok); /** * inode_newsize_ok - may this inode be truncated to a given size @@ -105,6 +105,21 @@ out_big: } EXPORT_SYMBOL(inode_newsize_ok); +/* + * General verification for setting inode attributes. + */ +int inode_change_ok(const struct inode *inode, struct iattr *attr) +{ + int ret; + ret = inode_change_posix_ok(inode, attr); + if (ret) + return ret; + if (attr->ia_valid & ATTR_SIZE) + ret = inode_newsize_ok(inode, attr->ia_size); + return ret; +} +EXPORT_SYMBOL(inode_change_ok); + void __inode_setattr(struct inode * inode, struct iattr * attr) { unsigned int ia_valid = attr->ia_valid; diff --git a/include/linux/fs.h b/include/linux/fs.h index 18b7be8..eee26ea 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2371,8 +2371,9 @@ extern int buffer_migrate_page(struct address_space *, #define buffer_migrate_page NULL #endif -extern int inode_change_ok(const struct inode *, struct iattr *); +extern int inode_change_posix_ok(const struct inode *, struct iattr *); extern int inode_newsize_ok(const struct inode *, loff_t offset); +extern int inode_change_ok(const struct inode *, struct iattr *); extern void __inode_setattr(struct inode *, struct iattr *); extern int __must_check inode_setattr(struct inode *, struct iattr *); -- 1.6.6 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 3/7] vfs: do not allow inode_setattr() to fail after vfs_dq_transfer() 2010-02-19 19:47 ` [PATCH 2/7] vfs: inode_change_ok have to perform all necessery checks Dmitry Monakhov @ 2010-02-19 19:47 ` Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 4/7] ext2: use nofail variant of inode_setattr() Dmitry Monakhov 0 siblings, 1 reply; 13+ messages in thread From: Dmitry Monakhov @ 2010-02-19 19:47 UTC (permalink / raw) To: linux-kernel; +Cc: linux-fsdevel, Dmitry Monakhov After quota was transfered, inode_setattr() may fail. This is tricky situation because it may be impossible to roll-back quota changes. But we have already done all necessery changes in inode_change_ok() so it is safe to use nofail version of inode_seattr() here. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/attr.c | 6 +++++- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/fs/attr.c b/fs/attr.c index cc2a801..c1ae61c 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -243,7 +243,11 @@ int notify_change(struct dentry * dentry, struct iattr * attr) error = vfs_dq_transfer(inode, attr) ? -EDQUOT : 0; if (!error) - error = inode_setattr(inode, attr); + /* + * All necessery check already done, it is + * safe to use nofail version here. + */ + __inode_setattr(inode, attr); } } -- 1.6.6 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 4/7] ext2: use nofail variant of inode_setattr() 2010-02-19 19:47 ` [PATCH 3/7] vfs: do not allow inode_setattr() to fail after vfs_dq_transfer() Dmitry Monakhov @ 2010-02-19 19:47 ` Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 5/7] ext3: " Dmitry Monakhov 0 siblings, 1 reply; 13+ messages in thread From: Dmitry Monakhov @ 2010-02-19 19:47 UTC (permalink / raw) To: linux-kernel; +Cc: linux-fsdevel, Dmitry Monakhov Since inode_change_ok() is now responsible for all necessery checks we may call __inode_setattr() which can not fail. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ext2/inode.c | 6 +++++- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 71b032c..8542b20 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1463,7 +1463,11 @@ int ext2_setattr(struct dentry *dentry, struct iattr *iattr) if (error) return error; } - error = inode_setattr(inode, iattr); + /* + * All necessery check already done in inode_check_ok(), + * it is safe to use nofail version here. + */ + __inode_setattr(inode, iattr); if (!error && (iattr->ia_valid & ATTR_MODE)) error = ext2_acl_chmod(inode); return error; -- 1.6.6 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 5/7] ext3: use nofail variant of inode_setattr() 2010-02-19 19:47 ` [PATCH 4/7] ext2: use nofail variant of inode_setattr() Dmitry Monakhov @ 2010-02-19 19:47 ` Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 6/7] ext4: " Dmitry Monakhov 0 siblings, 1 reply; 13+ messages in thread From: Dmitry Monakhov @ 2010-02-19 19:47 UTC (permalink / raw) To: linux-kernel; +Cc: linux-fsdevel, Dmitry Monakhov After we updated i_disk_size and stop the journal it is too late for error handling from inode_setattr(). Since inode_change_ok() is now responsible for all necessery checks we may call __inode_setattr() which can not fail. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ext3/inode.c | 7 +++++-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 455e6e6..196aee3 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -3184,8 +3184,11 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr) error = rc; ext3_journal_stop(handle); } - - rc = inode_setattr(inode, attr); + /* + * All necessery check already done in inode_check_ok(), + * it is safe to use nofail version here. + */ + __inode_setattr(inode, attr); if (!rc && (ia_valid & ATTR_MODE)) rc = ext3_acl_chmod(inode); -- 1.6.6 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 6/7] ext4: use nofail variant of inode_setattr() 2010-02-19 19:47 ` [PATCH 5/7] ext3: " Dmitry Monakhov @ 2010-02-19 19:47 ` Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 7/7] ocfs2: " Dmitry Monakhov 0 siblings, 1 reply; 13+ messages in thread From: Dmitry Monakhov @ 2010-02-19 19:47 UTC (permalink / raw) To: linux-kernel; +Cc: linux-fsdevel, Dmitry Monakhov After we updated i_disk_size and stop the journal it is too late for error handling from inode_setattr(). Since inode_change_ok() is now responsible for all necessery checks we may call __inode_setattr() which can not fail. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ext4/inode.c | 7 +++++-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 218ea0b..1748265 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5436,8 +5436,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) } } } - - rc = inode_setattr(inode, attr); + /* + * All necessery check already done in inode_check_ok(), + * it is safe to use nofail version here. + */ + __inode_setattr(inode, attr); /* If inode_setattr's call to ext4_truncate failed to get a * transaction handle at all, we need to clean up the in-core -- 1.6.6 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 7/7] ocfs2: use nofail variant of inode_setattr() 2010-02-19 19:47 ` [PATCH 6/7] ext4: " Dmitry Monakhov @ 2010-02-19 19:47 ` Dmitry Monakhov 0 siblings, 0 replies; 13+ messages in thread From: Dmitry Monakhov @ 2010-02-19 19:47 UTC (permalink / raw) To: linux-kernel; +Cc: linux-fsdevel, Dmitry Monakhov After we updated i_disk_size and stop the journal it is too late for error handling from inode_setattr(). Since inode_change_ok() is now responsible for all necessery checks we may call __inode_setattr() which can not fail. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ocfs2/file.c | 15 +-------------- 1 files changed, 1 insertions(+), 14 deletions(-) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 06ccf6a..4aed46b 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -1065,20 +1065,7 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr) goto bail_unlock; } } - - /* - * This will intentionally not wind up calling vmtruncate(), - * since all the work for a size change has been done above. - * Otherwise, we could get into problems with truncate as - * ip_alloc_sem is used there to protect against i_size - * changes. - */ - status = inode_setattr(inode, attr); - if (status < 0) { - mlog_errno(status); - goto bail_commit; - } - + __inode_setattr(inode, attr); status = ocfs2_mark_inode_dirty(handle, inode, bh); if (status < 0) mlog_errno(status); -- 1.6.6 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 0/7] vfs: notify_changes() error handling 2010-02-19 19:47 [PATCH 0/7] vfs: notify_changes() error handling Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 1/7] mm: add nofail version of vmtruncate() and inode_setattr() Dmitry Monakhov @ 2010-02-22 10:35 ` Jan Kara 2010-02-22 11:15 ` Nick Piggin 1 sibling, 1 reply; 13+ messages in thread From: Jan Kara @ 2010-02-22 10:35 UTC (permalink / raw) To: Dmitry Monakhov; +Cc: linux-kernel, linux-fsdevel, npiggin Hi Dmitry, > Current inode attr setting path looks like follows > > ret = inode_change_ok() > if(ret) > goto out; > /* > perform private-fs specific logic here > */ > if(ia_valid & ATTR_UID || ...) > ret = vfs_dq_transfer() > > /* > more private-fs specific logic > for example update on_disk data structures. > */ > > ret = inode_setattr() > > In fact inode_setattr() call vmtruncate() which may fail in number > of reasons IS_SWAPFILE, RLIMIT_FSIZE. After this many filesystem is > unable to rollback changes. And just live inode in inconsistent > state. We may check IS_SWAPFILE at the very beginning(currently it > is not checked), but RLIMIT_FSIZE may changed under our feet. > In order make things straight. Let's divide vmtruncate() in to > two parts which perform all checks, and second which can not fail. > After this notify_change() perform all necessary checks inside > inode_change_ok() and simply call nofail version of vmtruncate(). Actually, there are more problems than these in the truncate path. Some filesystems can decide to fail truncate only in their .truncate method but that is called only after i_size is set which is too late. Nick Piggin has a patch set which was addressing this problem and should be basically a superset of your changes. But I'm not sure whether the patch series is available somewhere or what it's current status. Nick? Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/7] vfs: notify_changes() error handling 2010-02-22 10:35 ` [PATCH 0/7] vfs: notify_changes() error handling Jan Kara @ 2010-02-22 11:15 ` Nick Piggin 2010-02-22 13:30 ` Dmitry Monakhov 0 siblings, 1 reply; 13+ messages in thread From: Nick Piggin @ 2010-02-22 11:15 UTC (permalink / raw) To: Jan Kara; +Cc: Dmitry Monakhov, linux-kernel, linux-fsdevel On Mon, Feb 22, 2010 at 11:35:17AM +0100, Jan Kara wrote: > Hi Dmitry, > > > Current inode attr setting path looks like follows > > > > ret = inode_change_ok() > > if(ret) > > goto out; > > /* > > perform private-fs specific logic here > > */ > > if(ia_valid & ATTR_UID || ...) > > ret = vfs_dq_transfer() > > > > /* > > more private-fs specific logic > > for example update on_disk data structures. > > */ > > > > ret = inode_setattr() > > > > In fact inode_setattr() call vmtruncate() which may fail in number > > of reasons IS_SWAPFILE, RLIMIT_FSIZE. After this many filesystem is > > unable to rollback changes. And just live inode in inconsistent > > state. We may check IS_SWAPFILE at the very beginning(currently it > > is not checked), but RLIMIT_FSIZE may changed under our feet. > > In order make things straight. Let's divide vmtruncate() in to > > two parts which perform all checks, and second which can not fail. > > After this notify_change() perform all necessary checks inside > > inode_change_ok() and simply call nofail version of vmtruncate(). > Actually, there are more problems than these in the truncate path. Some > filesystems can decide to fail truncate only in their .truncate method but that > is called only after i_size is set which is too late. Nick Piggin has a patch > set which was addressing this problem and should be basically a superset of > your changes. But I'm not sure whether the patch series is available somewhere > or what it's current status. Nick? I think Al is happy with it in principle, I changed it as he suggested. Maybe it was put on hold for other reasons. Anyway, here is the core patch again. It now is basically just adding more helpers, so it's not so intrusive. Al, what are your thoughts on merging? It doesn't appear to conflict with the -vfs tree. Dmitry, this doesn't solve your quota problem, but they obviously clash rather heavily. Do you see any problems with the way my patch goes? -- From: Nick Piggin <npiggin@suse.de> Subject: [PATCH] truncate: introduce new sequence Introduce a new truncate calling sequence into fs/mm subsystems. Rather than setattr > vmtruncate > truncate, have filesystems call their truncate sequence from ->setattr if filesystem specific operations are required. vmtruncate is deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced previously should be used. simple_setattr is introduced for simple in-ram filesystems to implement the new truncate sequence. Eventually all filesystems should be converted to implement a setattr, and the default code in notify_change should go away. simple_setsize is also introduced to perform just the ATTR_SIZE portion of simple_setattr (ie. changing i_size and trimming pagecache). To implement the new truncate sequence: - filesystem specific manipulations (eg freeing blocks) must be done in the setattr method rather than ->truncate. - vmtruncate can not be used by core code to trim blocks past i_size in the event of write failure after allocation, so this must be performed in the fs code. - convert usage of helpers block_write_begin, nobh_write_begin, cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed variants. These avoid calling vmtruncate to trim blocks (see previous). - inode_setattr should not be used. generic_setattr is a new function to be used to copy simple attributes into the generic inode. - make use of the better opportunity to handle errors with the new sequence. Big problem with the previous calling sequence: the filesystem is not called until i_size has already changed. This means it is not allowed to fail the call, and also it does not know what the previous i_size was. Also, generic code calling vmtruncate to truncate allocated blocks in case of error had no good way to return a meaningful error (or, for example, atomically handle block deallocation). Cc: Christoph Hellwig <hch@lst.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Nick Piggin <npiggin@suse.de> --- Documentation/filesystems/vfs.txt | 7 +- fs/attr.c | 50 ++++++++++++--- fs/buffer.c | 123 ++++++++++++++++++++++++++++++-------- fs/direct-io.c | 61 ++++++++++++------ fs/libfs.c | 83 +++++++++++++++++++++++++ include/linux/buffer_head.h | 9 ++ include/linux/fs.h | 27 ++++++++ mm/truncate.c | 10 +-- 8 files changed, 307 insertions(+), 63 deletions(-) Index: linux-2.6/Documentation/filesystems/vfs.txt =================================================================== --- linux-2.6.orig/Documentation/filesystems/vfs.txt +++ linux-2.6/Documentation/filesystems/vfs.txt @@ -401,11 +401,16 @@ otherwise noted. started might not be in the page cache at the end of the walk). - truncate: called by the VFS to change the size of a file. The + truncate: Deprecated. This will not be called if ->setsize is defined. + Called by the VFS to change the size of a file. The i_size field of the inode is set to the desired size by the VFS before this method is called. This method is called by the truncate(2) system call and related functionality. + Note: ->truncate and vmtruncate are deprecated. Do not add new + instances/calls of these. Filesystems should be converted to do their + truncate sequence via ->setattr(). + permission: called by the VFS to check for access rights on a POSIX-like filesystem. Index: linux-2.6/fs/attr.c =================================================================== --- linux-2.6.orig/fs/attr.c +++ linux-2.6/fs/attr.c @@ -68,14 +68,14 @@ EXPORT_SYMBOL(inode_change_ok); * @offset: the new size to assign to the inode * @Returns: 0 on success, -ve errno on failure * + * inode_newsize_ok must be called with i_mutex held. + * * inode_newsize_ok will check filesystem limits and ulimits to check that the * new inode size is within limits. inode_newsize_ok will also send SIGXFSZ * when necessary. Caller must not proceed with inode size change if failure is * returned. @inode must be a file (not directory), with appropriate * permissions to allow truncate (inode_newsize_ok does NOT check these * conditions). - * - * inode_newsize_ok must be called with i_mutex held. */ int inode_newsize_ok(const struct inode *inode, loff_t offset) { @@ -105,17 +105,25 @@ out_big: } EXPORT_SYMBOL(inode_newsize_ok); -int inode_setattr(struct inode * inode, struct iattr * attr) +/** + * generic_setattr - copy simple metadata updates into the generic inode + * @inode: the inode to be updated + * @attr: the new attributes + * + * generic_setattr must be called with i_mutex held. + * + * generic_setattr updates the inode's metadata with that specified + * in attr. Noticably missing is inode size update, which is more complex + * as it requires pagecache updates. See simple_setsize. + * + * The inode is not marked as dirty after this operation. The rationale is + * that for "simple" filesystems, the struct inode is the inode storage. + * The caller is free to mark the inode dirty afterwards if needed. + */ +void generic_setattr(struct inode *inode, const struct iattr *attr) { unsigned int ia_valid = attr->ia_valid; - if (ia_valid & ATTR_SIZE && - attr->ia_size != i_size_read(inode)) { - int error = vmtruncate(inode, attr->ia_size); - if (error) - return error; - } - if (ia_valid & ATTR_UID) inode->i_uid = attr->ia_uid; if (ia_valid & ATTR_GID) @@ -136,6 +144,28 @@ int inode_setattr(struct inode * inode, mode &= ~S_ISGID; inode->i_mode = mode; } +} +EXPORT_SYMBOL(generic_setattr); + +/* + * note this function is deprecated, the new truncate sequence should be + * used instead -- see eg. simple_setsize, generic_setattr. + */ +int inode_setattr(struct inode *inode, const struct iattr *attr) +{ + unsigned int ia_valid = attr->ia_valid; + + if (ia_valid & ATTR_SIZE && + attr->ia_size != i_size_read(inode)) { + int error; + + error = vmtruncate(inode, attr->ia_size); + if (error) + return error; + } + + generic_setattr(inode, attr); + mark_inode_dirty(inode); return 0; Index: linux-2.6/fs/buffer.c =================================================================== --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -1957,14 +1957,11 @@ static int __block_commit_write(struct i } /* - * block_write_begin takes care of the basic task of block allocation and - * bringing partial write blocks uptodate first. - * - * If *pagep is not NULL, then block_write_begin uses the locked page - * at *pagep rather than allocating its own. In this case, the page will - * not be unlocked or deallocated on failure. + * Filesystems implementing the new truncate sequence should use the + * _newtrunc postfix variant which won't incorrectly call vmtruncate. + * The filesystem needs to handle block truncation upon failure. */ -int block_write_begin(struct file *file, struct address_space *mapping, +int block_write_begin_newtrunc(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata, get_block_t *get_block) @@ -2000,20 +1997,50 @@ int block_write_begin(struct file *file, unlock_page(page); page_cache_release(page); *pagep = NULL; - - /* - * prepare_write() may have instantiated a few blocks - * outside i_size. Trim these off again. Don't need - * i_size_read because we hold i_mutex. - */ - if (pos + len > inode->i_size) - vmtruncate(inode, inode->i_size); } } out: return status; } +EXPORT_SYMBOL(block_write_begin_newtrunc); + +/* + * block_write_begin takes care of the basic task of block allocation and + * bringing partial write blocks uptodate first. + * + * If *pagep is not NULL, then block_write_begin uses the locked page + * at *pagep rather than allocating its own. In this case, the page will + * not be unlocked or deallocated on failure. + */ +int block_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata, + get_block_t *get_block) +{ + int ret; + + ret = block_write_begin_newtrunc(file, mapping, pos, len, flags, + pagep, fsdata, get_block); + + /* + * prepare_write() may have instantiated a few blocks + * outside i_size. Trim these off again. Don't need + * i_size_read because we hold i_mutex. + * + * Filesystems which pass down their own page also cannot + * call into vmtruncate here because it would lead to lock + * inversion problems (*pagep is locked). This is a further + * example of where the old truncate sequence is inadequate. + */ + if (unlikely(ret) && *pagep == NULL) { + loff_t isize = mapping->host->i_size; + if (pos + len > isize) + vmtruncate(mapping->host, isize); + } + + return ret; +} EXPORT_SYMBOL(block_write_begin); int block_write_end(struct file *file, struct address_space *mapping, @@ -2332,7 +2359,7 @@ out: * For moronic filesystems that do not allow holes in file. * We may have to extend the file. */ -int cont_write_begin(struct file *file, struct address_space *mapping, +int cont_write_begin_newtrunc(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata, get_block_t *get_block, loff_t *bytes) @@ -2353,11 +2380,30 @@ int cont_write_begin(struct file *file, } *pagep = NULL; - err = block_write_begin(file, mapping, pos, len, + err = block_write_begin_newtrunc(file, mapping, pos, len, flags, pagep, fsdata, get_block); out: return err; } +EXPORT_SYMBOL(cont_write_begin_newtrunc); + +int cont_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata, + get_block_t *get_block, loff_t *bytes) +{ + int ret; + + ret = cont_write_begin_newtrunc(file, mapping, pos, len, flags, + pagep, fsdata, get_block, bytes); + if (unlikely(ret)) { + loff_t isize = mapping->host->i_size; + if (pos + len > isize) + vmtruncate(mapping->host, isize); + } + + return ret; +} EXPORT_SYMBOL(cont_write_begin); int block_prepare_write(struct page *page, unsigned from, unsigned to, @@ -2389,7 +2435,7 @@ EXPORT_SYMBOL(block_commit_write); * * We are not allowed to take the i_mutex here so we have to play games to * protect against truncate races as the page could now be beyond EOF. Because - * vmtruncate() writes the inode size before removing pages, once we have the + * truncate writes the inode size before removing pages, once we have the * page lock we can determine safely if the page is beyond EOF. If it is not * beyond EOF, then the page is guaranteed safe against truncation until we * unlock the page. @@ -2472,10 +2518,11 @@ static void attach_nobh_buffers(struct p } /* - * On entry, the page is fully not uptodate. - * On exit the page is fully uptodate in the areas outside (from,to) + * Filesystems implementing the new truncate sequence should use the + * _newtrunc postfix variant which won't incorrectly call vmtruncate. + * The filesystem needs to handle block truncation upon failure. */ -int nobh_write_begin(struct file *file, struct address_space *mapping, +int nobh_write_begin_newtrunc(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata, get_block_t *get_block) @@ -2508,8 +2555,8 @@ int nobh_write_begin(struct file *file, unlock_page(page); page_cache_release(page); *pagep = NULL; - return block_write_begin(file, mapping, pos, len, flags, pagep, - fsdata, get_block); + return block_write_begin_newtrunc(file, mapping, pos, len, + flags, pagep, fsdata, get_block); } if (PageMappedToDisk(page)) @@ -2613,8 +2660,34 @@ out_release: page_cache_release(page); *pagep = NULL; - if (pos + len > inode->i_size) - vmtruncate(inode, inode->i_size); + return ret; +} +EXPORT_SYMBOL(nobh_write_begin_newtrunc); + +/* + * On entry, the page is fully not uptodate. + * On exit the page is fully uptodate in the areas outside (from,to) + */ +int nobh_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata, + get_block_t *get_block) +{ + int ret; + + ret = nobh_write_begin_newtrunc(file, mapping, pos, len, flags, + pagep, fsdata, get_block); + + /* + * prepare_write() may have instantiated a few blocks + * outside i_size. Trim these off again. Don't need + * i_size_read because we hold i_mutex. + */ + if (unlikely(ret)) { + loff_t isize = mapping->host->i_size; + if (pos + len > isize) + vmtruncate(mapping->host, isize); + } return ret; } Index: linux-2.6/fs/direct-io.c =================================================================== --- linux-2.6.orig/fs/direct-io.c +++ linux-2.6/fs/direct-io.c @@ -1087,27 +1087,8 @@ direct_io_worker(int rw, struct kiocb *i return ret; } -/* - * This is a library function for use by filesystem drivers. - * - * The locking rules are governed by the flags parameter: - * - if the flags value contains DIO_LOCKING we use a fancy locking - * scheme for dumb filesystems. - * For writes this function is called under i_mutex and returns with - * i_mutex held, for reads, i_mutex is not held on entry, but it is - * taken and dropped again before returning. - * For reads and writes i_alloc_sem is taken in shared mode and released - * on I/O completion (which may happen asynchronously after returning to - * the caller). - * - * - if the flags value does NOT contain DIO_LOCKING we don't use any - * internal locking but rather rely on the filesystem to synchronize - * direct I/O reads/writes versus each other and truncate. - * For reads and writes both i_mutex and i_alloc_sem are not held on - * entry and are never taken. - */ ssize_t -__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, +__blockdev_direct_IO_newtrunc(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, int flags) @@ -1199,9 +1180,46 @@ __blockdev_direct_IO(int rw, struct kioc retval = direct_io_worker(rw, iocb, inode, iov, offset, nr_segs, blkbits, get_block, end_io, dio); +out: + return retval; +} +EXPORT_SYMBOL(__blockdev_direct_IO_newtrunc); + +/* + * This is a library function for use by filesystem drivers. + * + * The locking rules are governed by the flags parameter: + * - if the flags value contains DIO_LOCKING we use a fancy locking + * scheme for dumb filesystems. + * For writes this function is called under i_mutex and returns with + * i_mutex held, for reads, i_mutex is not held on entry, but it is + * taken and dropped again before returning. + * For reads and writes i_alloc_sem is taken in shared mode and released + * on I/O completion (which may happen asynchronously after returning to + * the caller). + * + * - if the flags value does NOT contain DIO_LOCKING we don't use any + * internal locking but rather rely on the filesystem to synchronize + * direct I/O reads/writes versus each other and truncate. + * For reads and writes both i_mutex and i_alloc_sem are not held on + * entry and are never taken. + */ +ssize_t +__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, + struct block_device *bdev, const struct iovec *iov, loff_t offset, + unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, + int flags) +{ + ssize_t retval; + + retval = __blockdev_direct_IO_newtrunc(rw, iocb, inode, bdev, iov, + offset, nr_segs, get_block, end_io, flags); /* * In case of error extending write may have instantiated a few * blocks outside i_size. Trim these off again for DIO_LOCKING. + * NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this in + * their own manner. This is a further example of where the old + * truncate sequence is inadequate. * * NOTE: filesystems with their own locking have to handle this * on their own. @@ -1209,12 +1227,13 @@ __blockdev_direct_IO(int rw, struct kioc if (flags & DIO_LOCKING) { if (unlikely((rw & WRITE) && retval < 0)) { loff_t isize = i_size_read(inode); + loff_t end = offset + iov_length(iov, nr_segs); + if (end > isize) vmtruncate(inode, isize); } } -out: return retval; } EXPORT_SYMBOL(__blockdev_direct_IO); Index: linux-2.6/fs/libfs.c =================================================================== --- linux-2.6.orig/fs/libfs.c +++ linux-2.6/fs/libfs.c @@ -7,6 +7,7 @@ #include <linux/pagemap.h> #include <linux/mount.h> #include <linux/vfs.h> +#include <linux/quotaops.h> #include <linux/mutex.h> #include <linux/exportfs.h> #include <linux/writeback.h> @@ -329,6 +330,88 @@ int simple_rename(struct inode *old_dir, return 0; } +/** + * simple_setsize - handle core mm and vfs requirements for file size change + * @inode: inode + * @newsize: new file size + * + * Returns 0 on success, -error on failure. + * + * simple_setsize must be called with inode_mutex held. + * + * simple_setsize will check that the requested new size is OK (see + * inode_newsize_ok), and then will perform the necessary i_size update + * and pagecache truncation (if necessary). It will be typically be called + * from the filesystem's setattr function when ATTR_SIZE is passed in. + * + * The inode itself must have correct permissions and attributes to allow + * i_size to be changed, this function then just checks that the new size + * requested is valid. + * + * In the case of simple in-memory filesystems with inodes stored solely + * in the inode cache, and file data in the pagecache, nothing more needs + * to be done to satisfy a truncate request. Filesystems with on-disk + * blocks for example will need to free them in the case of truncate, in + * that case it may be easier not to use simple_setsize (but each of its + * components will likely be required at some point to update pagecache + * and inode etc). + */ +int simple_setsize(struct inode *inode, loff_t newsize) +{ + loff_t oldsize; + int error; + + error = inode_newsize_ok(inode, newsize); + if (error) + return error; + + oldsize = inode->i_size; + i_size_write(inode, newsize); + truncate_pagecache(inode, oldsize, newsize); + + return error; +} +EXPORT_SYMBOL(simple_setsize); + +/** + * simple_setattr - setattr for simple in-memory filesystem + * @dentry: dentry + * @iattr: iattr structure + * + * Returns 0 on success, -error on failure. + * + * simple_setattr implements setattr for an in-memory filesystem which + * does not store its own file data or metadata (eg. uses the page cache + * and inode cache as its data store). + */ +int simple_setattr(struct dentry *dentry, struct iattr *iattr) +{ + struct inode *inode = dentry->d_inode; + int error; + + error = inode_change_ok(inode, iattr); + if (error) + return error; + + if ((iattr->ia_valid & ATTR_UID && iattr->ia_uid != inode->i_uid) || + (iattr->ia_valid & ATTR_GID && iattr->ia_gid != inode->i_gid)) { + error = vfs_dq_transfer(inode, iattr) ? -EDQUOT : 0; + if (error) + return error; + } + + if (iattr->ia_valid & ATTR_SIZE) { + error = simple_setsize(inode, iattr->ia_size); + if (error) + return error; + } + + generic_setattr(inode, iattr); + + return error; +} +EXPORT_SYMBOL(simple_setattr); + int simple_readpage(struct file *file, struct page *page) { clear_highpage(page); Index: linux-2.6/include/linux/fs.h =================================================================== --- linux-2.6.orig/include/linux/fs.h +++ linux-2.6/include/linux/fs.h @@ -2244,6 +2244,10 @@ static inline int xip_truncate_page(stru #endif #ifdef CONFIG_BLOCK +ssize_t __blockdev_direct_IO_newtrunc(int rw, struct kiocb *iocb, struct inode *inode, + struct block_device *bdev, const struct iovec *iov, loff_t offset, + unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, + int lock_type); ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, @@ -2257,6 +2261,24 @@ enum { DIO_SKIP_HOLES = 0x02, }; +static inline ssize_t blockdev_direct_IO_newtrunc(int rw, struct kiocb *iocb, + struct inode *inode, struct block_device *bdev, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, get_block_t get_block, + dio_iodone_t end_io) +{ + return __blockdev_direct_IO_newtrunc(rw, iocb, inode, bdev, iov, offset, + nr_segs, get_block, end_io, + DIO_LOCKING | DIO_SKIP_HOLES); +} + +static inline ssize_t blockdev_direct_IO_no_locking_newtrunc(int rw, struct kiocb *iocb, + struct inode *inode, struct block_device *bdev, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, get_block_t get_block, + dio_iodone_t end_io) +{ + return __blockdev_direct_IO_newtrunc(rw, iocb, inode, bdev, iov, offset, + nr_segs, get_block, end_io, 0); +} static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, unsigned long nr_segs, get_block_t get_block, @@ -2327,12 +2349,14 @@ extern int dcache_dir_open(struct inode extern int dcache_dir_close(struct inode *, struct file *); extern loff_t dcache_dir_lseek(struct file *, loff_t, int); extern int dcache_readdir(struct file *, void *, filldir_t); +extern int simple_setattr(struct dentry *dentry, struct iattr *attr); extern int simple_getattr(struct vfsmount *, struct dentry *, struct kstat *); extern int simple_statfs(struct dentry *, struct kstatfs *); extern int simple_link(struct dentry *, struct inode *, struct dentry *); extern int simple_unlink(struct inode *, struct dentry *); extern int simple_rmdir(struct inode *, struct dentry *); extern int simple_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); +extern int simple_setsize(struct inode *inode, loff_t newsize); extern int simple_sync_file(struct file *, struct dentry *, int); extern int simple_empty(struct dentry *); extern int simple_readpage(struct file *file, struct page *page); @@ -2367,7 +2391,8 @@ extern int buffer_migrate_page(struct ad extern int inode_change_ok(const struct inode *, struct iattr *); extern int inode_newsize_ok(const struct inode *, loff_t offset); -extern int __must_check inode_setattr(struct inode *, struct iattr *); +extern int __must_check inode_setattr(struct inode *, const struct iattr *); +extern void generic_setattr(struct inode *inode, const struct iattr *attr); extern void file_update_time(struct file *file); Index: linux-2.6/mm/truncate.c =================================================================== --- linux-2.6.orig/mm/truncate.c +++ linux-2.6/mm/truncate.c @@ -547,18 +547,18 @@ EXPORT_SYMBOL(truncate_pagecache); * NOTE! We have to be ready to update the memory sharing * between the file and the memory map for a potential last * incomplete page. Ugly, but necessary. + * + * This function is deprecated and simple_setsize or truncate_pagecache + * should be used instead. */ int vmtruncate(struct inode *inode, loff_t offset) { - loff_t oldsize; int error; - error = inode_newsize_ok(inode, offset); + error = simple_setsize(inode, offset); if (error) return error; - oldsize = inode->i_size; - i_size_write(inode, offset); - truncate_pagecache(inode, oldsize, offset); + if (inode->i_op->truncate) inode->i_op->truncate(inode); Index: linux-2.6/include/linux/buffer_head.h =================================================================== --- linux-2.6.orig/include/linux/buffer_head.h +++ linux-2.6/include/linux/buffer_head.h @@ -203,6 +203,9 @@ int block_write_full_page_endio(struct p int block_read_full_page(struct page*, get_block_t*); int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc, unsigned long from); +int block_write_begin_newtrunc(struct file *, struct address_space *, + loff_t, unsigned, unsigned, + struct page **, void **, get_block_t*); int block_write_begin(struct file *, struct address_space *, loff_t, unsigned, unsigned, struct page **, void **, get_block_t*); @@ -214,6 +217,9 @@ int generic_write_end(struct file *, str struct page *, void *); void page_zero_new_buffers(struct page *page, unsigned from, unsigned to); int block_prepare_write(struct page*, unsigned, unsigned, get_block_t*); +int cont_write_begin_newtrunc(struct file *, struct address_space *, loff_t, + unsigned, unsigned, struct page **, void **, + get_block_t *, loff_t *); int cont_write_begin(struct file *, struct address_space *, loff_t, unsigned, unsigned, struct page **, void **, get_block_t *, loff_t *); @@ -225,6 +231,9 @@ void block_sync_page(struct page *); sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *); int block_truncate_page(struct address_space *, loff_t, get_block_t *); int file_fsync(struct file *, struct dentry *, int); +int nobh_write_begin_newtrunc(struct file *, struct address_space *, + loff_t, unsigned, unsigned, + struct page **, void **, get_block_t*); int nobh_write_begin(struct file *, struct address_space *, loff_t, unsigned, unsigned, struct page **, void **, get_block_t*); ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/7] vfs: notify_changes() error handling 2010-02-22 11:15 ` Nick Piggin @ 2010-02-22 13:30 ` Dmitry Monakhov 2010-02-22 14:56 ` Nick Piggin 0 siblings, 1 reply; 13+ messages in thread From: Dmitry Monakhov @ 2010-02-22 13:30 UTC (permalink / raw) To: Nick Piggin; +Cc: Jan Kara, linux-kernel, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 3313 bytes --] Nick Piggin <npiggin@suse.de> writes: > On Mon, Feb 22, 2010 at 11:35:17AM +0100, Jan Kara wrote: >> Hi Dmitry, >> >> > Current inode attr setting path looks like follows >> > >> > ret = inode_change_ok() >> > if(ret) >> > goto out; >> > /* >> > perform private-fs specific logic here >> > */ >> > if(ia_valid & ATTR_UID || ...) >> > ret = vfs_dq_transfer() >> > >> > /* >> > more private-fs specific logic >> > for example update on_disk data structures. >> > */ >> > >> > ret = inode_setattr() >> > >> > In fact inode_setattr() call vmtruncate() which may fail in number >> > of reasons IS_SWAPFILE, RLIMIT_FSIZE. After this many filesystem is >> > unable to rollback changes. And just live inode in inconsistent >> > state. We may check IS_SWAPFILE at the very beginning(currently it >> > is not checked), but RLIMIT_FSIZE may changed under our feet. >> > In order make things straight. Let's divide vmtruncate() in to >> > two parts which perform all checks, and second which can not fail. >> > After this notify_change() perform all necessary checks inside >> > inode_change_ok() and simply call nofail version of vmtruncate(). >> Actually, there are more problems than these in the truncate path. Some >> filesystems can decide to fail truncate only in their .truncate method but that >> is called only after i_size is set which is too late. Nick Piggin has a patch >> set which was addressing this problem and should be basically a superset of >> your changes. But I'm not sure whether the patch series is available somewhere >> or what it's current status. Nick? > > I think Al is happy with it in principle, I changed it as he suggested. > Maybe it was put on hold for other reasons. Anyway, here is the core > patch again. It now is basically just adding more helpers, so it's not > so intrusive. > > Al, what are your thoughts on merging? It doesn't appear to conflict > with the -vfs tree. > > Dmitry, this doesn't solve your quota problem, but they obviously clash > rather heavily. Do you see any problems with the way my patch goes? In fact i dont tried to solve quota issue. I just want to understand why error paths in my code becomes so huge and why they so differ from existing code, now i do understand why :) As soon as i understand this patch add changes only for core part. And later other filesystems will handle the rest. I am agree with it, vmtruncate is nightmare. But imho also we have to solve generic inode attr check/set issue fs_XXX_setattr() { ... ret = inode_size_ok(inode, attr) if (ret) goto bad; if(private_fs_update_on_disk_data(new_size)) goto bad; if(simple_setsize(inode,new_size)) { /* We still can get in to this because RLIMIT_FSIZE may be * changed after inode_size_ok(); * So we have to roll back all fs-speciffic data which may be * paintfull or impossible */ ret = private_fs_update_on_disk_data(old_size) BUG_ON(ret) } } So my purpose is: 1) to move inode_size_ok check in to inode_change_ok() 2) Introduce simple_setsize_nocheck() which just do it's work after all checks was already done. And call simple_setsize_nocheck() inside fsXXX_setattr instead of simple_setsize(). Patch prepared agains yours "truncate: introduce new sequence" [-- Attachment #2: 0001-vfs-inode_change_ok-have-to-perform-all-necessery-ch.patch --] [-- Type: text/plain, Size: 4514 bytes --] >From a876d30dd06dfa772ca2a2184416762843fc3708 Mon Sep 17 00:00:00 2001 From: Dmitry Monakhov <dmonakhov@openvz.org> Date: Mon, 22 Feb 2010 16:15:57 +0300 Subject: [PATCH] vfs: inode_change_ok have to perform all necessery checks Usually this is the only function which called before inode attributes manipulation. In fact inode_change_ok performs only posix checks. But it new_size check is also important. Otherwise we mail fail in very late stage. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/attr.c | 20 ++++++++++++++++++-- fs/libfs.c | 26 +++++++++++++++++++------- include/linux/fs.h | 5 +++-- 3 files changed, 40 insertions(+), 11 deletions(-) diff --git a/fs/attr.c b/fs/attr.c index 1bca479..007c706 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -18,7 +18,7 @@ /* Taken over from the old code... */ /* POSIX UID/GID verification for setting inode attributes. */ -int inode_change_ok(const struct inode *inode, struct iattr *attr) +int inode_change_posix_ok(const struct inode *inode, struct iattr *attr) { int retval = -EPERM; unsigned int ia_valid = attr->ia_valid; @@ -60,7 +60,7 @@ fine: error: return retval; } -EXPORT_SYMBOL(inode_change_ok); +EXPORT_SYMBOL(inode_change_posix_ok); /** * inode_newsize_ok - may this inode be truncated to a given size @@ -105,6 +105,22 @@ out_big: } EXPORT_SYMBOL(inode_newsize_ok); +/* + * General verification for setting inode attributes. + */ +int inode_change_ok(const struct inode *inode, struct iattr *attr) +{ + int ret; + ret = inode_change_posix_ok(inode, attr); + if (ret) + return ret; + if (attr->ia_valid & ATTR_SIZE) + ret = inode_newsize_ok(inode, attr->ia_size); + return ret; +} +EXPORT_SYMBOL(inode_change_ok); + + /** * generic_setattr - copy simple metadata updates into the generic inode * @inode: the inode to be updated diff --git a/fs/libfs.c b/fs/libfs.c index 338575b..2aa89d7 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -329,6 +329,23 @@ int simple_rename(struct inode *old_dir, struct dentry *old_dentry, return 0; } +/** + * simple_setsize_nocheck - handle core mm and vfs requirements for file + * size change + * @inode: inode + * @newsize: new file size + * simple_setsize_nocheck must be called with inode_mutex held. + * + * Caller is responsible for all appropriate checks + */ +void simple_setsize_nocheck(struct inode *inode, loff_t newsize) +{ + loff_t oldsize; + oldsize = inode->i_size; + i_size_write(inode, newsize); + truncate_pagecache(inode, oldsize, newsize); +} +EXPORT_SYMBOL(simple_setsize_nocheck); /** * simple_setsize - handle core mm and vfs requirements for file size change @@ -358,18 +375,13 @@ int simple_rename(struct inode *old_dir, struct dentry *old_dentry, */ int simple_setsize(struct inode *inode, loff_t newsize) { - loff_t oldsize; int error; error = inode_newsize_ok(inode, newsize); if (error) return error; - - oldsize = inode->i_size; - i_size_write(inode, newsize); - truncate_pagecache(inode, oldsize, newsize); - - return error; + simple_setsize_nocheck(inode, newsize); + return 0; } EXPORT_SYMBOL(simple_setsize); diff --git a/include/linux/fs.h b/include/linux/fs.h index f5638e5..1097e28 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2361,6 +2361,7 @@ extern int simple_unlink(struct inode *, struct dentry *); extern int simple_rmdir(struct inode *, struct dentry *); extern int simple_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); extern int simple_setsize(struct inode *inode, loff_t newsize); +extern void simple_setsize_nocheck(struct inode *inode, loff_t newsize); extern int simple_sync_file(struct file *, struct dentry *, int); extern int simple_empty(struct dentry *); extern int simple_readpage(struct file *file, struct page *page); @@ -2395,11 +2396,11 @@ extern int buffer_migrate_page(struct address_space *, #define buffer_migrate_page NULL #endif -extern int inode_change_ok(const struct inode *, struct iattr *); +extern int inode_change_posix_ok(const struct inode *, struct iattr *); extern int inode_newsize_ok(const struct inode *, loff_t offset); +extern int inode_change_ok(const struct inode *, struct iattr *); extern int __must_check inode_setattr(struct inode *, const struct iattr *); extern void generic_setattr(struct inode *inode, const struct iattr *attr); - extern void file_update_time(struct file *file); extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt); -- 1.6.6 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 0/7] vfs: notify_changes() error handling 2010-02-22 13:30 ` Dmitry Monakhov @ 2010-02-22 14:56 ` Nick Piggin 2010-02-22 17:37 ` Dmitry Monakhov 0 siblings, 1 reply; 13+ messages in thread From: Nick Piggin @ 2010-02-22 14:56 UTC (permalink / raw) To: Dmitry Monakhov; +Cc: Jan Kara, linux-kernel, linux-fsdevel On Mon, Feb 22, 2010 at 04:30:26PM +0300, Dmitry Monakhov wrote: > Nick Piggin <npiggin@suse.de> writes: > > > On Mon, Feb 22, 2010 at 11:35:17AM +0100, Jan Kara wrote: > >> > After this notify_change() perform all necessary checks inside > >> > inode_change_ok() and simply call nofail version of vmtruncate(). > >> Actually, there are more problems than these in the truncate path. Some > >> filesystems can decide to fail truncate only in their .truncate method but that > >> is called only after i_size is set which is too late. Nick Piggin has a patch > >> set which was addressing this problem and should be basically a superset of > >> your changes. But I'm not sure whether the patch series is available somewhere > >> or what it's current status. Nick? > > > > I think Al is happy with it in principle, I changed it as he suggested. > > Maybe it was put on hold for other reasons. Anyway, here is the core > > patch again. It now is basically just adding more helpers, so it's not > > so intrusive. > > > > Al, what are your thoughts on merging? It doesn't appear to conflict > > with the -vfs tree. > > > > Dmitry, this doesn't solve your quota problem, but they obviously clash > > rather heavily. Do you see any problems with the way my patch goes? > In fact i dont tried to solve quota issue. I just want to understand > why error paths in my code becomes so huge and why they so differ > from existing code, now i do understand why :) Oh, but you attempted it (partially?) by moving the inode size check into inode_change_ok(). > As soon as i understand this patch add changes only for core part. > And later other filesystems will handle the rest. Yes. Most of them we have converted to the new sequence in subsequent patches. From that point it should be easier to improve error handling. > I am agree with it, vmtruncate is nightmare. > But imho also we have to solve generic inode attr check/set issue > fs_XXX_setattr() > { > ... > ret = inode_size_ok(inode, attr) > if (ret) > goto bad; > if(private_fs_update_on_disk_data(new_size)) > goto bad; > if(simple_setsize(inode,new_size)) { > /* We still can get in to this because RLIMIT_FSIZE may be > * changed after inode_size_ok(); > * So we have to roll back all fs-speciffic data which may be > * paintfull or impossible > */ > ret = private_fs_update_on_disk_data(old_size) > BUG_ON(ret) > } > } > So my purpose is: > 1) to move inode_size_ok check in to inode_change_ok() > 2) Introduce simple_setsize_nocheck() which just do it's work > after all checks was already done. > And call simple_setsize_nocheck() inside fsXXX_setattr instead > of simple_setsize(). > > Patch prepared agains yours "truncate: introduce new sequence" Hmm, I wonder if it would be safer to rename the function if changing behaviour like this so it loudly breaks external modules. Possibly this breaks nfsd usage? But as a general rule it's a good idea to do the easy checks first, so I can't see the harm in moving inode_newsize_ok checks as early as possible. > >From a876d30dd06dfa772ca2a2184416762843fc3708 Mon Sep 17 00:00:00 2001 > From: Dmitry Monakhov <dmonakhov@openvz.org> > Date: Mon, 22 Feb 2010 16:15:57 +0300 > Subject: [PATCH] vfs: inode_change_ok have to perform all necessery checks > > Usually this is the only function which called before inode > attributes manipulation. In fact inode_change_ok performs only > posix checks. But it new_size check is also important. Otherwise > we mail fail in very late stage. > > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> > --- > fs/attr.c | 20 ++++++++++++++++++-- > fs/libfs.c | 26 +++++++++++++++++++------- > include/linux/fs.h | 5 +++-- > 3 files changed, 40 insertions(+), 11 deletions(-) > > diff --git a/fs/attr.c b/fs/attr.c > index 1bca479..007c706 100644 > --- a/fs/attr.c > +++ b/fs/attr.c > @@ -18,7 +18,7 @@ > /* Taken over from the old code... */ > > /* POSIX UID/GID verification for setting inode attributes. */ > -int inode_change_ok(const struct inode *inode, struct iattr *attr) > +int inode_change_posix_ok(const struct inode *inode, struct iattr *attr) > { > int retval = -EPERM; > unsigned int ia_valid = attr->ia_valid; > @@ -60,7 +60,7 @@ fine: > error: > return retval; > } > -EXPORT_SYMBOL(inode_change_ok); > +EXPORT_SYMBOL(inode_change_posix_ok); > > /** > * inode_newsize_ok - may this inode be truncated to a given size > @@ -105,6 +105,22 @@ out_big: > } > EXPORT_SYMBOL(inode_newsize_ok); > > +/* > + * General verification for setting inode attributes. > + */ > +int inode_change_ok(const struct inode *inode, struct iattr *attr) > +{ > + int ret; > + ret = inode_change_posix_ok(inode, attr); > + if (ret) > + return ret; > + if (attr->ia_valid & ATTR_SIZE) > + ret = inode_newsize_ok(inode, attr->ia_size); > + return ret; > +} > +EXPORT_SYMBOL(inode_change_ok); > + > + > /** > * generic_setattr - copy simple metadata updates into the generic inode > * @inode: the inode to be updated > diff --git a/fs/libfs.c b/fs/libfs.c > index 338575b..2aa89d7 100644 > --- a/fs/libfs.c > +++ b/fs/libfs.c > @@ -329,6 +329,23 @@ int simple_rename(struct inode *old_dir, struct dentry *old_dentry, > > return 0; > } > +/** > + * simple_setsize_nocheck - handle core mm and vfs requirements for file > + * size change > + * @inode: inode > + * @newsize: new file size > + * simple_setsize_nocheck must be called with inode_mutex held. > + * > + * Caller is responsible for all appropriate checks > + */ > +void simple_setsize_nocheck(struct inode *inode, loff_t newsize) > +{ > + loff_t oldsize; > + oldsize = inode->i_size; > + i_size_write(inode, newsize); > + truncate_pagecache(inode, oldsize, newsize); > +} > +EXPORT_SYMBOL(simple_setsize_nocheck); > > /** > * simple_setsize - handle core mm and vfs requirements for file size change > @@ -358,18 +375,13 @@ int simple_rename(struct inode *old_dir, struct dentry *old_dentry, > */ > int simple_setsize(struct inode *inode, loff_t newsize) > { > - loff_t oldsize; > int error; > > error = inode_newsize_ok(inode, newsize); > if (error) > return error; > - > - oldsize = inode->i_size; > - i_size_write(inode, newsize); > - truncate_pagecache(inode, oldsize, newsize); > - > - return error; > + simple_setsize_nocheck(inode, newsize); > + return 0; > } > EXPORT_SYMBOL(simple_setsize); If we move the inode_newsize_ok check up to inode_change_ok position, then I think just the non-checking variant should be kept. Or it's now simple enough that it can just be open-coded. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/7] vfs: notify_changes() error handling 2010-02-22 14:56 ` Nick Piggin @ 2010-02-22 17:37 ` Dmitry Monakhov 0 siblings, 0 replies; 13+ messages in thread From: Dmitry Monakhov @ 2010-02-22 17:37 UTC (permalink / raw) To: Nick Piggin; +Cc: Jan Kara, linux-kernel, linux-fsdevel Nick Piggin <npiggin@suse.de> writes: > On Mon, Feb 22, 2010 at 04:30:26PM +0300, Dmitry Monakhov wrote: >> Nick Piggin <npiggin@suse.de> writes: >> >> > On Mon, Feb 22, 2010 at 11:35:17AM +0100, Jan Kara wrote: >> >> > After this notify_change() perform all necessary checks inside >> >> > inode_change_ok() and simply call nofail version of vmtruncate(). >> >> Actually, there are more problems than these in the truncate path. Some >> >> filesystems can decide to fail truncate only in their .truncate method but that >> >> is called only after i_size is set which is too late. Nick Piggin has a patch >> >> set which was addressing this problem and should be basically a superset of >> >> your changes. But I'm not sure whether the patch series is available somewhere >> >> or what it's current status. Nick? >> > >> > I think Al is happy with it in principle, I changed it as he suggested. >> > Maybe it was put on hold for other reasons. Anyway, here is the core >> > patch again. It now is basically just adding more helpers, so it's not >> > so intrusive. >> > >> > Al, what are your thoughts on merging? It doesn't appear to conflict >> > with the -vfs tree. >> > >> > Dmitry, this doesn't solve your quota problem, but they obviously clash >> > rather heavily. Do you see any problems with the way my patch goes? >> In fact i dont tried to solve quota issue. I just want to understand >> why error paths in my code becomes so huge and why they so differ >> from existing code, now i do understand why :) > > Oh, but you attempted it (partially?) by moving the inode size > check into inode_change_ok(). > >> As soon as i understand this patch add changes only for core part. >> And later other filesystems will handle the rest. > > Yes. Most of them we have converted to the new sequence in > subsequent patches. From that point it should be easier to improve > error handling. > >> I am agree with it, vmtruncate is nightmare. >> But imho also we have to solve generic inode attr check/set issue >> fs_XXX_setattr() >> { >> ... >> ret = inode_size_ok(inode, attr) >> if (ret) >> goto bad; >> if(private_fs_update_on_disk_data(new_size)) >> goto bad; >> if(simple_setsize(inode,new_size)) { >> /* We still can get in to this because RLIMIT_FSIZE may be >> * changed after inode_size_ok(); >> * So we have to roll back all fs-speciffic data which may be >> * paintfull or impossible >> */ >> ret = private_fs_update_on_disk_data(old_size) >> BUG_ON(ret) >> } >> } >> So my purpose is: >> 1) to move inode_size_ok check in to inode_change_ok() >> 2) Introduce simple_setsize_nocheck() which just do it's work >> after all checks was already done. >> And call simple_setsize_nocheck() inside fsXXX_setattr instead >> of simple_setsize(). >> >> Patch prepared agains yours "truncate: introduce new sequence" > > Hmm, I wonder if it would be safer to rename the function if > changing behaviour like this so it loudly breaks external modules. > Yeah. Since your patch is mandatory as a start point. And i'm trying to solve slightly different issue. Which result in core patch pending. Let it goes in to vfs tree as is. Later i'll convert all fsXXX_setattr() to sane attribute checks logic. ACK from me. ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2010-02-22 17:37 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-02-19 19:47 [PATCH 0/7] vfs: notify_changes() error handling Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 1/7] mm: add nofail version of vmtruncate() and inode_setattr() Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 2/7] vfs: inode_change_ok have to perform all necessery checks Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 3/7] vfs: do not allow inode_setattr() to fail after vfs_dq_transfer() Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 4/7] ext2: use nofail variant of inode_setattr() Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 5/7] ext3: " Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 6/7] ext4: " Dmitry Monakhov 2010-02-19 19:47 ` [PATCH 7/7] ocfs2: " Dmitry Monakhov 2010-02-22 10:35 ` [PATCH 0/7] vfs: notify_changes() error handling Jan Kara 2010-02-22 11:15 ` Nick Piggin 2010-02-22 13:30 ` Dmitry Monakhov 2010-02-22 14:56 ` Nick Piggin 2010-02-22 17:37 ` Dmitry Monakhov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).