* strange ext{3,4}_settattr logic
@ 2008-03-15 16:07 Dmitri Monakhov
2008-03-15 23:05 ` Andreas Dilger
0 siblings, 1 reply; 8+ messages in thread
From: Dmitri Monakhov @ 2008-03-15 16:07 UTC (permalink / raw)
To: linux-ext4
Hello.
I've found what ext3_setattr() code has some strange logic. I'm talking
about truncate path.
int ext3_setattr(struct dentry *dentry, struct iattr *attr)
{
...
if (S_ISREG(inode->i_mode) &&
attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
handle_t *handle;
<<< This is shrinking case, and according to function comments:
<<< "In particular, we want to make sure that when the VFS
<<< * shrinks i_size, we put the inode on the orphan list and modify
<<< * i_disksize immediately"
<<< we about to write i_disksize. But WHY do we have to do it explicitly?
<<< Later inode_setattr() will call ext3_truncate() which will do it
<<< this work for us.
handle = ext3_journal_start(inode, 3);
if (IS_ERR(handle)) {
error = PTR_ERR(handle);
goto err_out;
}
error = ext3_orphan_add(handle, inode);
EXT3_I(inode)->i_disksize = attr->ia_size;
rc = ext3_mark_inode_dirty(handle, inode);
if (!error)
error = rc;
ext3_journal_stop(handle);
}
rc = inode_setattr(inode, attr);
<<< Now the most interesting question. What we have to do now in
<<< case of error? We are in tricky situation. Truncate not happened,
<<< and blocks visible to the user, but i_disksize was already written,
<<< so later memory reclaiming/ read_inode will result in unexpected
<<< updating i_size.
/* If inode_setattr's call to ext3_truncate failed to get a
* transaction handle at all, we need to clean up the in-core
* orphan list manually. */
<<< Following code will remove inode only from in memory(because handle = NULL)
<<< orphan list. Please someone explain me what this lines suppose to do
<<< actually.
if (inode->i_nlink)
ext3_orphan_del(NULL, inode);
...
}
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: strange ext{3,4}_settattr logic 2008-03-15 16:07 strange ext{3,4}_settattr logic Dmitri Monakhov @ 2008-03-15 23:05 ` Andreas Dilger 2008-03-15 23:54 ` Andreas Dilger 0 siblings, 1 reply; 8+ messages in thread From: Andreas Dilger @ 2008-03-15 23:05 UTC (permalink / raw) To: Dmitri Monakhov; +Cc: linux-ext4 On Mar 15, 2008 19:07 +0300, Dmitri Monakhov wrote: > I've found what ext3_setattr() code has some strange logic. I'm talking > about truncate path. > > int ext3_setattr(struct dentry *dentry, struct iattr *attr) > { > ... > if (S_ISREG(inode->i_mode) && > attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) { > handle_t *handle; > <<< This is shrinking case, and according to function comments: > <<< "In particular, we want to make sure that when the VFS > <<< * shrinks i_size, we put the inode on the orphan list and modify > <<< * i_disksize immediately" > <<< we about to write i_disksize. But WHY do we have to do it explicitly? > <<< Later inode_setattr() will call ext3_truncate() which will do it > <<< this work for us. The reason that i_disksize is written to disk here immediately is that the journal is stopped. Once that is done then in case of a crash the orphan recovery code will detect the unfinished truncate and complete it before mounting the filesystem. Without this it is possible to get a partial truncate after a crash because the truncate may span several transactions due to the potentially large number of blocks that need to be modified. What is important with ext3 is that because e2fsck is not run on each boot whatever is on disk needs to be consistent after a crash. If there is a file being truncated or unlinked that needs to be completed after a crash or the blocks will be leaked. To ensure this happens, there is a singly-linked list of inodes on the disk called the "orphan list" that keeps track of all inodes currently undergoing truncate or unlink. After a crash the kernel or e2fsck will walk this list and finish the truncate or unlink of the inode, freeing the blocks. > rc = inode_setattr(inode, attr); > <<< Now the most interesting question. What we have to do now in > <<< case of error? We are in tricky situation. Truncate not happened, > <<< and blocks visible to the user, but i_disksize was already written, > <<< so later memory reclaiming/ read_inode will result in unexpected > <<< updating i_size. The only ways inode_setattr() can fail are: - expanding vmtruncate hits EFBIG, but we checked that above - shrinking vmtruncate on a swapfile returns ETXTBUSY. This was added after the ext3_setattr() code was written. If the ext3_truncate() or mark_inode_dirty() call fails, it does not return an error code. For ext3 the only way this can fail is if the journal is aborted, which means the filesystem is already in read-only mode and nothing can be done to clean up the truncate until the next mount, at which point the orphan recovery code discussed above will finish the operation. > /* If inode_setattr's call to ext3_truncate failed to get a > * transaction handle at all, we need to clean up the in-core > * orphan list manually. */ > <<< Following code will remove inode only from in memory(because handle = NULL) > <<< orphan list. Please someone explain me what this lines suppose to do > <<< actually. > if (inode->i_nlink) > ext3_orphan_del(NULL, inode); This will only be important in the case of a failed operation above. The ext3_truncate() code will normally have already removed the inode from the orphan list when it is finished, but we aren't sure whether that code was called so we need to do it again here (it is safe to call even if the inode is not on the list) to ensure we don't hit a J_ASSERT() that the orphan list is empty in the unmount code (ext3_put_super()). Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: strange ext{3,4}_settattr logic 2008-03-15 23:05 ` Andreas Dilger @ 2008-03-15 23:54 ` Andreas Dilger 2008-03-16 0:23 ` Andreas Dilger 0 siblings, 1 reply; 8+ messages in thread From: Andreas Dilger @ 2008-03-15 23:54 UTC (permalink / raw) To: Dmitri Monakhov; +Cc: linux-ext4 On Mar 16, 2008 07:05 +0800, Andreas Dilger wrote: > On Mar 15, 2008 19:07 +0300, Dmitri Monakhov wrote: > > I've found what ext3_setattr() code has some strange logic. I'm talking > > about truncate path. > > > > int ext3_setattr(struct dentry *dentry, struct iattr *attr) > > { > > ... > > if (S_ISREG(inode->i_mode) && > > attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) { > > handle_t *handle; > > <<< This is shrinking case, and according to function comments: > > <<< "In particular, we want to make sure that when the VFS > > <<< * shrinks i_size, we put the inode on the orphan list and modify > > <<< * i_disksize immediately" > > <<< we about to write i_disksize. But WHY do we have to do it explicitly? > > <<< Later inode_setattr() will call ext3_truncate() which will do it > > <<< this work for us. > > The reason that i_disksize is written to disk here immediately is that the > journal is stopped. Once that is done then in case of a crash the orphan > recovery code will detect the unfinished truncate and complete it before > mounting the filesystem. > > > rc = inode_setattr(inode, attr); > > <<< Now the most interesting question. What we have to do now in > > <<< case of error? We are in tricky situation. Truncate not happened, > > <<< and blocks visible to the user, but i_disksize was already written, > > <<< so later memory reclaiming/ read_inode will result in unexpected > > <<< updating i_size. > > The only ways inode_setattr() can fail are: > - expanding vmtruncate hits EFBIG, but we checked that above > - shrinking vmtruncate on a swapfile returns ETXTBUSY. This was added > after the ext3_setattr() code was written. I forgot to add that we need to handle the IS_SWAPFILE() case properly. Granted, it probably isn't a very common problem, but the IS_SWAPFILE() check was added explicitly because of clueless users. Also, very few users run with a swapfile in any case, most use a swap partition. It would seem that if you have a swapfile, try to truncate it to 0 (which will fail with -ETXTBUSY) and then unmount the filesystem the size will be truncated to 0: [root@lin-cli1 tests]# dd if=/dev/zero of=/tmp/foo bs=4k count=100 100+0 records in 100+0 records out [root@lin-cli1 tests]# mkswap /tmp/foo Setting up swapspace version 1, size = 405 kB [root@lin-cli1 tests]# swapon /tmp/foo [root@lin-cli1 tests]# swapon -s Filename Type Size Used Priority /dev/hda5 partition 1052216 136 -1 /tmp/foo file 392 0 -2 [root@lin-cli1 tests]# dd if=/dev/zero of=/tmp/foo bs=4k count=1 seek=99 dd: advancing past 405504 bytes in output file `/tmp/foo': Text file busy [root@lin-cli1 tests]# > /tmp/foo bash: /tmp/foo: Text file busy [root@lin-cli1 tests]# ls -l /tmp/foo 404 -rw-r--r-- 1 root root 409600 Mar 15 17:12 /tmp/foo [root@lin-cli1 tests]# swapoff /tmp/foo [root@lin-cli1 tests]# df /tmp Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 20641788 8997964 10595184 46% / [root@lin-cli1 tests]# debugfs /dev/hda1 debugfs 1.40.2.cfs4 (12-Jul-2007) debugfs: stat /tmp/foo Inode: 1346639 Type: regular Mode: 0644 Flags: 0x0 Generation: 2276953 732 User: 0 Group: 0 Size: 0 <<< *** oops *** File ACL: 0 Directory ACL: 0 Links: 1 Blockcount: 808 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x47dc57fb -- Sat Mar 15 17:12:59 2008 atime: 0x47dc57fb -- Sat Mar 15 17:12:59 2008 mtime: 0x47dc57fb -- Sat Mar 15 17:12:59 2008 BLOCKS: (0):2689410, (1-11):2689460-2689470, (IND):2689471, (12-54):2689472-2689514, (55 -59):2689516-2689520, (60-63):2689532-2689535, (64-87):2689544-2689567, (88-96): 2689592-2689600, (97-99):2689619-2689621 TOTAL: 101 debugfs: quit [root@lin-cli1 tests]# e2fsck -fn /dev/hda1 e2fsck 1.40.2.cfs4 (12-Jul-2007) Warning! /dev/hda1 is mounted. Warning: skipping journal recovery because doing a read-only filesystem check. Pass 1: Checking inodes, blocks, and sizes Inode 1346639, i_size is 0, should be 409600. Fix? no Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information /: ********** WARNING: Filesystem still has errors ********** /: 337862/2626560 files (6.9% non-contiguous), 2354290/5242880 blocks Until e2fsck was actually run on the filesystem (which would do the right thing and fix the file size) the swap file would have 0 length and fail to start. It probably isn't fatal for most systems to run without swap these days, but let's see if this would cause a boot-time failure or if it is silently ignored (on a RHEL4 system): [root@lin-cli1 tests]# > /tmp/foo [root@lin-cli1 tests]# debugfs /dev/hda1 debugfs 1.40.2.cfs4 (12-Jul-2007) debugfs: stat /tmp/foo Inode: 1346639 Type: regular Mode: 0644 Flags: 0x0 Generation: 2276953 732 User: 0 Group: 0 Size: 0 File ACL: 0 Directory ACL: 0 Links: 1 Blockcount: 0 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x47dc5a45 -- Sat Mar 15 17:22:45 2008 atime: 0x47dc57fb -- Sat Mar 15 17:12:59 2008 mtime: 0x47dc5a45 -- Sat Mar 15 17:22:45 2008 BLOCKS: debugfs: quit [root@lin-cli1 tests]# swapon /tmp/foo swapon: /tmp/foo: Invalid argument [root@lin-cli1 tests]# echo $? 255 [root@lin-cli1 tests]# grep -C 3 swapon /etc/rc.sysinit # Start up swapping. update_boot_stage RCswap action $"Enabling swap space: " swapon -a -e # Set up binfmt_misc /bin/mount -t binfmt_misc none /proc/sys/fs/binfmt_misc > /dev/null 2>&1 So it appears the error would be ignored, and eventually (because e2fsck will run periodically on filesystems, unless the user turns this off) it would be repaired properly. Still, it should be fixed. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: strange ext{3,4}_settattr logic 2008-03-15 23:54 ` Andreas Dilger @ 2008-03-16 0:23 ` Andreas Dilger 2008-03-16 11:39 ` Dmitri Monakhov 0 siblings, 1 reply; 8+ messages in thread From: Andreas Dilger @ 2008-03-16 0:23 UTC (permalink / raw) To: Dmitri Monakhov; +Cc: linux-ext4 On Mar 16, 2008 07:54 +0800, Andreas Dilger wrote: > A call to inode_setattr() can fail by trying a shrinking vmtruncate() > on a swapfile, which returns ETXTBUSY. This was added after the > ext3_setattr() code was written. > > We need to handle the IS_SWAPFILE() case properly. > Granted, it probably isn't a very common problem, but the IS_SWAPFILE() > check was added explicitly because of clueless users, so it must be hit > occasionally in real life. > > It would seem that if you have a swapfile, try to truncate it to 0 (which > will fail with -ETXTBUSY) and then unmount the filesystem the size will > be truncated to 0. It is also possible to directly write to a swapfile > and corrupt memory, or read from a swapfile and access potentially sensitive > information. Dmitri, (or anyone) can you please give this patch a try. It should be OK, but I'm hesitant to test it on my test box because I'm out of town and if it fails to reboot I can't fix it for a week. It _should_ still be possible to chown, rename, ls, etc the swapfile, can you please verify that in addition to the simple test in my previous email. Note that the "dd" test I did previously was incorrect since "dd" was trying to do a truncate before the write, it needs to be run with conv=notrunc, and then the write succeeds: [root@lin-cli1 tests]# dd if=/dev/zero of=/tmp/foo bs=4k count=1 seek=99 dd: advancing past 405504 bytes in output file `/tmp/foo': Text file busy [root@lin-cli1 tests]# dd if=/dev/zero of=/tmp/foo conv=notrunc bs=4k count=1 seek=99 1+0 records in 1+0 records out [root@lin-cli1 tests]# ls -l /tmp/foo 404 -rw-r--r-- 1 root root 409600 Mar 15 17:56 /tmp/foo [root@lin-cli1 tests]# debugfs -R "stat /tmp/foo" /dev/hda1 debugfs 1.40.2.cfs4 (12-Jul-2007) Inode: 1346639 Type: regular Mode: 0644 Flags: 0x0 Generation: 2276953 732 User: 0 Group: 0 Size: 405504 File ACL: 0 Directory ACL: 0 Links: 1 Blockcount: 808 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x47dc6211 -- Sat Mar 15 17:56:01 2008 atime: 0x47dc5f06 -- Sat Mar 15 17:43:02 2008 mtime: 0x47dc6211 -- Sat Mar 15 17:56:01 2008 dtime: 0x00148c4f -- Fri Jan 16 07:03:59 1970 BLOCKS: (0):2688759, (1-11):2688800-2688810, (IND):2688811, (12-44):2688812-2688844, (45):2689410, (46-99):2689460-2689513 TOTAL: 101 --- linux-2.6.24/fs/namei.c.orig 2008-02-05 07:29:57.000000000 +0800 +++ linux-2.6.24/fs/namei.c 2008-03-16 08:11:41.000000000 +0800 @@ -233,6 +233,10 @@ int permission(struct inode *inode, int if (nd) mnt = nd->mnt; + /* Don't allow direct read, write, or truncate on a swapfile */ + if (IS_SWAPFILE(inode)) + return -ETXTBUSY; + if (mask & MAY_WRITE) { umode_t mode = inode->i_mode; Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: strange ext{3,4}_settattr logic 2008-03-16 0:23 ` Andreas Dilger @ 2008-03-16 11:39 ` Dmitri Monakhov 2008-03-16 15:22 ` Andreas Dilger 2008-03-17 8:24 ` Jens Axboe 0 siblings, 2 replies; 8+ messages in thread From: Dmitri Monakhov @ 2008-03-16 11:39 UTC (permalink / raw) To: Andreas Dilger; +Cc: linux-ext4, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 3549 bytes --] I've added Jens because he may be also interesting in this topic. On 08:23 Sun 16 Mar , Andreas Dilger wrote: > On Mar 16, 2008 07:54 +0800, Andreas Dilger wrote: > > A call to inode_setattr() can fail by trying a shrinking vmtruncate() > > on a swapfile, which returns ETXTBUSY. This was added after the > > ext3_setattr() code was written. > > > > We need to handle the IS_SWAPFILE() case properly. > > Granted, it probably isn't a very common problem, but the IS_SWAPFILE() > > check was added explicitly because of clueless users, so it must be hit > > occasionally in real life. > > > > It would seem that if you have a swapfile, try to truncate it to 0 (which > > will fail with -ETXTBUSY) and then unmount the filesystem the size will > > be truncated to 0. It is also possible to directly write to a swapfile > > and corrupt memory, or read from a swapfile and access potentially sensitive > > information. In fact i've triggered this issue while working on fast_loop device implementation which was proposed by Jens (http://lkml.org/lkml/2008/1/9/50). In fast_loop device use swapfile approach (submitting bio-s directly to underlying block device). I think this idea will be included in mainstream loop device sooner or later. But approach has several issues: One of the most important is effective control mechanism over truncates for lower file, this issue was missed in Jens patch set. This mechanism probably have to have following options. #1: Shrink truncates must be denied. #2: Expand truncates may be allowed. This is good because most of non plain disk image formats (qcow, vmdk, and etc) are growing while adding new data blocks. #3: Allow exclusive owner for file, for example only one user(loop_thread in this case) may truncate file. Provide something similar to bd_claim feature. without this feature on-line shrinking of disk image looks like this: mutex_lock(&inode->i_mutex); inode->i_flags &= ~S_SWAPFILE; mutex_unlock(&inode->i_mutex); /* Perform shrinking truncate. This is absolutely racy operation because * some one else also may perform truncate at this time*/ do_truncate(inode, size); mutex_lock(&inode->i_mutex); inode->i_flags |= S_SWAPFILE; mutex_unlock(&inode->i_mutex); S_SWAPFILE inode option currently equals to #1, and #2. What's why i want use this flag for fast_loop devices. > > Dmitri, (or anyone) > can you please give this patch a try. It should be OK, but I'm hesitant > to test it on my test box because I'm out of town and if it fails to > reboot I can't fix it for a week. It _should_ still be possible to chown, > rename, ls, etc the swapfile, can you please verify that in addition > to the simple test in my previous email. > [snip] > > > > --- linux-2.6.24/fs/namei.c.orig 2008-02-05 07:29:57.000000000 +0800 > +++ linux-2.6.24/fs/namei.c 2008-03-16 08:11:41.000000000 +0800 > @@ -233,6 +233,10 @@ int permission(struct inode *inode, int > if (nd) > mnt = nd->mnt; > > + /* Don't allow direct read, write, or truncate on a swapfile */ Your patch disallow expand truncates (#2) which is definitely not good. In fact if file was opened before S_SWAPFILE flag was setted when it is possible to directly read, write from file. > + if (IS_SWAPFILE(inode)) > + return -ETXTBUSY; > + > if (mask & MAY_WRITE) { > umode_t mode = inode->i_mode; BTW it is impossible to disable swapfile with your patch [root@ts63 tmp]# swapoff /vz/swap swapoff: /vz/swap: Text file busy IMHO S_SWAPFILE flag must be checked in ext3_setattr, see patch attached: [-- Attachment #2: swap_flag.patch --] [-- Type: text/plain, Size: 463 bytes --] diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index eb95670..0f74beb 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -3032,6 +3032,10 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr) attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) { handle_t *handle; + if (IS_SWAPFILE(inode)) { + rc = -ETXTBSY; + goto err_out; + } handle = ext3_journal_start(inode, 3); if (IS_ERR(handle)) { error = PTR_ERR(handle); ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: strange ext{3,4}_settattr logic 2008-03-16 11:39 ` Dmitri Monakhov @ 2008-03-16 15:22 ` Andreas Dilger 2008-03-16 17:48 ` Dmitri Monakhov 2008-03-17 8:24 ` Jens Axboe 1 sibling, 1 reply; 8+ messages in thread From: Andreas Dilger @ 2008-03-16 15:22 UTC (permalink / raw) To: Dmitri Monakhov; +Cc: linux-ext4, Jens Axboe On Mar 16, 2008 14:39 +0300, Dmitri Monakhov wrote: > > --- linux-2.6.24/fs/namei.c.orig 2008-02-05 07:29:57.000000000 +0800 > > +++ linux-2.6.24/fs/namei.c 2008-03-16 08:11:41.000000000 +0800 > > @@ -233,6 +233,10 @@ int permission(struct inode *inode, int > > if (nd) > > mnt = nd->mnt; > > > > + /* Don't allow direct read, write, or truncate on a swapfile */ > > Your patch disallow expand truncates (#2) which is definitely not good. How often is that done though? Since this is only for a swapfile then it isn't needed. > In fact if file was opened before S_SWAPFILE flag was setted when it is > possible to directly read, write from file. I assume that is even a more rare case. I was thinking alternately to do a "deny_write" on the swapfile during swapon() so that this would fail in more cases. > > + if (IS_SWAPFILE(inode)) > > + return -ETXTBUSY; > > + > > if (mask & MAY_WRITE) { > > umode_t mode = inode->i_mode; > > BTW it is impossible to disable swapfile with your patch > [root@ts63 tmp]# swapoff /vz/swap > swapoff: /vz/swap: Text file busy I thought some bug like this might appear. > IMHO S_SWAPFILE flag must be checked in ext3_setattr, see patch attached: No, that still means every other filesystem is broken. Since the current filesystem code doesn't know anything about IS_SWAPFILE I'd rather keep it that way. I think it is better to move my proposed IS_SWAPFILE() check into "MAY_WRITE" and "MAY_EXEC" cases. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: strange ext{3,4}_settattr logic 2008-03-16 15:22 ` Andreas Dilger @ 2008-03-16 17:48 ` Dmitri Monakhov 0 siblings, 0 replies; 8+ messages in thread From: Dmitri Monakhov @ 2008-03-16 17:48 UTC (permalink / raw) To: Andreas Dilger; +Cc: linux-ext4, Jens Axboe On 23:22 Sun 16 Mar , Andreas Dilger wrote: > On Mar 16, 2008 14:39 +0300, Dmitri Monakhov wrote: > > > --- linux-2.6.24/fs/namei.c.orig 2008-02-05 07:29:57.000000000 +0800 > > > +++ linux-2.6.24/fs/namei.c 2008-03-16 08:11:41.000000000 +0800 > > > @@ -233,6 +233,10 @@ int permission(struct inode *inode, int > > > if (nd) > > > mnt = nd->mnt; > > > > > > + /* Don't allow direct read, write, or truncate on a swapfile */ > > > > Your patch disallow expand truncates (#2) which is definitely not good. > > How often is that done though? Since this is only for a swapfile then > it isn't needed. I hope what this technique may be used for implementing non plain device image formats support in loop device driver, currently all such drivers implemented in userspace ( for example qcow: http://brick.kernel.dk/git/?p=qemu.git;a=blob;h=730ae67ed8f7095ca4d22e80aca57678be18fb6c;hb=2fd2f67411787297687e4eecbb1db76ac7d06e45;f=block-qcow.c) which cause bad performance and reliability. > > > In fact if file was opened before S_SWAPFILE flag was setted when it is > > possible to directly read, write from file. > > I assume that is even a more rare case. I was thinking alternately to > do a "deny_write" on the swapfile during swapon() so that this would > fail in more cases. > > > > + if (IS_SWAPFILE(inode)) > > > + return -ETXTBUSY; > > > + > > > if (mask & MAY_WRITE) { > > > umode_t mode = inode->i_mode; > > > > BTW it is impossible to disable swapfile with your patch > > [root@ts63 tmp]# swapoff /vz/swap > > swapoff: /vz/swap: Text file busy > > I thought some bug like this might appear. > > > IMHO S_SWAPFILE flag must be checked in ext3_setattr, see patch attached: > > No, that still means every other filesystem is broken. Since the current > filesystem code doesn't know anything about IS_SWAPFILE I'd rather keep > it that way. I think it is better to move my proposed IS_SWAPFILE() check > into "MAY_WRITE" and "MAY_EXEC" cases. > > Cheers, Andreas > -- > Andreas Dilger > Sr. Staff Engineer, Lustre Group > Sun Microsystems of Canada, Inc. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: strange ext{3,4}_settattr logic 2008-03-16 11:39 ` Dmitri Monakhov 2008-03-16 15:22 ` Andreas Dilger @ 2008-03-17 8:24 ` Jens Axboe 1 sibling, 0 replies; 8+ messages in thread From: Jens Axboe @ 2008-03-17 8:24 UTC (permalink / raw) To: Dmitri Monakhov; +Cc: Andreas Dilger, linux-ext4 On Sun, Mar 16 2008, Dmitri Monakhov wrote: > I've added Jens because he may be also interesting in this topic. > On 08:23 Sun 16 Mar , Andreas Dilger wrote: > > On Mar 16, 2008 07:54 +0800, Andreas Dilger wrote: > > > A call to inode_setattr() can fail by trying a shrinking vmtruncate() > > > on a swapfile, which returns ETXTBUSY. This was added after the > > > ext3_setattr() code was written. > > > > > > We need to handle the IS_SWAPFILE() case properly. > > > Granted, it probably isn't a very common problem, but the IS_SWAPFILE() > > > check was added explicitly because of clueless users, so it must be hit > > > occasionally in real life. > > > > > > It would seem that if you have a swapfile, try to truncate it to 0 (which > > > will fail with -ETXTBUSY) and then unmount the filesystem the size will > > > be truncated to 0. It is also possible to directly write to a swapfile > > > and corrupt memory, or read from a swapfile and access potentially sensitive > > > information. > In fact i've triggered this issue while working on fast_loop device > implementation which was proposed by Jens (http://lkml.org/lkml/2008/1/9/50). > In fast_loop device use swapfile approach (submitting bio-s directly to > underlying block device). I think this idea will be included in mainstream > loop device sooner or later. But approach has several issues: > One of the most important is effective control mechanism over truncates for > lower file, this issue was missed in Jens patch set. > This mechanism probably have to have following options. > #1: Shrink truncates must be denied. > #2: Expand truncates may be allowed. This is good because most of non plain > disk image formats (qcow, vmdk, and etc) are growing while adding new data > blocks. > #3: Allow exclusive owner for file, for example only one user(loop_thread in > this case) may truncate file. Provide something similar to bd_claim feature. > without this feature on-line shrinking of disk image looks like this: > > mutex_lock(&inode->i_mutex); > inode->i_flags &= ~S_SWAPFILE; > mutex_unlock(&inode->i_mutex); > > /* Perform shrinking truncate. This is absolutely racy operation because > * some one else also may perform truncate at this time*/ > do_truncate(inode, size); > > mutex_lock(&inode->i_mutex); > inode->i_flags |= S_SWAPFILE; > mutex_unlock(&inode->i_mutex); > > > S_SWAPFILE inode option currently equals to #1, and #2. What's why i want > use this flag for fast_loop devices. Neat, I didn't consider that. Mainly because I had (knowingly) ignored the exclusive owner bit so far. I have included your suggestion in the loop-fastfs and loop-extent_map branches, thanks. -- Jens Axboe ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2008-03-17 8:24 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-03-15 16:07 strange ext{3,4}_settattr logic Dmitri Monakhov
2008-03-15 23:05 ` Andreas Dilger
2008-03-15 23:54 ` Andreas Dilger
2008-03-16 0:23 ` Andreas Dilger
2008-03-16 11:39 ` Dmitri Monakhov
2008-03-16 15:22 ` Andreas Dilger
2008-03-16 17:48 ` Dmitri Monakhov
2008-03-17 8:24 ` Jens Axboe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox