* [bug report] tune2fs: filesystem inconsistency occurs by concurrent write @ 2023-06-25 16:00 zhanchengbin 2023-06-26 2:17 ` Theodore Ts'o 0 siblings, 1 reply; 9+ messages in thread From: zhanchengbin @ 2023-06-25 16:00 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4, linfeilong, louhongxiang, liuzhiqiang26 Hi Tytso, Tune2fs does not recognize writes to the manipulated filesystem in another namespace, there will be two simultaneous write operations on a block, resulting in filesystem inconsistencies. The operation is as follows: first terminal second terminal mkfs.ext4 /dev/sdb; mount /dev/sdb /test-sdb; dd if=/dev/zero of=/test-sdb/test1 bs=1M count=100; unshare -m; umount; gdb tune2fs; b io_channel_write_byte r -e remount-ro /dev/sdb c(Write a byte of old data into the cache) exit; (gdb finish) tune2fs -l /dev/sdb; tune2fs 1.46.4 (18-Aug-2021) tune2fs: Superblock checksum does not match superblock while trying to open /dev/sdb Couldn't find valid filesystem superblock. - bin. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write 2023-06-25 16:00 [bug report] tune2fs: filesystem inconsistency occurs by concurrent write zhanchengbin @ 2023-06-26 2:17 ` Theodore Ts'o 2023-06-26 11:56 ` zhanchengbin 2023-07-04 8:35 ` zhanchengbin 0 siblings, 2 replies; 9+ messages in thread From: Theodore Ts'o @ 2023-06-26 2:17 UTC (permalink / raw) To: zhanchengbin; +Cc: linux-ext4, linfeilong, louhongxiang, liuzhiqiang26 On Mon, Jun 26, 2023 at 12:00:08AM +0800, zhanchengbin wrote: > Tune2fs does not recognize writes to the manipulated filesystem in another > namespace, there will be two simultaneous write operations on a > block, resulting in filesystem inconsistencies. What you are reporting has nothing to do with namespaces, since "tune2fs -e remount-ro /dev/sdb" is something which is allowed regardless of whether the file system is mounted. What reproduction is effectively doing is trying to set up a race between when tune2fs writes a byte to update to update the errors behavior, and when the actual unmount of the file system happens (e.g., when the last namespace unmounts the file system). At that point, the kernel is going to be updating the superblock as part of the unmount, and then it calculates the superblock, and then it writes out the superblock. If the tune2fs races with the unmount, it's possible for the tune2fs update of the error beavhiour bit, and the update of the superblock checksum, to race with the kernel's final update of the superblock, includinig its attempt to update the checksum. There are some workarounds to this, but ultimately, we need to replace the ad-hoc modification of the block device by tune2fs with some ioctls which specifically update superblock when the file system mounted. As far as whether or not tune2fs can detect if the file system is mounted, what we can do is check to see if the block device is busy. If it is mounted in some other namespace, we won't be able to see it mounted in /proc/self/mounts, but we can see that it's not possible to open the block device with O_EXCL. Compare: root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc Device /dev/vdc reports flags 31 /dev/vdc is apparently in use. /dev/vdc is mounted. /dev/vdc is mounted on /vdc. and then "unshare -m" in another terminal, followed by umount /dev/vdc in the first terminal: root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc Device /dev/vdc reports flags 10 /dev/vdc is apparently in use. ... and then after we exit the last mount namespace which was keeping /dev/vdc mounted: root@kvm-xfstests:~# [ 2409.811328] EXT4-fs (vdc): unmounting filesystem bdc026fd-85a8-4ccf-94f8-961487000293. root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc Device /dev/vdc reports flags 00 Cheers, - Ted ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write 2023-06-26 2:17 ` Theodore Ts'o @ 2023-06-26 11:56 ` zhanchengbin 2023-07-04 8:35 ` zhanchengbin 1 sibling, 0 replies; 9+ messages in thread From: zhanchengbin @ 2023-06-26 11:56 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4, linfeilong, louhongxiang, liuzhiqiang26 Thanks for your explanation, it's a great idea. - bin. On 2023/6/26 10:17, Theodore Ts'o wrote: > On Mon, Jun 26, 2023 at 12:00:08AM +0800, zhanchengbin wrote: >> Tune2fs does not recognize writes to the manipulated filesystem in another >> namespace, there will be two simultaneous write operations on a >> block, resulting in filesystem inconsistencies. > > What you are reporting has nothing to do with namespaces, since > "tune2fs -e remount-ro /dev/sdb" is something which is allowed > regardless of whether the file system is mounted. What reproduction > is effectively doing is trying to set up a race between when tune2fs > writes a byte to update to update the errors behavior, and when the > actual unmount of the file system happens (e.g., when the last > namespace unmounts the file system). At that point, the kernel is > going to be updating the superblock as part of the unmount, and then > it calculates the superblock, and then it writes out the superblock. > > If the tune2fs races with the unmount, it's possible for the tune2fs > update of the error beavhiour bit, and the update of the superblock > checksum, to race with the kernel's final update of the superblock, > includinig its attempt to update the checksum. > > There are some workarounds to this, but ultimately, we need to replace > the ad-hoc modification of the block device by tune2fs with some > ioctls which specifically update superblock when the file system > mounted. > > As far as whether or not tune2fs can detect if the file system is > mounted, what we can do is check to see if the block device is busy. > If it is mounted in some other namespace, we won't be able to see it > mounted in /proc/self/mounts, but we can see that it's not possible to > open the block device with O_EXCL. > > Compare: > > root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc > Device /dev/vdc reports flags 31 > /dev/vdc is apparently in use. > /dev/vdc is mounted. > /dev/vdc is mounted on /vdc. > > and then "unshare -m" in another terminal, followed by umount /dev/vdc > in the first terminal: > > root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc > Device /dev/vdc reports flags 10 > /dev/vdc is apparently in use. > > ... and then after we exit the last mount namespace which was keeping > /dev/vdc mounted: > > root@kvm-xfstests:~# [ 2409.811328] EXT4-fs (vdc): unmounting filesystem bdc026fd-85a8-4ccf-94f8-961487000293. > root@kvm-xfstests:~# /vtmp/tst_ismounted /dev/vdc > Device /dev/vdc reports flags 00 > > Cheers, > > - Ted > > . > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write 2023-06-26 2:17 ` Theodore Ts'o 2023-06-26 11:56 ` zhanchengbin @ 2023-07-04 8:35 ` zhanchengbin 2023-07-04 19:33 ` Theodore Ts'o 1 sibling, 1 reply; 9+ messages in thread From: zhanchengbin @ 2023-07-04 8:35 UTC (permalink / raw) To: Theodore Ts'o Cc: linux-ext4, linfeilong, louhongxiang, liuzhiqiang26, Ye Bin Hi Tytso, Referring to you kind and insightful advice, I have come up with three solutions, listed as follows: 1) Add interface blkdev_bh_read and blkdev_bh_write in blkdev_ioctl. Lock the bh_lock when blkdev_bh_read is called, then change the value of the data and write back it, and finally unlock_buffer; This solution ensures that there is no coupling between the block layer and the file system layer, but the interfaces are invoked by user processes, and since user processes are subject to scheduling, there may be potential performance overhead. Additionally, if user process is killed before releasing the bh_lock, it can result in a deadlock situation. 2) Add interface write_super in blkdev_ioctl. In this case, a hook is used in the write_super to invoke the commit_super function of each file system. During the mount process, the commit_super function of the respective filesystem is saved into the hook. Since the metadata cache is stored through bdev, it is sufficient to memcpy the data to bh->b_data and then flush it to disk. This solution allows each file system to implement its own method of flushing the superblock. However, it introduces coupling between the block layer and the filesystem layer, and there needs to be a place to store this commit_super hook function pointer. Is it stored in the block_device? It seems a little bit strange to me. 3) Add interface write_super in ext4_ioctl. The mount point is obtained through the block device, and open a random file in the file system, the superblock is passed to the kernel through ioctl of the file, and finally, the superblock is flushed to disk. Due to the inherent risks associated with granting user space write permissions to the superblock, it is deemed unsafe to utilize the entire superblock as provided by user space. Instead, the superblock should be validated, followed by selective data writing and recording. Of course, it is dangerous to directly modify the data in the super block, so I plan to add a flag in the super block to record that this modification is from the user state. This solution has no coupling with other layers, but uses the ioctl of ordinary files. Personally speaking, I favour the third solution the most, what are your opinions? If you already have other schemes, I will be more than delighted to discuss it with you. Looking foward to hearing from you soon! Thanks, - bin. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write 2023-07-04 8:35 ` zhanchengbin @ 2023-07-04 19:33 ` Theodore Ts'o 2023-07-08 7:29 ` zhanchengbin 0 siblings, 1 reply; 9+ messages in thread From: Theodore Ts'o @ 2023-07-04 19:33 UTC (permalink / raw) To: zhanchengbin; +Cc: linux-ext4, linfeilong, louhongxiang, liuzhiqiang26, Ye Bin On Tue, Jul 04, 2023 at 04:35:43PM +0800, zhanchengbin wrote: > 3) Add interface write_super in ext4_ioctl. The mount point is > obtained through the block device, and open a random file in the > file system, the superblock is passed to the kernel through ioctl of > the file, and finally, the superblock is flushed to disk. Due to the > inherent risks associated with granting user space write permissions > to the superblock, it is deemed unsafe to utilize the entire > superblock as provided by user space. Instead, the superblock should > be validated, followed by selective data writing and recording. Of > course, it is dangerous to directly modify the data in the super > block, so I plan to add a flag in the super block to record that > this modification is from the user state. > Personally speaking, I favour the third solution the most, what are > your opinions? If you already have other schemes, I will be more > than delighted to discuss it with you. Looking foward to hearing > from you soon I agree that the third solution (or some variant thereof) is the best. The first two involve changes to the block layer, and in addition the problems you've outlined, different file systems have file systems in different superblocks in locations on disk, and the superblock may be differently sized, etc. The problem we are trying to solve is also fairly unique to ext2/3/4, since many other file systems either have their own specialized ioctls, or they may simply not allow any file system tuning operations while the file system is mounted. The problem though with a write_super() ioctl though is that while it solves the problem of false positive complaints from syzbot (at least according to my opinion, sane threat models --- the syzbot folks seem to disagree about what a sane threat model ought to be), it doesn't solve the *other* problem which is that the superblock can also be modified by the kernel, so there are some inherent race conditions that can occcur when userspace and kernel are trying to modify the superblock at the same time. This can be potentially solved by a lot of checks by the kernel code handling the hypothetical EXT4_IOC_WRITE_SUPER ioctl. This could be improved by a passing in a second superblock so the ioctl handling code can compare the modified superblocks between the original and modified superblock, and then apply more sanity checks. But that's a lot of extra complexity, and you won't know whether the kernel will support modifying some pariticular superblock field. So the better approach is to have multiple new ioctl's for each superblock field (or set of fields) that you might want modify. We have some of these already --- for example, EXT4_IOC_SETFSUUID and FS_IOC_SETFSLABEL. For tune2fs, all of additional ioctls for making configuration changes while the file system is mounted are: * EXT4_IOC_SET_FEATURES - for tune2fs -O... * EXT4_IOC_CLEAR_FEATURES - for tune2fs -O^... * EXT4_IOC_SET_ERROR_BEHAVIOR - for tune2fs -e * EXT4_IOC_SET_DEFAULT_MOUNT_FLAGS - for tune2fs -o * EXT4_IOC_SET_DEFAULT_MOUNT_OPTS - for tune2fs -E mount_opts=XXX * EXT4_IOC_SET_FSCK_POLICY - for tune2fs -[cCiT] * EXT4_IOC_SET_RESERVED_ALLOC - for tune2fs -[ugm] (The last two assumes using a structure, since it's probably not worth creating 7 new ioctls when 2 would probably do). Some of these ioctls are pretty esoteric, and are rarely used by most system adminsitrators. And we don't have to add all of them all at once; we could gradually add some of them, and most of these are simple enough that we could assign them as a starter project for a new college grad that you've hired into your company, or to an intern. Cheers, - Ted P.S. I've ignored ioctls to "get" the superblock. We could just have a single EXT4_IOC_READ_SUPERBLOCK, or we can solve the problem by simply having the userspace use an O_DIRECT to read the superblock, since this will avoid the potential invalid checksum issues the superblock will only be written out to disk by the kernel when it is self-consistent. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write 2023-07-04 19:33 ` Theodore Ts'o @ 2023-07-08 7:29 ` zhanchengbin 2023-07-12 0:05 ` Darrick J. Wong 0 siblings, 1 reply; 9+ messages in thread From: zhanchengbin @ 2023-07-08 7:29 UTC (permalink / raw) To: Theodore Ts'o Cc: linux-ext4, linfeilong, louhongxiang, liuzhiqiang26, Ye Bin On 2023/7/5 3:33, Theodore Ts'o wrote: I have written the ioctl for EXT4_IOC_SET_ERROR_BEHAVIOR according to your instructions and verified that it can indeed modify the data on the disk. However, I realized some problems. If we use the ioctl method to modify the data, it would require multiple ioctls in user space to modify the superblock. If one ioctl successfully modifies the superblock data, but another ioctl fails, the atomicity of the superblock cannot be guaranteed. This is just within one user space process, not to mention the scenario of multiple user space processes calling ioctls concurrently. Additionally, multiple ioctls modifying the superblock may be placed in multiple transactions, which further compromises atomicity. Writing the entire superblock buffer_head to disk can ensure atomicity. However, these data need to be converted to ext4_sb_info. Otherwise, during unmount, the data in memory will overwrite the data on the disk. (Modifying the values in ext4_sb_info can potentially introduce unexpected issues, just like the problem thata arises when setting the UUID without checking ext4_has_feature_csum_seed.) > So the better approach is to have multiple new ioctl's for each > superblock field (or set of fields) that you might want modify. We > have some of these already --- for example, EXT4_IOC_SETFSUUID and > FS_IOC_SETFSLABEL. For tune2fs, all of additional ioctls for making > configuration changes while the file system is mounted are: > > * EXT4_IOC_SET_FEATURES > - for tune2fs -O... > * EXT4_IOC_CLEAR_FEATURES > - for tune2fs -O^... > * EXT4_IOC_SET_ERROR_BEHAVIOR > - for tune2fs -e > * EXT4_IOC_SET_DEFAULT_MOUNT_FLAGS > - for tune2fs -o > * EXT4_IOC_SET_DEFAULT_MOUNT_OPTS > - for tune2fs -E mount_opts=XXX > * EXT4_IOC_SET_FSCK_POLICY > - for tune2fs -[cCiT] > * EXT4_IOC_SET_RESERVED_ALLOC > - for tune2fs -[ugm] diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index 331859511f80..d598e1975786 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -51,6 +51,11 @@ static void ext4_sb_setuuid(struct ext4_super_block *es, const void *arg) memcpy(es->s_uuid, (__u8 *)arg, UUID_SIZE); } +static void ext4_sb_set_error_behavior(struct ext4_super_block *es, const void *arg) +{ + es->s_errors = cpu_to_le16(*(__u16 *)arg); +} + static int ext4_update_primary_sb(struct super_block *sb, handle_t *handle, ext4_update_sb_callback func, @@ -1220,6 +1225,32 @@ static int ext4_ioctl_setuuid(struct file *filp, return ret; } +static int ext4_ioctl_set_error_behavior(struct file *filp, + const __u16 __user *uerror_behavior) +{ + int ret = 0; + struct super_block *sb = file_inode(filp)->i_sb; + __u16 error_behavior; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (copy_from_user(&error_behavior, uerror_behavior, sizeof(error_behavior))) + return -EFAULT; + + if (error_behavior < EXT4_ERRORS_CONTINUE || error_behavior > EXT4_ERRORS_PANIC) + return -EINVAL; + + ret = mnt_want_write_file(filp); + if (ret) + return ret; + + ret = ext4_update_superblocks_fn(sb, ext4_sb_set_error_behavior, &error_behavior); + mnt_drop_write_file(filp); + + return ret; +} + static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct inode *inode = file_inode(filp); @@ -1607,6 +1638,8 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg); case EXT4_IOC_SETFSUUID: return ext4_ioctl_setuuid(filp, (const void __user *)arg); + case EXT4_IOC_SET_ERROR_BEHAVIOR: + return ext4_ioctl_set_error_behavior(filp, (const void __user *)arg); default: return -ENOTTY; } diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h index 1c4c2dd29112..68329d51a8a7 100644 --- a/include/uapi/linux/ext4.h +++ b/include/uapi/linux/ext4.h @@ -33,6 +33,7 @@ #define EXT4_IOC_CHECKPOINT _IOW('f', 43, __u32) #define EXT4_IOC_GETFSUUID _IOR('f', 44, struct fsuuid) #define EXT4_IOC_SETFSUUID _IOW('f', 44, struct fsuuid) +#define EXT4_IOC_SET_ERROR_BEHAVIOR _IOW('f', 45, __u16) #define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32) Thanks, - bin. ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write 2023-07-08 7:29 ` zhanchengbin @ 2023-07-12 0:05 ` Darrick J. Wong 2023-07-12 9:06 ` zhanchengbin 0 siblings, 1 reply; 9+ messages in thread From: Darrick J. Wong @ 2023-07-12 0:05 UTC (permalink / raw) To: zhanchengbin Cc: Theodore Ts'o, linux-ext4, linfeilong, louhongxiang, liuzhiqiang26, Ye Bin On Sat, Jul 08, 2023 at 03:29:51PM +0800, zhanchengbin wrote: > On 2023/7/5 3:33, Theodore Ts'o wrote: > > I have written the ioctl for EXT4_IOC_SET_ERROR_BEHAVIOR according to your > instructions and verified that it can indeed modify the data on the disk. > > However, I realized some problems. If we use the ioctl method to modify the > data, it would require multiple ioctls in user space to modify the > superblock. > If one ioctl successfully modifies the superblock data, but another ioctl > fails, the atomicity of the superblock cannot be guaranteed. This is just > within one user space process, not to mention the scenario of multiple user > space processes calling ioctls concurrently. Additionally, multiple ioctls > modifying the superblock may be placed in multiple transactions, which > further > compromises atomicity. > > Writing the entire superblock buffer_head to disk can ensure atomicity. ...at a cost of racing with the mounted fs, which might be updating the superblock at the same time; and prohibiting the kernel devs from closing the "scribble on mounted bdev" attack surface. How many of these attributes do you /really/ need to be able to commit atomically, anyway? If the system crashes, can't you re-run the program and end up with the same super fields? --D > However, these data need to be converted to ext4_sb_info. Otherwise, during > unmount, the data in memory will overwrite the data on the disk. > (Modifying the values in ext4_sb_info can potentially introduce unexpected > issues, just like the problem thata arises when setting the UUID without > checking ext4_has_feature_csum_seed.) > > > So the better approach is to have multiple new ioctl's for each > > superblock field (or set of fields) that you might want modify. We > > have some of these already --- for example, EXT4_IOC_SETFSUUID and > > FS_IOC_SETFSLABEL. For tune2fs, all of additional ioctls for making > > configuration changes while the file system is mounted are: > > > > * EXT4_IOC_SET_FEATURES > > - for tune2fs -O... > > * EXT4_IOC_CLEAR_FEATURES > > - for tune2fs -O^... > > * EXT4_IOC_SET_ERROR_BEHAVIOR > > - for tune2fs -e > > * EXT4_IOC_SET_DEFAULT_MOUNT_FLAGS > > - for tune2fs -o > > * EXT4_IOC_SET_DEFAULT_MOUNT_OPTS > > - for tune2fs -E mount_opts=XXX > > * EXT4_IOC_SET_FSCK_POLICY > > - for tune2fs -[cCiT] > > * EXT4_IOC_SET_RESERVED_ALLOC > > - for tune2fs -[ugm] > > diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c > index 331859511f80..d598e1975786 100644 > --- a/fs/ext4/ioctl.c > +++ b/fs/ext4/ioctl.c > @@ -51,6 +51,11 @@ static void ext4_sb_setuuid(struct ext4_super_block *es, > const void *arg) > memcpy(es->s_uuid, (__u8 *)arg, UUID_SIZE); > } > > +static void ext4_sb_set_error_behavior(struct ext4_super_block *es, const > void *arg) > +{ > + es->s_errors = cpu_to_le16(*(__u16 *)arg); > +} > + > static > int ext4_update_primary_sb(struct super_block *sb, handle_t *handle, > ext4_update_sb_callback func, > @@ -1220,6 +1225,32 @@ static int ext4_ioctl_setuuid(struct file *filp, > return ret; > } > > +static int ext4_ioctl_set_error_behavior(struct file *filp, > + const __u16 __user *uerror_behavior) > +{ > + int ret = 0; > + struct super_block *sb = file_inode(filp)->i_sb; > + __u16 error_behavior; > + > + if (!capable(CAP_SYS_ADMIN)) > + return -EPERM; > + > + if (copy_from_user(&error_behavior, uerror_behavior, > sizeof(error_behavior))) > + return -EFAULT; > + > + if (error_behavior < EXT4_ERRORS_CONTINUE || error_behavior > > EXT4_ERRORS_PANIC) > + return -EINVAL; > + > + ret = mnt_want_write_file(filp); > + if (ret) > + return ret; > + > + ret = ext4_update_superblocks_fn(sb, ext4_sb_set_error_behavior, > &error_behavior); > + mnt_drop_write_file(filp); > + > + return ret; > +} > + > static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long > arg) > { > struct inode *inode = file_inode(filp); > @@ -1607,6 +1638,8 @@ static long __ext4_ioctl(struct file *filp, unsigned > int cmd, unsigned long arg) > return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg); > case EXT4_IOC_SETFSUUID: > return ext4_ioctl_setuuid(filp, (const void __user *)arg); > + case EXT4_IOC_SET_ERROR_BEHAVIOR: > + return ext4_ioctl_set_error_behavior(filp, (const void __user *)arg); > default: > return -ENOTTY; > } > diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h > index 1c4c2dd29112..68329d51a8a7 100644 > --- a/include/uapi/linux/ext4.h > +++ b/include/uapi/linux/ext4.h > @@ -33,6 +33,7 @@ > #define EXT4_IOC_CHECKPOINT _IOW('f', 43, __u32) > #define EXT4_IOC_GETFSUUID _IOR('f', 44, struct fsuuid) > #define EXT4_IOC_SETFSUUID _IOW('f', 44, struct fsuuid) > +#define EXT4_IOC_SET_ERROR_BEHAVIOR _IOW('f', 45, __u16) > > #define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32) > > > > Thanks, > - bin. > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write 2023-07-12 0:05 ` Darrick J. Wong @ 2023-07-12 9:06 ` zhanchengbin 2023-07-12 15:42 ` Theodore Ts'o 0 siblings, 1 reply; 9+ messages in thread From: zhanchengbin @ 2023-07-12 9:06 UTC (permalink / raw) To: Darrick J. Wong Cc: Theodore Ts'o, linux-ext4, linfeilong, louhongxiang, liuzhiqiang26, Ye Bin On 2023/7/12 8:05, Darrick J. Wong wrote: > On Sat, Jul 08, 2023 at 03:29:51PM +0800, zhanchengbin wrote: >> On 2023/7/5 3:33, Theodore Ts'o wrote: >> >> I have written the ioctl for EXT4_IOC_SET_ERROR_BEHAVIOR according to your >> instructions and verified that it can indeed modify the data on the disk. >> >> However, I realized some problems. If we use the ioctl method to modify the >> data, it would require multiple ioctls in user space to modify the >> superblock. >> If one ioctl successfully modifies the superblock data, but another ioctl >> fails, the atomicity of the superblock cannot be guaranteed. This is just >> within one user space process, not to mention the scenario of multiple user >> space processes calling ioctls concurrently. Additionally, multiple ioctls >> modifying the superblock may be placed in multiple transactions, which >> further >> compromises atomicity. >> >> Writing the entire superblock buffer_head to disk can ensure atomicity. > > ...at a cost of racing with the mounted fs, which might be updating the > superblock at the same time; and prohibiting the kernel devs from > closing the "scribble on mounted bdev" attack surface. Regardless of whether I am modifying a single byte or the entire buffer_head, there will always be a situation of contention with the kernel lock, You can take a look at ext4_update_superblocks_fn which calls lock_buffer. What I am more concerned about is that the superblock needs to be synchronized to the memory before being saved to the disk. Otherwise, during the ext4_commit_super process, outdated data may be saved to the disk. > > How many of these attributes do you /really/ need to be able to commit My plan with Tytso is to add seven attribute modifications. > atomically, anyway? If the system crashes, can't you re-run the > program and end up with the same super fields? Just run the program again after the system crashes, o.O? I don't think so. The program perceives that the superblock has been modified successfully, and the value of the modified superblock is saved on disk in ext4_update_primary_sb, but there is no guarantee whether the super block is saved in journal on the disk or whether it is checkpointed. If the super block has not been saved in journal on the disk and the system crashes, the modification of the super block may be overwritten when the journal recover; similarly, this problem will also occur for the translation that has not been checkpointed; Both of these scenarios are not perceptible to user process unless there is a backup mechanism implemented in user mode. Moreover, the method of rerunning the program cannot resolve the conflicting racing condition between the two ioctls. Thanks, - bin. > > --D > >> However, these data need to be converted to ext4_sb_info. Otherwise, during >> unmount, the data in memory will overwrite the data on the disk. >> (Modifying the values in ext4_sb_info can potentially introduce unexpected >> issues, just like the problem thata arises when setting the UUID without >> checking ext4_has_feature_csum_seed.) ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write 2023-07-12 9:06 ` zhanchengbin @ 2023-07-12 15:42 ` Theodore Ts'o 0 siblings, 0 replies; 9+ messages in thread From: Theodore Ts'o @ 2023-07-12 15:42 UTC (permalink / raw) To: zhanchengbin Cc: Darrick J. Wong, linux-ext4, linfeilong, louhongxiang, liuzhiqiang26, Ye Bin On Wed, Jul 12, 2023 at 05:06:31PM +0800, zhanchengbin wrote: > > ...at a cost of racing with the mounted fs, which might be updating the > > superblock at the same time; and prohibiting the kernel devs from > > closing the "scribble on mounted bdev" attack surface. > > Regardless of whether I am modifying a single byte or the entire > buffer_head, there will always be a situation of contention with the kernel > lock, You can take a look at ext4_update_superblocks_fn which calls > lock_buffer. Many/most of the fields that tune2fs will need to modify are ones which the kernel never needs to modify. So no locking will be necessary, and so long as you are using the journal, we don't need to worry about an invalid checksum getting written to the disk. There might be races with buffered reads to the superblock, but those races exist today. E2fsprogs has a way of dealing this where if the checksum is invalid, it will just sleep and retry. Another way of dealing with is to use an O_DIRECT read to the superblock. Longer-term, we may want to have a EXT4_IOC_GET_SUPERBLOCK ioctl, at which point we may need to have some kind of new kernel locking ---- or we can just take a snapshot of the superblock, and check to see the checksum is valid; if not, it can just retry the snapshot. > What I am more concerned about is that the superblock needs to be > synchronized to the memory before being saved to the disk. Otherwise, during > the ext4_commit_super process, outdated data may be saved to the disk. As I've noted above, this is already not a problem if journalling is enabled. If journalling is not enabled, it's possible that an invalid superblock is written to disk. This can sort of happen already, if you have an orphan list update racing with an ext4_error() update to the superblock. It's a debateable point how much we should care, since if you don't have journalling enabled, on a crash the file system may be corrupted in many situations, and so we will need to use fsck.ext4 anyway. If there is only a single superblock, then fsck might not have a fallback superblock to use, but arguably that's a corner case. > The program perceives that the superblock has been modified > successfully, and the value of the modified superblock is saved on > disk in ext4_update_primary_sb, but there is no guarantee whether > the super block is saved in journal on the disk or whether it is > checkpointed. So in actual practice, e2fsprogs will replay the journal and then reread the superblock. So if you change the max_mounts_count via tune2fs -c (even though very few systems use that these days, and it's largely a deprecated feature), so long as the transaction has been committed, the superblock update will be honored by e2fsck. I'm not sure we care *all* that much, but if we really want to make sure things like tune2fs -c will always "take" after a crash, we can simply have the ioctl force a journal commit before returning. > If the super block has not been saved in journal on > the disk and the system crashes, the modification of the super block > may be overwritten when the journal recover; similarly, this problem > will also occur for the translation that has not been checkpointed; > Both of these scenarios are not perceptible to user process unless > there is a backup mechanism implemented in user mode. We now *always* update the superblock through the journal. We only fall back to a direct update of the superblock if the journal is not present, or the journal has aborted due to some kind of fundamental failure (so that the ext4_error can be written out before we reboot the system or force the file system to be read-only). > Moreover, the method of rerunning the program cannot resolve the conflicting > racing condition between the two ioctls. These ioctls are quite rarely used, and it's a problem we have today if we have two racing tune2fs commands. By having the kernel handle it, so long as the two ioctls are modifying different superblock fields, both ioctl updates will happen --- where as today, when we have two racing tune2fs, one of the tune2fs updates could be completely lost. This has been true for decades in ext2, ext3, and ext4, and no one has actually reported this as a problem. Cheers, ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-07-12 15:43 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-06-25 16:00 [bug report] tune2fs: filesystem inconsistency occurs by concurrent write zhanchengbin 2023-06-26 2:17 ` Theodore Ts'o 2023-06-26 11:56 ` zhanchengbin 2023-07-04 8:35 ` zhanchengbin 2023-07-04 19:33 ` Theodore Ts'o 2023-07-08 7:29 ` zhanchengbin 2023-07-12 0:05 ` Darrick J. Wong 2023-07-12 9:06 ` zhanchengbin 2023-07-12 15:42 ` Theodore Ts'o
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).