* [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: PiX @ 2010-02-18 16:45 UTC
To: linux-fsdevel

I'm seeing some strange behaviour. After remounting / read-only, I compute a
sha1sum of /dev/sda1 (which is mounted on /). Then I reboot, GRUB starts the
kernel with the ro option, and when I hash /dev/sda1 again the sum has changed.

This only happens when the rootfs has been mounted ro, then remounted rw to
make some changes, and remounted ro again. On the next reboot the hash
changes, but only once. Subsequent reboots do not alter the checksum, until of
course I remount it rw again.

Here's a small shell script that shows that some changes are made to the
device between "mount -o remount,ro" and umount. On my system it might run
once before failing, or twenty times.

For the record:

$ uname -a
Linux cortex 2.6.32-ARCH #1 SMP PREEMPT Tue Feb 9 14:46:08 UTC 2010
i686 Intel(R) Core(TM)2 Duo CPU E4400 @ 2.00GHz GenuineIntel GNU/Linux

--
Camille Moncelier
http://devlife.org/

If Java had true garbage collection, most programs would
delete themselves upon execution.

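A minimal sketch of the manual check being described, reconstructed from the
report above (untested; the device name and paths are only examples):

  mount -o remount,rw /
  touch /root/some-change            # make some change on the rootfs
  mount -o remount,ro /
  sha1sum /dev/sda1                  # note this hash somewhere off the disk
  reboot                             # GRUB boots the kernel with "ro"
  # after the reboot, with / still mounted read-only:
  sha1sum /dev/sda1                  # per the report, this differs exactly once
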
* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Camille Moncelier @ 2010-02-18 16:50 UTC
To: linux-fsdevel

I just forgot to post the script:

------------ 8< test.sh 8< ------------
#!/bin/bash

while true; do
    dd bs=1M count=3 if=/dev/zero of=/tmp/sample.img > /tmp/testsha.log 2>&1
    losetup /dev/loop0 /tmp/sample.img >> /tmp/testsha.log 2>&1
    mkfs.ext3 /dev/loop0               >> /tmp/testsha.log 2>&1

    mount /dev/loop0 /mnt              >> /tmp/testsha.log 2>&1
    touch /mnt/test                    >> /tmp/testsha.log 2>&1
    echo 's' > /proc/sysrq-trigger
    tail -n1 /var/log/messages.log     >> /tmp/testsha.log 2>&1

    mount -o remount,ro,sync,dirsync /mnt >> /tmp/testsha.log 2>&1
    dd if=/dev/loop0 of=/tmp/hash1.img    >> /tmp/testsha.log 2>&1
    HASH_1=$(sha1sum /tmp/hash1.img | cut -d" " -f1)

    umount /mnt                           >> /tmp/testsha.log 2>&1
    dd if=/dev/loop0 of=/tmp/hash2.img    >> /tmp/testsha.log 2>&1
    HASH_2=$(sha1sum /tmp/hash2.img | cut -d" " -f1)

    mount -o ro /dev/loop0 /mnt           >> /tmp/testsha.log 2>&1
    umount /mnt                           >> /tmp/testsha.log 2>&1
    dd if=/dev/loop0 of=/tmp/hash3.img    >> /tmp/testsha.log 2>&1
    HASH_3=$(sha1sum /tmp/hash3.img | cut -d" " -f1)

    losetup -d /dev/loop0                 >> /tmp/testsha.log 2>&1

    if [ "$HASH_1" = "$HASH_2" -a "$HASH_1" = "$HASH_3" ]; then
        rm /tmp/hash{1,2,3}.img
        echo All right !
    else
        echo Something gone wrong:
        cat /tmp/testsha.log
        echo HASH_1: $HASH_1
        echo HASH_2: $HASH_2
        echo HASH_3: $HASH_3
        exit 1
    fi
done
------------ 8< test.sh 8< ------------

----------------- 8< -----------------
$ sudo bash test.sh
All right !
All right !
Something gone wrong:
3+0 records in
3+0 records out
3145728 bytes (3.1 MB) copied, 0.00673144 s, 467 MB/s
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
768 inodes, 3072 blocks
153 blocks (4.98%) reserved for the super user
First data block=1
Maximum filesystem blocks=3145728
1 block group
8192 blocks per group, 8192 fragments per group
768 inodes per group

Writing inode tables: done
Creating journal (1024 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 23 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
Feb 18 17:40:15 graves kernel: EXT3-fs: mounted filesystem with writeback data mode.
6144+0 records in
6144+0 records out
3145728 bytes (3.1 MB) copied, 0.0271642 s, 116 MB/s
6144+0 records in
6144+0 records out
3145728 bytes (3.1 MB) copied, 0.0315827 s, 99.6 MB/s
6144+0 records in
6144+0 records out
3145728 bytes (3.1 MB) copied, 0.0293324 s, 107 MB/s
HASH_1: 5a08a18d139fced51a874a091c52eb31c36d37b5
HASH_2: 6114873f3ef93a4a2636a04e172dd3a80ea54950
HASH_3: 6114873f3ef93a4a2636a04e172dd3a80ea54950
----------------- 8< -----------------

--
Camille Moncelier
http://devlife.org/

If Java had true garbage collection, most programs would
delete themselves upon execution.

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Andreas Dilger @ 2010-02-18 21:41 UTC
To: PiX; +Cc: linux-fsdevel, ext4 development

On 2010-02-18, at 09:45, PiX wrote:
> I'm seeing some strange behaviour. After remounting / read-only, I compute a
> sha1sum of /dev/sda1 (which is mounted on /). Then I reboot, GRUB starts the
> kernel with the ro option, and when I hash /dev/sda1 again the sum has
> changed.

Are you sure this isn't because e2fsck has been run at boot time and changed
e.g. the "last checked" timestamp in the superblock?

> This only happens when the rootfs has been mounted ro, then remounted rw to
> make some changes, and remounted ro again. On the next reboot the hash
> changes, but only once. Subsequent reboots do not alter the checksum, until
> of course I remount it rw again.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Camille Moncelier @ 2010-02-19  7:38 UTC
To: linux-fsdevel; +Cc: Andreas Dilger, ext4 development

On Thu, Feb 18, 2010 at 10:41 PM, Andreas Dilger <adilger@sun.com> wrote:
> Are you sure this isn't because e2fsck has been run at boot time and changed
> e.g. the "last checked" timestamp in the superblock?

No, I replaced /sbin/init with something that computes the sha1sum of the
root partition, displays it and then calls /sbin/init, and I can see that the
hash has changed after mount -o remount,ro.

As far as I understand it, I managed to diff two hexdumps of small images
where changes happened after I created a file and remounted the fs read-only,
and it seems that the driver didn't write the changes to disk until unmount
(the hexdump clearly shows that /lost+found and the /test file were only
written out at umount).

Workaround: is there some knob in /proc or /sys which can force all pending
changes to disk? (Like /proc/sys/vm/drop_caches, but for filesystems?)

--
Camille Moncelier
http://devlife.org/

If Java had true garbage collection, most programs would
delete themselves upon execution.

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Jan Kara @ 2010-02-22 22:32 UTC
To: Camille Moncelier; +Cc: linux-fsdevel, Andreas Dilger, ext4 development

> On Thu, Feb 18, 2010 at 10:41 PM, Andreas Dilger <adilger@sun.com> wrote:
> > Are you sure this isn't because e2fsck has been run at boot time and
> > changed e.g. the "last checked" timestamp in the superblock?
>
> No, I replaced /sbin/init with something that computes the sha1sum of the
> root partition, displays it and then calls /sbin/init, and I can see that
> the hash has changed after mount -o remount,ro.
>
> As far as I understand it, I managed to diff two hexdumps of small images
> where changes happened after I created a file and remounted the fs
> read-only, and it seems that the driver didn't write the changes to disk
> until unmount (the hexdump clearly shows that /lost+found and the /test
> file were only written out at umount).
>
> Workaround: is there some knob in /proc or /sys which can force all pending
> changes to disk? (Like /proc/sys/vm/drop_caches, but for filesystems?)

I've looked at your script. The problem is that "echo s > /proc/sysrq-trigger"
isn't really a data-integrity operation. In particular, it does not wait for
the IO to finish (with the new writeback code it does not even wait for the IO
to be submitted), so you sometimes take the image checksum before the sync has
actually happened. If you use sync(1) instead, everything should work as
expected...

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs

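A minimal sketch of this suggestion applied to the relevant part of the
test.sh loop above (untested; it only swaps the sysrq trigger for sync(1),
everything else is unchanged from the original script):

  # replace "echo 's' > /proc/sysrq-trigger" with a blocking sync(1),
  # which waits for the writeback to complete before the image is hashed
  mount /dev/loop0 /mnt                 >> /tmp/testsha.log 2>&1
  touch /mnt/test                       >> /tmp/testsha.log 2>&1
  sync                                  # data-integrity sync: waits for IO
  mount -o remount,ro,sync,dirsync /mnt >> /tmp/testsha.log 2>&1
  dd if=/dev/loop0 of=/tmp/hash1.img    >> /tmp/testsha.log 2>&1
  HASH_1=$(sha1sum /tmp/hash1.img | cut -d" " -f1)
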
* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Jan Kara @ 2010-02-22 23:05 UTC
To: Camille Moncelier; +Cc: linux-fsdevel, Andreas Dilger, ext4 development

Hmm, and apparently there is some subtlety in the loopback device code,
because even when I use sync(1) the first and second images sometimes differ
(although it's much rarer). I see the commit block of the transaction already
in the first image (the commit block is written last), but the contents of
the transaction are present only in the second image.

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Andreas Dilger @ 2010-02-22 23:09 UTC
To: Jan Kara; +Cc: Camille Moncelier, linux-fsdevel, ext4 development

On 2010-02-22, at 16:05, Jan Kara wrote:
> Hmm, and apparently there is some subtlety in the loopback device code,
> because even when I use sync(1) the first and second images sometimes
> differ (although it's much rarer). I see the commit block of the
> transaction already in the first image (the commit block is written last),
> but the contents of the transaction are present only in the second image.

It has never been safe to run ext3 on top of a loop device, because the loop
device does not preserve ordering, and I'm not sure whether it properly
passes barriers either.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Camille Moncelier @ 2010-02-23  8:42 UTC
To: linux-fsdevel, ext4 development

The fact is that I've been able to reproduce the problem on LVM block devices
and on sd* block devices, so it's definitely not a loop-device-specific
problem.

By the way, I tried several other things besides "echo s > /proc/sysrq-trigger":
multiple syncs followed by a one-minute sleep, and
"echo 3 > /proc/sys/vm/drop_caches". That seems to lower the chances of a
"hash change" but doesn't stop them.

--
Camille Moncelier
http://devlife.org/

If Java had true garbage collection, most programs would
delete themselves upon execution.

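For reference, a rough sketch of the flush sequence being described here (an
approximation of the attempts, not the exact commands from the mail):

  # tried in place of the sysrq-based sync, before taking the first hash:
  sync; sync
  sleep 60
  echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
  mount -o remount,ro,sync,dirsync /mnt
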
* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Jan Kara @ 2010-02-23 13:55 UTC
To: Camille Moncelier; +Cc: linux-fsdevel, ext4 development

> The fact is that I've been able to reproduce the problem on LVM block
> devices and on sd* block devices, so it's definitely not a
> loop-device-specific problem.
>
> By the way, I tried several other things besides
> "echo s > /proc/sysrq-trigger": multiple syncs followed by a one-minute
> sleep, and "echo 3 > /proc/sys/vm/drop_caches". That seems to lower the
> chances of a "hash change" but doesn't stop them.

Strange. When I use sync(1) in your script and use /dev/sda5 instead of
/dev/loop0, I cannot reproduce the problem (I was running the script for
something like an hour). So can you send me (or put up somewhere) the
differing filesystem images you get when running on /dev/sd* with sync in
your script?

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Dmitry Monakhov @ 2010-02-24 16:01 UTC
To: Jan Kara; +Cc: Camille Moncelier, linux-fsdevel@vger.kernel.org, ext4 development

Jan Kara <jack@suse.cz> writes:
>> The fact is that I've been able to reproduce the problem on LVM block
>> devices and on sd* block devices, so it's definitely not a
>> loop-device-specific problem.
>>
>> By the way, I tried several other things besides
>> "echo s > /proc/sysrq-trigger": multiple syncs followed by a one-minute
>> sleep, and "echo 3 > /proc/sys/vm/drop_caches". That seems to lower the
>> chances of a "hash change" but doesn't stop them.
> Strange. When I use sync(1) in your script and use /dev/sda5 instead of
> /dev/loop0, I cannot reproduce the problem (I was running the script for
> something like an hour).

Theoretically, some dirty pages may still exist after an rw=>ro remount
because of a generic race between write and sync, and they will be written
out by writepage if the page already has buffers. This does not happen in
ext4, because each time it performs writepages it tries to start a journal
handle, and on a read-only filesystem that returns EROFS. The race bug will
be closed some day, but a new one may appear again.

Let's be honest and change ext3 writepage as follows:
 - check the RDONLY flag inside writepage
 - dump writepage's errors.

[-- Attachment: 0001-ext3-add-sanity-checks-to-writeback.patch --]

From a7cadf8017626cd80fcd8ea5a0e4deff4f63e02e Mon Sep 17 00:00:00 2001
From: Dmitry Monakhov <dmonakhov@openvz.org>
Date: Wed, 24 Feb 2010 18:17:58 +0300
Subject: [PATCH] ext3: add sanity checks to writeback

There is a theoretical possibility of performing writepage on a read-only
superblock. Add an explicit check for that case. In fact writepage may fail
for a number of reasons. This is a really rare case, but it may still result
in data loss; at the very least we have to print an error message.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext3/inode.c |   40 +++++++++++++++++++++++++++++++++++-----
 1 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 455e6e6..cf0e3aa 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1536,6 +1536,11 @@ static int ext3_ordered_writepage(struct page *page,
 	if (ext3_journal_current_handle())
 		goto out_fail;
 
+	if (inode->i_sb->s_flags & MS_RDONLY) {
+		err = -EROFS;
+		goto out_fail;
+	}
+
 	if (!page_has_buffers(page)) {
 		create_empty_buffers(page, inode->i_sb->s_blocksize,
 				(1 << BH_Dirty)|(1 << BH_Uptodate));
@@ -1546,7 +1551,8 @@ static int ext3_ordered_writepage(struct page *page,
 			NULL, buffer_unmapped)) {
 			/* Provide NULL get_block() to catch bugs if buffers
 			 * weren't really mapped */
-			return block_write_full_page(page, NULL, wbc);
+			ret = block_write_full_page(page, NULL, wbc);
+			goto out;
 		}
 	}
 	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
@@ -1584,12 +1590,17 @@ static int ext3_ordered_writepage(struct page *page,
 	err = ext3_journal_stop(handle);
 	if (!ret)
 		ret = err;
+out:
+	if (ret)
+		ext3_msg(inode->i_sb, KERN_CRIT, "%s: failed "
+			"%ld pages, ino %lu; err %d\n", __func__,
+			wbc->nr_to_write, inode->i_ino, ret);
 	return ret;
 
 out_fail:
 	redirty_page_for_writepage(wbc, page);
 	unlock_page(page);
-	return ret;
+	goto out;
 }
 
 static int ext3_writeback_writepage(struct page *page,
@@ -1603,12 +1614,18 @@ static int ext3_writeback_writepage(struct page *page,
 	if (ext3_journal_current_handle())
 		goto out_fail;
 
+	if (inode->i_sb->s_flags & MS_RDONLY) {
+		err = -EROFS;
+		goto out_fail;
+	}
+
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0,
 				      PAGE_CACHE_SIZE, NULL, buffer_unmapped)) {
 			/* Provide NULL get_block() to catch bugs if buffers
 			 * weren't really mapped */
-			return block_write_full_page(page, NULL, wbc);
+			ret = block_write_full_page(page, NULL, wbc);
+			goto out;
 		}
 	}
 
@@ -1626,12 +1643,17 @@ static int ext3_writeback_writepage(struct page *page,
 	err = ext3_journal_stop(handle);
 	if (!ret)
 		ret = err;
+out:
+	if (ret)
+		ext3_msg(inode->i_sb, KERN_CRIT, "%s: failed "
+			"%ld pages, ino %lu; err %d\n", __func__,
+			wbc->nr_to_write, inode->i_ino, ret);
 	return ret;
 
 out_fail:
 	redirty_page_for_writepage(wbc, page);
 	unlock_page(page);
-	return ret;
+	goto out;
 }
 
 static int ext3_journalled_writepage(struct page *page,
@@ -1645,6 +1667,11 @@ static int ext3_journalled_writepage(struct page *page,
 	if (ext3_journal_current_handle())
 		goto no_write;
 
+	if (inode->i_sb->s_flags & MS_RDONLY) {
+		err = -EROFS;
+		goto no_write;
+	}
+
 	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
@@ -1684,8 +1711,11 @@ static int ext3_journalled_writepage(struct page *page,
 	if (!ret)
 		ret = err;
 out:
+	if (ret)
+		ext3_msg(inode->i_sb, KERN_CRIT, "%s: failed "
+			"%ld pages, ino %lu; err %d\n", __func__,
+			wbc->nr_to_write, inode->i_ino, ret);
 	return ret;

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Camille Moncelier @ 2010-02-24 16:26 UTC
To: linux-fsdevel@vger.kernel.org, ext4 development; +Cc: Jan Kara

> Theoretically, some dirty pages may still exist after an rw=>ro remount
> because of a generic race between write and sync, and they will be written
> out by writepage if the page already has buffers. This does not happen in
> ext4, because each time it performs writepages it tries to start a journal
> handle, and on a read-only filesystem that returns EROFS. The race bug will
> be closed some day, but a new one may appear again.
>
> Let's be honest and change ext3 writepage as follows:
>  - check the RDONLY flag inside writepage
>  - dump writepage's errors.

I don't think I understand your patch correctly. It seems to me that when an
ext3 filesystem is remounted ro, some data may not have been written to disk,
right? But as far as I understand, some writes are performed on the journal
during the remount-ro, before the ro flag is set. So if writepage comes into
play and writes data to disk, it may have to update the journal again, no? If
not, wouldn't that mean the journal references data that isn't available on
disk?

Last question: would it be hard to implement a patch that triggers writepage
and waits for completion when remounting read-only? (I have no expertise in
filesystems in general, but I tried my best to understand the ext3 driver.)

--
Camille Moncelier
http://devlife.org/

If Java had true garbage collection, most programs would
delete themselves upon execution.

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Jan Kara @ 2010-02-24 16:59 UTC
To: Camille Moncelier; +Cc: linux-fsdevel@vger.kernel.org, ext4 development, Jan Kara

On Wed 24-02-10 17:26:37, Camille Moncelier wrote:
> I don't think I understand your patch correctly. It seems to me that when
> an ext3 filesystem is remounted ro, some data may not have been written to
> disk, right?

I think that Dmitry was concerned about the fact that a process could open a
file and write to it after we have synced the filesystem in do_remount_sb().

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Jan Kara @ 2010-02-24 16:56 UTC
To: Dmitry Monakhov; +Cc: Jan Kara, Camille Moncelier, linux-fsdevel@vger.kernel.org, ext4 development, hch, viro

On Wed 24-02-10 19:01:27, Dmitry Monakhov wrote:
> Theoretically, some dirty pages may still exist after an rw=>ro remount
> because of a generic race between write and sync, and they will be written
> out by writepage if the page already has buffers. This does not happen in
> ext4, because each time it performs writepages it tries to start a journal
> handle, and on a read-only filesystem that returns EROFS. The race bug will
> be closed some day, but a new one may appear again.

OK, I see that in theory a process can open a file for writing after the
fs_may_remount_ro() check but before the MS_RDONLY flag gets set. That could
be really nasty. But by no means should we solve this VFS problem by spilling
error messages from the filesystem, especially because block_write_full_page()
can fail for a number of legitimate reasons (ENOSPC, EDQUOT, EIO) and we don't
want to pollute the logs with such stuff.

BTW: this isn't the race Camille is seeing, because he did all the writes,
then the sync, and then the remount-ro...

Al, Christoph, am I missing something, or is there really nothing which
prevents a process from opening a file after the fs_may_remount_ro() check in
do_remount_sb()?

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

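To make the suspected window concrete, a hypothetical userspace illustration
(not from the thread; the mount point and file name are invented, and whether
anything actually slips through depends entirely on timing):

  # a writer that keeps opening files for write can, in principle, slip in
  # between the fs_may_remount_ro() check and the point where MS_RDONLY
  # takes effect, leaving dirty pages behind on a now read-only filesystem
  ( while true; do date > /mnt/racer; done 2>/dev/null ) &
  WRITER=$!
  mount -o remount,ro /mnt     # may still succeed even if a write just landed
  kill $WRITER
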
* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Christoph Hellwig @ 2010-03-02  9:34 UTC
To: Jan Kara; +Cc: Dmitry Monakhov, Camille Moncelier, linux-fsdevel@vger.kernel.org, ext4 development, hch, viro

On Wed, Feb 24, 2010 at 05:56:46PM +0100, Jan Kara wrote:
> OK, I see that in theory a process can open a file for writing after the
> fs_may_remount_ro() check but before the MS_RDONLY flag gets set. That
> could be really nasty.

Not just in theory, but also in practice. We can easily hit this under load
with XFS.

> But by no means should we solve this VFS problem by spilling error messages
> from the filesystem.

Exactly.

> Al, Christoph, am I missing something, or is there really nothing which
> prevents a process from opening a file after the fs_may_remount_ro() check
> in do_remount_sb()?

No, there is nothing. We really do need a multi-stage remount read-only
process:

 1) stop any writes from userland, that is, opening new files writeable
 2) stop any periodic writeback from the VM or internal to the filesystem
 3) write out all filesystem data and metadata
 4) mark the filesystem fully read-only

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Dmitry Monakhov @ 2010-03-02 10:01 UTC
To: Christoph Hellwig; +Cc: Jan Kara, Camille Moncelier, linux-fsdevel@vger.kernel.org, ext4 development, viro

Christoph Hellwig <hch@lst.de> writes:
> No, there is nothing. We really do need a multi-stage remount read-only
> process:
>
>  1) stop any writes from userland, that is, opening new files writeable

This is not such a good idea, because the sync may take a really long time:

  # fsstress -p32 -d /mnt/TEST -l9999999 -n99999999 -z -f creat=100 -f write=100
  # sleep 60
  # killall -9 fsstress
  # time mount mnt -oremount,ro

It takes several minutes to complete, and at the end it may still fail for
another reason.

>  2) stop any periodic writeback from the VM or internal to the filesystem
>  3) write out all filesystem data and metadata
>  4) mark the filesystem fully read-only

I've tried to solve the issue in a slightly different way. Please take a look
at this:
http://marc.info/?l=linux-fsdevel&m=126723036525624&w=2

 1) Mark the fs as GOING_TO_REMOUNT.
 2) Any new writer clears this flag; this allows us not to block writers yet.
 3) Check the flag before the fs sync and after it, and return EBUSY if it
    has been cleared.
 4) At this point we may block writers (this is absent in my patch).
    It is acceptable to block writers now because the later stages don't
    take too long.
 5) Perform the fs-specific remount method.
 6) Mark the filesystem MS_RDONLY.

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Jan Kara @ 2010-03-02 13:26 UTC
To: Dmitry Monakhov; +Cc: Christoph Hellwig, Jan Kara, Camille Moncelier, linux-fsdevel@vger.kernel.org, ext4 development, viro

On Tue 02-03-10 13:01:52, Dmitry Monakhov wrote:
> Christoph Hellwig <hch@lst.de> writes:
> >  1) stop any writes from userland, that is, opening new files writeable
>
> This is not such a good idea, because the sync may take a really long time.

Two points here:

1) The current writeback code has a bug: while we are umounting/remounting,
   sync_filesystem() just degrades to doing all writeback in sync mode
   (because any non-sync writeback fails to get the s_umount semaphore for
   reading and thus skips all the inodes of the superblock). This has a
   considerable impact on the speed of sync during umount / remount.

2) IMHO it's not bad to block all opens for writing while remounting
   read-only (and thus also during the sync). It's not a performance issue
   (remounting read-only does not happen often), and it won't confuse any
   application even if we later decide we cannot really finish the remount.
   Surely we'd have to come up with a better waiting scheme than just
   cpu_relax() in mnt_want_write(), but that shouldn't be hard. The only
   thing I'm slightly worried about is whether we'd hit some locking issues
   (i.e., a caller of mnt_want_write() holding some lock needed to finish
   the remount...).

> I've tried to solve the issue in a slightly different way. Please take a
> look at this:
> http://marc.info/?l=linux-fsdevel&m=126723036525624&w=2
>
>  1) Mark the fs as GOING_TO_REMOUNT.
>  2) Any new writer clears this flag; this allows us not to block writers yet.
>  3) Check the flag before the fs sync and after it, and return EBUSY if it
>     has been cleared.
>  4) At this point we may block writers (this is absent in my patch).
>  5) Perform the fs-specific remount method.
>  6) Mark the filesystem MS_RDONLY.

I like my solution more, since with it the admin does not have to go hunting
for an application which keeps touching the filesystem while he is trying to
remount it read-only (currently, using lsof is usually enough, but after your
changes, running something like "while true; do touch /mnt/; done" has a much
larger window in which to stop the remount). But in principle your solution
is acceptable to me as well.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Joel Becker @ 2010-03-02 23:10 UTC
To: Christoph Hellwig; +Cc: Jan Kara, Dmitry Monakhov, Camille Moncelier, linux-fsdevel@vger.kernel.org, ext4 development, viro

On Tue, Mar 02, 2010 at 10:34:31AM +0100, Christoph Hellwig wrote:
> No, there is nothing. We really do need a multi-stage remount read-only
> process:
>
>  1) stop any writes from userland, that is, opening new files writeable
>  2) stop any periodic writeback from the VM or internal to the filesystem
>  3) write out all filesystem data and metadata
>  4) mark the filesystem fully read-only

If you can code this up in a happily accessible way, we can use it in ocfs2
to handle some error cases without puking. That would make us very happy.
Specifically, we haven't yet taken the time to audit how we would ensure
step (2).

Joel

--
"Reader, suppose you were an idiot. And suppose you were a member of
 Congress. But I repeat myself."
        - Mark Twain

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Eric Sandeen @ 2010-02-24 16:57 UTC
To: Dmitry Monakhov; +Cc: Jan Kara, Camille Moncelier, linux-fsdevel@vger.kernel.org, ext4 development

Dmitry Monakhov wrote:
> Theoretically, some dirty pages may still exist after an rw=>ro remount
> because of a generic race between write and sync, and they will be written
> out by writepage if the page already has buffers. This does not happen in
> ext4, because each time it performs writepages it tries to start a journal
> handle, and on a read-only filesystem that returns EROFS. The race bug will
> be closed some day, but a new one may appear again.
>
> Let's be honest and change ext3 writepage as follows:
>  - check the RDONLY flag inside writepage
>  - dump writepage's errors.

That sounds like the wrong approach to me; we really need to fix the root
cause and make remount,ro finish the job, I think.

Throwing away writes which an application already thinks are completed, just
because remount,ro didn't keep up, sounds like a bad idea. I think I would
much rather have the write complete shortly after the read-only transition,
if I had to choose...

I haven't looked at these paths at all, but just hand-wavingly: remount,ro
should follow pretty much the same path as freeze, I think. And if freeze
isn't getting everything on disk, we have an even bigger problem.

-Eric

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Jan Kara @ 2010-02-24 17:05 UTC
To: Eric Sandeen; +Cc: Dmitry Monakhov, Jan Kara, Camille Moncelier, linux-fsdevel@vger.kernel.org, ext4 development

On Wed 24-02-10 10:57:59, Eric Sandeen wrote:
> Throwing away writes which an application already thinks are completed,
> just because remount,ro didn't keep up, sounds like a bad idea. I think I
> would much rather have the write complete shortly after the read-only
> transition, if I had to choose...

Well, my opinion is that the VFS should take care of the rw->ro transition so
that it isn't racy...

> I haven't looked at these paths at all, but just hand-wavingly: remount,ro
> should follow pretty much the same path as freeze, I think. And if freeze
> isn't getting everything on disk, we have an even bigger problem.

With freeze you can still keep dirty data in the cache until the filesystem
is unfrozen, so it's a different situation from the rw->ro transition.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Dmitry Monakhov @ 2010-02-24 17:26 UTC
To: Jan Kara; +Cc: Eric Sandeen, Camille Moncelier, linux-fsdevel@vger.kernel.org, ext4 development

Jan Kara <jack@suse.cz> writes:
> On Wed 24-02-10 10:57:59, Eric Sandeen wrote:
>> That sounds like the wrong approach to me; we really need to fix the root
>> cause and make remount,ro finish the job, I think.

Of course, but still: this is just a sanity check. A similar check in ext4
helped me find the generic issue. Of course it has to be guarded by an
unlikely() statement.

>> Throwing away writes which an application already thinks are completed,
>> just because remount,ro didn't keep up, sounds like a bad idea. I think I
>> would much rather have the write complete shortly after the read-only
>> transition, if I had to choose...
> Well, my opinion is that the VFS should take care of the rw->ro transition
> so that it isn't racy...

No. My patch just tries to nail the read-only semantics into writepage. Since
the other places are already guarded by start_journal, writepage is the only
one which may have a weakness.

About the ENOSPC/EDQUOT spam: it may not be bad to print an error message for
the crazy person who uses mmap on a sparse file.

>> I haven't looked at these paths at all, but just hand-wavingly: remount,ro
>> should follow pretty much the same path as freeze, I think. And if freeze
>> isn't getting everything on disk, we have an even bigger problem.
> With freeze you can still keep dirty data in the cache until the filesystem
> is unfrozen, so it's a different situation from the rw->ro transition.

In fact freeze is also not absolutely IO-proof :) When I worked on a COW
device I used freeze-fs for consistent image creation, and sometimes, after
the filesystem had been frozen, I would still get bios. We did not
investigate this too deeply and just queued those bios onto a pending queue.

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Jan Kara @ 2010-02-24 21:36 UTC
To: Dmitry Monakhov; +Cc: Jan Kara, Eric Sandeen, Camille Moncelier, linux-fsdevel@vger.kernel.org, ext4 development

On Wed 24-02-10 20:26:13, Dmitry Monakhov wrote:
> Of course, but still: this is just a sanity check. A similar check in ext4
> helped me find the generic issue. Of course it has to be guarded by an
> unlikely() statement.

Well, I think that something like WARN_ON_ONCE(IS_RDONLY(inode)); at the
beginning of every ext3 writepage implementation would be totally sufficient
for catching such bugs. Plus it has the advantage that it won't lose the
user's data if that can be avoided. So I'll take a patch in this direction.

> About the ENOSPC/EDQUOT spam: it may not be bad to print an error message
> for the crazy person who uses mmap on a sparse file.

I'm sorry, but I disagree. We set the error in the mapping and return the
error in case the user calls fsync() on the file. Now, I agree that most
applications will just miss that, but that's no excuse for us writing such
messages to the system log. The user just got what he told the system to do.
And yes, we could be nicer to applications by making sure at page-fault time
that we have space for the mmapped write. I actually have patches for that,
but they are stuck in the queue behind Nick's truncate-calling-sequence
rewrite.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [ext3] Changes to block device after an ext3 mount point has been remounted readonly
From: Nick Piggin @ 2010-03-02 10:29 UTC
To: Jan Kara; +Cc: Camille Moncelier, linux-fsdevel, Andreas Dilger, ext4 development

On Tue, Feb 23, 2010 at 12:05:52AM +0100, Jan Kara wrote:
> Hmm, and apparently there is some subtlety in the loopback device code,
> because even when I use sync(1) the first and second images sometimes
> differ (although it's much rarer). I see the commit block of the
> transaction already in the first image (the commit block is written last),
> but the contents of the transaction are present only in the second image.

Then I would guess that it might be running into the problem solved by this
commit in Al's tree (I don't think it has hit mainline yet). Hmm, now that I
look at the patch again, I can't remember whether I checked that a umount
also does the correct bdev invalidation. Better check that.

commit 17b0184495b52858fcc514aa0769801ac055b086
Author: Nick Piggin <npiggin@suse.de>
Date:   Mon Dec 21 16:28:53 2009 -0800

    fs: improve remount,ro vs buffercache coherency

    Invalidate sb->s_bdev on remount,ro.

    Fixes a problem reported by Jorge Boncompte who is seeing corruption
    trying to snapshot a minix filesystem image.

    Some filesystems modify their metadata via a path other than the bdev
    buffer cache (eg. they may use a private linear mapping for their
    metadata, or implement directories in pagecache, etc). Also, file data
    modifications usually go to the bdev via their own mappings.

    These updates are not coherent with buffercache IO (eg. via /dev/bdev)
    and never have been. However there could be a reasonable expectation
    that after a mount -oremount,ro operation then the buffercache should
    subsequently be coherent with previous filesystem modifications.

    So invalidate the bdev mappings on a remount,ro operation to provide a
    coherency point.

    The problem was exposed when we switched the old rd to brd, because old
    rd didn't really function like a normal block device and updates to rd
    via mappings other than the buffercache would still end up going into
    its buffercache. But the same problem has always affected other
    "normal" block devices, including loop.

    [akpm@linux-foundation.org: repair comment layout]
    Reported-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
    Tested-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/super.c b/fs/super.c
index aff046b..903896e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -568,7 +568,7 @@ out:
 int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 {
 	int retval;
-	int remount_rw;
+	int remount_rw, remount_ro;
 
 	if (sb->s_frozen != SB_UNFROZEN)
 		return -EBUSY;
@@ -583,9 +583,12 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 	shrink_dcache_sb(sb);
 	sync_filesystem(sb);
 
+	remount_ro = (flags & MS_RDONLY) && !(sb->s_flags & MS_RDONLY);
+	remount_rw = !(flags & MS_RDONLY) && (sb->s_flags & MS_RDONLY);
+
 	/* If we are remounting RDONLY and current sb is read/write,
 	   make sure there are no rw files opened */
-	if ((flags & MS_RDONLY) && !(sb->s_flags & MS_RDONLY)) {
+	if (remount_ro) {
 		if (force)
 			mark_files_ro(sb);
 		else if (!fs_may_remount_ro(sb))
@@ -594,7 +597,6 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 		if (retval < 0 && retval != -ENOSYS)
 			return -EBUSY;
 	}
-	remount_rw = !(flags & MS_RDONLY) && (sb->s_flags & MS_RDONLY);
 
 	if (sb->s_op->remount_fs) {
 		retval = sb->s_op->remount_fs(sb, &flags, data);
@@ -604,6 +606,16 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 	sb->s_flags = (sb->s_flags & ~MS_RMT_MASK) | (flags & MS_RMT_MASK);
 	if (remount_rw)
 		vfs_dq_quota_on_remount(sb);
+	/*
+	 * Some filesystems modify their metadata via some other path than the
+	 * bdev buffer cache (eg. use a private mapping, or directories in
+	 * pagecache, etc).  Also file data modifications go via their own
+	 * mappings.  So If we try to mount readonly then copy the filesystem
+	 * from bdev, we could get stale data, so invalidate it to give a best
+	 * effort at coherency.
+	 */
+	if (remount_ro && sb->s_bdev)
+		invalidate_bdev(sb->s_bdev);
 	return 0;
 }

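As a rough userspace analogue of the coherency point this patch adds (an
untested sketch, not something proposed in the thread): on a kernel without
the fix, the block device's buffer cache can be flushed and invalidated
explicitly before reading the image back, so the read reflects what is
actually on the device rather than stale cached buffers.

  mount -o remount,ro /mnt
  sync                               # push out anything still dirty
  blockdev --flushbufs /dev/loop0    # BLKFLSBUF: flush and invalidate the
                                     # bdev buffer cache
  dd if=/dev/loop0 of=/tmp/hash1.img
  sha1sum /tmp/hash1.img
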