* e2scrub finds corruption immediately after mounting
From: Brian J. Murrell @ 2024-01-03 21:14 UTC
To: linux-ext4

I am trying to migrate from lvcheck
(https://github.com/BryanKadzban/lvcheck) to using the officially
supported e2scrub[_all] kit.

I am finding that e2scrub very often (much more than lvcheck even)
finds corruption and wants me to do an offline e2fsck.  Not only does
it do this immediately after booting a system that includes filesystem
checks (that were caused by e2scrub previously setting a filesystem to
be checked on next boot), but it happens immediately after I run an
e2fsck and then mount the filesystem, even without any activity on it.
Observe:

# umount /opt
# e2fsck -y /dev/rootvol_tmp/almalinux8_opt
e2fsck 1.45.6 (20-Mar-2020)
/dev/mapper/rootvol_tmp-almalinux8_opt: clean, 1698/178816 files, 482404/716800 blocks
# e2scrub /dev/rootvol_tmp/almalinux8_opt
Logical volume "almalinux8_opt.e2scrub" created.
e2fsck 1.45.6 (20-Mar-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/rootvol_tmp/almalinux8_opt.e2scrub: 1698/178816 files (86.9% non-contiguous), 482404/716800 blocks
/dev/rootvol_tmp/almalinux8_opt: Scrub succeeded.
tune2fs 1.45.6 (20-Mar-2020)
Setting current mount count to 0
Setting time filesystem last checked to Wed Jan 3 11:37:04 2024

Logical volume "almalinux8_opt.e2scrub" successfully removed.
# mount /opt
# e2scrub /dev/rootvol_tmp/almalinux8_opt
Logical volume "almalinux8_opt.e2scrub" created.
e2fsck 1.45.6 (20-Mar-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/rootvol_tmp/almalinux8_opt.e2scrub: 1698/178816 files (86.9% non-contiguous), 482404/716800 blocks
/dev/rootvol_tmp/almalinux8_opt: Scrub FAILED due to corruption! Unmount and run e2fsck -y.
tune2fs 1.45.6 (20-Mar-2020)
Setting filesystem error flag to force fsck.
Logical volume "almalinux8_opt.e2scrub" successfully removed.

So as you can see, I unmount /opt, run an e2fsck -y on it to clean it
and then before mounting run e2scrub and it finds the filesystem clean.
Good so far.

I then mount it and then immediately run another e2scrub on it and that
finds it dirty and wants me to unmount and run another e2fsck -y on it.
But how can that be?  Surely an e2scrub on a freshly cleaned and
mounted filesystem (with no activity on it in between) should be clean,
yes?

Cheers,
b.
* Re: e2scrub finds corruption immediately after mounting
From: Theodore Ts'o @ 2024-01-04 4:38 UTC
To: Brian J. Murrell; +Cc: linux-ext4

On Wed, Jan 03, 2024 at 04:14:36PM -0500, Brian J. Murrell wrote:
> I am trying to migrate from lvcheck
> (https://github.com/BryanKadzban/lvcheck) to using the officially
> supported e2scrub[_all] kit.

What distribution are you using, and what version of the kernel are
you using?  I note that you are using e2fsprogs 1.45.6, and Debian
Stable is shipping with e2fsprogs 1.47.0.

That being said, this is the first time I've seen any report of an
issue like what you've reported.

> # e2scrub /dev/rootvol_tmp/almalinux8_opt
> Logical volume "almalinux8_opt.e2scrub" created.
> e2fsck 1.45.6 (20-Mar-2020)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/rootvol_tmp/almalinux8_opt.e2scrub: 1698/178816 files (86.9% non-
> contiguous), 482404/716800 blocks
> /dev/rootvol_tmp/almalinux8_opt: Scrub FAILED due to corruption!

This error means that e2fsck exited with a non-zero exit status.
Which is strange because there is no report of any kind of problem
from e2fsck in its output.

From the e2scrub script:

    check() {
        # First we recover the journal, then we see if e2fsck tries any
        # non-optimization repairs.  If either of these two returns a
        # non-zero status (errors fixed or remaining) then this fs is bad.
        E2FSCK_FIXES_ONLY=1
        export E2FSCK_FIXES_ONLY
        ${DBG} "@root_sbindir@/e2fsck" -E journal_only -p ${e2fsck_opts} "${snap_dev}" || return $?
        ${DBG} "@root_sbindir@/e2fsck" -f -y ${e2fsck_opts} "${snap_dev}"
    }

    ...

    check
    case "$?" in
    "0")
        # Clean check!
        echo "${arg}: Scrub succeeded."
        ...
    "8")
        # Operational error, what now?
        echo "${arg}: e2fsck operational error."
        ...
    *)
        # fsck failed.  Check if the snapshot is invalid; if so, make a
        # note of that at the end of the log.  This isn't necessarily a
        # failure because the mounted fs could have overflowed the
        # snapshot with regular disk writes /or/ our repair process
        # could have done it by repairing too much.
        #
        # If it's really corrupt we ought to fsck at next boot.
        is_invalid="$(lvs -o lv_snapshot_invalid --noheadings "${snap_dev}" | awk '{print $1}')"
        if [ -n "${is_invalid}" ]; then
            echo "${arg}: Scrub FAILED due to invalid snapshot."
            ret=8
        else
            echo "${arg}: Scrub FAILED due to corruption! Unmount and run e2fsck -y."
            mark_corrupt
            ret=6
        fi
        ...

My best guess is that e2fsck from 1.45.6 is somehow returning a
non-zero exit status for some reason.  So the first thing I'd suggest
is upgrading to e2fsprogs 1.47.0 and see if that causes the problem to
resolve itself.

Cheers,

- Ted
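For reference when reading the `case "$?"` branches quoted above: the e2fsck(8) man page documents the exit status as a sum of bit flags, so any non-zero value lands in either the "8" branch or the catch-all branch. The following is a minimal sketch of decoding that value; `describe_e2fsck_status` is a hypothetical helper name used only for illustration and is not part of e2scrub.

```shell
#!/bin/sh
# describe_e2fsck_status: hypothetical helper (not part of e2fsprogs).
# Decodes an e2fsck exit status using the bit values listed in e2fsck(8).
describe_e2fsck_status() {
    status="$1"
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    # Each bit is reported on its own line; several bits may be set at once.
    if [ $((status & 1)) -ne 0 ]; then echo "file system errors corrected"; fi
    if [ $((status & 2)) -ne 0 ]; then echo "system should be rebooted"; fi
    if [ $((status & 4)) -ne 0 ]; then echo "file system errors left uncorrected"; fi
    if [ $((status & 8)) -ne 0 ]; then echo "operational error"; fi
    if [ $((status & 16)) -ne 0 ]; then echo "usage or syntax error"; fi
    if [ $((status & 32)) -ne 0 ]; then echo "canceled by user request"; fi
    if [ $((status & 128)) -ne 0 ]; then echo "shared library error"; fi
    return 0
}
```

Calling `describe_e2fsck_status 1` prints "file system errors corrected", which means e2fsck changed something on the snapshot even when its visible output looks clean.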
* Re: e2scrub finds corruption immediately after mounting
From: Brian J. Murrell @ 2024-01-04 14:10 UTC
To: linux-ext4

On Wed, 2024-01-03 at 23:38 -0500, Theodore Ts'o wrote:
> What distribution are you using,

EL8, specifically AlmaLinux 8.9.

> and what version of the kernel are
> you using?

EL8 is currently shipping 4.18.0-513.9.1.el8_9.x86_64 but as you know
at this point in an EL8 kernel's life, the version hardly reflects
what's actually in the kernel due to the copious backporting RH do
their kernel.

> This error means that e2fsck exited with a non-zero exit status.
> Which is strange because there is no report of any kind of problem
> from e2fsck in its output.

Indeed!  I even added debug output to e2scrub to print e2fsck's exit
value and it's usually 1.

> My best guess is that e2fsck from 1.45.6 is somehow returning a
> non-zero exit status for some reason.  So the first thing I'd suggest
> is upgrading to e2fsprogs 1.47.0 and see if that causes the problem
> to
> resolve itself.

Unfortunately, that doesn't seem to be the solution.  :-(

+ umount /opt
+ e2fsck -y /dev/rootvol_tmp/almalinux8_opt
e2fsck 1.47.0 (5-Feb-2023)
/dev/rootvol_tmp/almalinux8_opt: clean, 1698/178816 files, 482473/716800 blocks
+ e2scrub /dev/rootvol_tmp/almalinux8_opt
Logical volume "almalinux8_opt.e2scrub" created.
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/rootvol_tmp/almalinux8_opt.e2scrub: 1698/178816 files (87.0% non-contiguous), 482473/716800 blocks
/dev/rootvol_tmp/almalinux8_opt: Scrub succeeded.
tune2fs 1.47.0 (5-Feb-2023)
Setting current mount count to 0
Setting time filesystem last checked to Thu Jan 4 09:07:56 2024

Logical volume "almalinux8_opt.e2scrub" successfully removed.
+ mount /opt
+ e2scrub /dev/rootvol_tmp/almalinux8_opt
Logical volume "almalinux8_opt.e2scrub" created.
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/rootvol_tmp/almalinux8_opt.e2scrub: 1698/178816 files (87.0% non-contiguous), 482473/716800 blocks
/dev/rootvol_tmp/almalinux8_opt: Scrub FAILED due to corruption! Unmount and run e2fsck -y.
tune2fs 1.47.0 (5-Feb-2023)
Setting filesystem error flag to force fsck.
Logical volume "almalinux8_opt.e2scrub" successfully removed.

Cheers,
b.
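The debug step mentioned above (printing e2fsck's exit value from inside e2scrub) can be approximated with a small wrapper. This is only a sketch under assumptions: `run_and_report` is a hypothetical name, and the actual debug change made to the script is not shown in the thread.

```shell
#!/bin/sh
# run_and_report: hypothetical wrapper (not part of e2scrub).
# Runs the given command, logs its exit status to stderr, and
# returns that same status so callers can still branch on it.
run_and_report() {
    rc=0
    "$@" || rc=$?
    echo "exit status of '$1': ${rc}" >&2
    return "${rc}"
}
```

Inside check(), each e2fsck call could then be prefixed with it, for example `run_and_report e2fsck -f -y "${snap_dev}"`, which keeps the `|| return $?` logic intact because the original status is passed through.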
* Re: e2scrub finds corruption immediately after mounting
From: Darrick J. Wong @ 2024-01-04 4:55 UTC
To: Brian J. Murrell; +Cc: linux-ext4

On Wed, Jan 03, 2024 at 04:14:36PM -0500, Brian J. Murrell wrote:
> I am trying to migrate from lvcheck
> (https://github.com/BryanKadzban/lvcheck) to using the officially
> supported e2scrub[_all] kit.
>
> I am finding that e2scrub very often (much more than lvcheck even)
> finds corruption and wants me to do an offline e2fsck.  Not only does
> it do this immediately after booting a system that includes filesystem
> checks (that were caused by e2scrub previously setting a filesystem to
> be checked on next boot), but it happens immediately after I run an
> e2fsck and then mount the filesystem, even without any activity on it.
> Observe:
>
> # umount /opt
> # e2fsck -y /dev/rootvol_tmp/almalinux8_opt
> e2fsck 1.45.6 (20-Mar-2020)
> /dev/mapper/rootvol_tmp-almalinux8_opt: clean, 1698/178816 files,
> 482404/716800 blocks
> # e2scrub /dev/rootvol_tmp/almalinux8_opt
> Logical volume "almalinux8_opt.e2scrub" created.
> e2fsck 1.45.6 (20-Mar-2020)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/rootvol_tmp/almalinux8_opt.e2scrub: 1698/178816 files (86.9% non-
> contiguous), 482404/716800 blocks
> /dev/rootvol_tmp/almalinux8_opt: Scrub succeeded.
> tune2fs 1.45.6 (20-Mar-2020)
> Setting current mount count to 0
> Setting time filesystem last checked to Wed Jan 3 11:37:04 2024
>
> Logical volume "almalinux8_opt.e2scrub" successfully removed.
> # mount /opt
> # e2scrub /dev/rootvol_tmp/almalinux8_opt
> Logical volume "almalinux8_opt.e2scrub" created.
> e2fsck 1.45.6 (20-Mar-2020)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/rootvol_tmp/almalinux8_opt.e2scrub: 1698/178816 files (86.9% non-
> contiguous), 482404/716800 blocks
> /dev/rootvol_tmp/almalinux8_opt: Scrub FAILED due to corruption!

Curious.  Normally e2scrub will run e2fsck twice: Once in journal-only
preen mode to replay the journal, then again with -fy to perform the
full filesystem (snapshot) check.

I wonder if you would paste the output of
"bash -x e2scrub /dev/rootvol_tmp/almalinux8_opt" here?  I'd be curious
to see what the command flow is.

Assuming that 1.47.0 doesn't magically fix it. :)

> Unmount and run e2fsck -y.
> tune2fs 1.45.6 (20-Mar-2020)
> Setting filesystem error flag to force fsck.
> Logical volume "almalinux8_opt.e2scrub" successfully removed.
>
> So as you can see, I unmount /opt, run an e2fsck -y on it to clean it
> and then before mounting run e2scrub and it finds the filesystem clean.
> Good so far.
>
> I then mount it and then immediately run another e2scrub on it and that
> finds it dirty and wants me to unmount and run another e2fsck -y on it.
> But how can that be?  Surely an e2scrub on a freshly cleaned and
> mounted filesystem (with no activity on it in between) should be clean,
> yes?

Right.  Unless something's broken in e2fsck. :/

--D

>
> Cheers,
> b.
>
* Re: e2scrub finds corruption immediately after mounting
From: Brian J. Murrell @ 2024-01-04 14:13 UTC
To: linux-ext4

On Wed, 2024-01-03 at 20:55 -0800, Darrick J. Wong wrote:
> Curious.  Normally e2scrub will run e2fsck twice: Once in journal-
> only
> preen mode to replay the journal, then again with -fy to perform the
> full filesystem (snapshot) check.

It is doing that.  I suspect the first e2fsck is silent.

> I wonder if you would paste the output of
> "bash -x e2scrub /dev/rootvol_tmp/almalinux8_opt" here?  I'd be
> curious
> to see what the command flow is.

Sure.

+ PATH=/sbin:/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/.dotnet/tools
+ (( 0 != 0 ))
+ snap_size_mb=256
+ fstrim=0
+ reap=0
+ e2fsck_opts=
+ conffile=/etc/e2scrub.conf
+ test -f /etc/e2scrub.conf
+ . /etc/e2scrub.conf
++ periodic_e2scrub=1
++ sender=e2scrub@pvr.interlinx.bc.ca
+ getopts nrtV opt
+ shift 0
+ arg=/dev/rootvol_tmp/almalinux8_opt
+ '[' -z /dev/rootvol_tmp/almalinux8_opt ']'
+ type lsblk
+ type lvcreate
+ exec
+ '[' -b /dev/rootvol_tmp/almalinux8_opt ']'
++ dev_from_arg /dev/rootvol_tmp/almalinux8_opt
++ local dev=/dev/rootvol_tmp/almalinux8_opt
+++ lsblk -o FSTYPE -n /dev/rootvol_tmp/almalinux8_opt
++ local fstype=ext2
++ case "${fstype}" in
++ echo /dev/rootvol_tmp/almalinux8_opt
++ return 0
+ dev=/dev/rootvol_tmp/almalinux8_opt
++ mnt_from_dev /dev/rootvol_tmp/almalinux8_opt
++ local dev=/dev/rootvol_tmp/almalinux8_opt
++ '[' -n /dev/rootvol_tmp/almalinux8_opt ']'
++ lsblk -o MOUNTPOINT -n /dev/rootvol_tmp/almalinux8_opt
+ mnt=/opt
+ '[' '!' -e /dev/rootvol_tmp/almalinux8_opt ']'
++ lvs --nameprefixes -o name,vgname,lv_role --noheadings /dev/rootvol_tmp/almalinux8_opt
+ lvm_vars=' LVM2_LV_NAME='\''almalinux8_opt'\'' LVM2_VG_NAME='\''rootvol_tmp'\'' LVM2_LV_ROLE='\''public'\'''
+ eval ' LVM2_LV_NAME='\''almalinux8_opt'\'' LVM2_VG_NAME='\''rootvol_tmp'\'' LVM2_LV_ROLE='\''public'\'''
++ LVM2_LV_NAME=almalinux8_opt
++ LVM2_VG_NAME=rootvol_tmp
++ LVM2_LV_ROLE=public
+ '[' -z rootvol_tmp ']'
+ '[' -z almalinux8_opt ']'
+ echo public
+ grep -q snapshot
++ date +%Y%m%d%H%M%S
+ start_time=20240104091039
+ snap=almalinux8_opt.e2scrub
+ snap_dev=/dev/rootvol_tmp/almalinux8_opt.e2scrub
+ '[' 0 -gt 0 ']'
+ setup
++ date +%s
+ lvremove_deadline=1704377469
+ lvremove -f rootvol_tmp/almalinux8_opt.e2scrub
+ '[' -e /dev/rootvol_tmp/almalinux8_opt.e2scrub ']'
+ '[' -e /dev/rootvol_tmp/almalinux8_opt.e2scrub ']'
+ lvcreate -s -L 256m -n almalinux8_opt.e2scrub rootvol_tmp/almalinux8_opt
Logical volume "almalinux8_opt.e2scrub" created.
+ '[' 0 -ne 0 ']'
+ udevadm settle
+ return 0
+ trap 'teardown; exit 1' EXIT INT QUIT TERM
+ check
+ E2FSCK_FIXES_ONLY=1
+ export E2FSCK_FIXES_ONLY
+ /usr/sbin/e2fsck -E journal_only -p /dev/rootvol_tmp/almalinux8_opt.e2scrub
+ /usr/sbin/e2fsck -f -y /dev/rootvol_tmp/almalinux8_opt.e2scrub
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/rootvol_tmp/almalinux8_opt.e2scrub: 1698/178816 files (87.0% non-contiguous), 482473/716800 blocks
+ case "$?" in
++ lvs -o lv_snapshot_invalid --noheadings /dev/rootvol_tmp/almalinux8_opt.e2scrub
++ awk '{print $1}'
+ is_invalid=
+ '[' -n '' ']'
+ echo '/dev/rootvol_tmp/almalinux8_opt: Scrub FAILED due to corruption! Unmount and run e2fsck -y.'
/dev/rootvol_tmp/almalinux8_opt: Scrub FAILED due to corruption! Unmount and run e2fsck -y.
+ mark_corrupt
+ /usr/sbin/tune2fs -E force_fsck /dev/rootvol_tmp/almalinux8_opt
tune2fs 1.47.0 (5-Feb-2023)
Setting filesystem error flag to force fsck.
+ ret=6
+ teardown
+ lvremove -f rootvol_tmp/almalinux8_opt.e2scrub
Logical volume "almalinux8_opt.e2scrub" successfully removed.
+ '[' -e /dev/rootvol_tmp/almalinux8_opt.e2scrub ']'
+ trap '' EXIT
+ exitcode 6
+ ret=6
+ '[' -n '' -a 6 -ne 0 ']'
+ exit 6

Cheers,
b.
* Re: e2scrub finds corruption immediately after mounting
From: Brian J. Murrell @ 2024-01-08 12:52 UTC
To: linux-ext4

On Thu, 2024-01-04 at 09:13 -0500, Brian J. Murrell wrote:
> On Wed, 2024-01-03 at 20:55 -0800, Darrick J. Wong wrote:
> > Curious.  Normally e2scrub will run e2fsck twice: Once in journal-
> > only
> > preen mode to replay the journal, then again with -fy to perform
> > the
> > full filesystem (snapshot) check.
>
> It is doing that.  I suspect the first e2fsck is silent.
>
> > I wonder if you would paste the output of
> > "bash -x e2scrub /dev/rootvol_tmp/almalinux8_opt" here?  I'd be
> > curious
> > to see what the command flow is.
>
> Sure.

Was the bash -x output useful in any way, or was any of the information
I supplied in my other replies on this list:

https://lore.kernel.org/linux-ext4/51aa3ceea05945c9f28e884bc2f43a249ef7e23e.camel@interlinx.bc.ca/
https://lore.kernel.org/linux-ext4/be5e8488f8484194889216603d2aba2812c6adcb.camel@interlinx.bc.ca/

useful including the test of 1.47.0 being able to reproduce the
behaviour?

Any thoughts on how to proceed?

Cheers,
b.
* Re: e2scrub finds corruption immediately after mounting
From: Darrick J. Wong @ 2024-01-09 6:06 UTC
To: Brian J. Murrell; +Cc: linux-ext4

On Mon, Jan 08, 2024 at 07:52:33AM -0500, Brian J. Murrell wrote:
> On Thu, 2024-01-04 at 09:13 -0500, Brian J. Murrell wrote:
> > On Wed, 2024-01-03 at 20:55 -0800, Darrick J. Wong wrote:
> > > Curious.  Normally e2scrub will run e2fsck twice: Once in journal-
> > > only
> > > preen mode to replay the journal, then again with -fy to perform
> > > the
> > > full filesystem (snapshot) check.
> >
> > It is doing that.  I suspect the first e2fsck is silent.
> >
> > > I wonder if you would paste the output of
> > > "bash -x e2scrub /dev/rootvol_tmp/almalinux8_opt" here?  I'd be
> > > curious
> > > to see what the command flow is.
> >
> > Sure.
>
> Was the bash -x output useful in any way, or was any of the information
> I supplied in my other replies on this list:
>
> https://lore.kernel.org/linux-ext4/51aa3ceea05945c9f28e884bc2f43a249ef7e23e.camel@interlinx.bc.ca/
> https://lore.kernel.org/linux-ext4/be5e8488f8484194889216603d2aba2812c6adcb.camel@interlinx.bc.ca/
>
> useful including the test of 1.47.0 being able to reproduce the
> behaviour?

It was good and bad -- good in that it eliminated all of my hypotheses
about what could be causing it; and bad in that now I have no idea.

*Something* is causing the e2fsck exit code to be nonzero, but there's
nothing identifying what did that in the stdout/stderr dump.

> Any thoughts on how to proceed?

If you're willing to share a metadata dump of the filesystem, injecting:

    e2image -Q "${snap_dev}" /tmp/disk.qcow2

right before the second e2fsck invocation in check() might help us get a
reproducer going.  Please compress the qcow2 file before uploading it
somewhere.

--D

> Cheers,
> b.
>
* Re: e2scrub finds corruption immediately after mounting
From: Darrick J. Wong @ 2024-01-10 5:31 UTC
To: Brian J. Murrell, tytso; +Cc: linux-ext4

On Mon, Jan 08, 2024 at 10:06:29PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 08, 2024 at 07:52:33AM -0500, Brian J. Murrell wrote:
> > On Thu, 2024-01-04 at 09:13 -0500, Brian J. Murrell wrote:
> > > On Wed, 2024-01-03 at 20:55 -0800, Darrick J. Wong wrote:
> > > > Curious.  Normally e2scrub will run e2fsck twice: Once in journal-
> > > > only
> > > > preen mode to replay the journal, then again with -fy to perform
> > > > the
> > > > full filesystem (snapshot) check.
> > >
> > > It is doing that.  I suspect the first e2fsck is silent.
> > >
> > > > I wonder if you would paste the output of
> > > > "bash -x e2scrub /dev/rootvol_tmp/almalinux8_opt" here?  I'd be
> > > > curious
> > > > to see what the command flow is.
> > >
> > > Sure.
> >
> > Was the bash -x output useful in any way, or was any of the information
> > I supplied in my other replies on this list:
> >
> > https://lore.kernel.org/linux-ext4/51aa3ceea05945c9f28e884bc2f43a249ef7e23e.camel@interlinx.bc.ca/
> > https://lore.kernel.org/linux-ext4/be5e8488f8484194889216603d2aba2812c6adcb.camel@interlinx.bc.ca/
> >
> > useful including the test of 1.47.0 being able to reproduce the
> > behaviour?
>
> It was good and bad -- good in that it eliminated all of my hypotheses
> about what could be causing it; and bad in that now I have no idea.
>
> *Something* is causing the e2fsck exit code to be nonzero, but there's
> nothing identifying what did that in the stdout/stderr dump.
>
> > Any thoughts on how to proceed?
>
> If you're willing to share a metadata dump of the filesystem, injecting:
>
>     e2image -Q "${snap_dev}" /tmp/disk.qcow2
>
> right before the second e2fsck invocation in check() might help us get a
> reproducer going.  Please compress the qcow2 file before uploading it
> somewhere.

/me downloads dump, takes a look...

AHA!  This is an ext2 filesystem, since it doesn't have the
"has_journal" or "extents" features turned on:

# e2image -r /tmp/disk.qcow2 /dev/sda
# dumpe2fs /dev/sda -h
dumpe2fs 1.47.1~WIP-2023-12-27 (27-Dec-2023)
Filesystem volume name:   <none>
Last mounted on:          /opt
Filesystem UUID:          2c70368a-0d54-4805-8620-fda19466d819
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         not clean with errors

(Note: Filesystem state == "clean" means that EXT2_VALID_FS is set in
the superblock s_state field; "not clean with errors" means that the
flag is not set.)

I bet the "journal only" preen doesn't actually reset the filesystem
state either:

# e2fsck -E journal_only -p /dev/sda
# dumpe2fs /dev/sda -h | grep state
dumpe2fs 1.47.1~WIP-2023-12-27 (27-Dec-2023)
Filesystem state:         not clean with errors

Nope.  So now I know what happened -- when mounting an ext* filesystem
that doesn't have a journal, the driver clears EXT2_VALID_FS from the
primary superblock.  This forces the system to run e2fsck after a
crash, because that's what you have to do for unjournalled filesystems.

The "e2fsck -E journal_only -p" call in e2scrub only replays the
journal.  Since there is no journal, it exits almost immediately.
That's the intended behavior, but then it means that the "e2fsck -fy"
call immediately after sees that the superblock doesn't have
EXT2_VALID_FS set, sets it, and makes e2fsck return 1.

So that's why you're getting the e2scrub failures.

Contrast this to what you get when the filesystem has a journal:

# dumpe2fs -h /dev/sdb
dumpe2fs 1.47.0 (5-Feb-2023)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          e18b8b57-a75e-4316-87ce-6a08969476c3
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean

Filesystems with journals retain their EXT4_VALID_FS state when they're
mounted.

Hmm.  I'll have to think tomorrow morning about what e2scrub should do
about unjournalled filesystems.  My initial thought is that it skip
them, because a mounted unjournalled filesystem cannot by definition be
made to be consistent.

Restricting the scope of e2scrub sucks, but in the meantime at least it
means that your filesystem isn't massively corrupt.  Thanks for the
metadump, it was very useful for root cause analysis.

Ted: do you have any ideas?

--D

> --D
>
> > Cheers,
> > b.
> >
>
>
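The "skip unjournalled filesystems" idea floated above could be prototyped by checking the feature list before creating the snapshot. What follows is a sketch under stated assumptions, not the change that went into e2scrub: `has_journal` is a hypothetical helper, and it assumes its stdin is `dumpe2fs -h` output containing a "Filesystem features:" line like the ones shown in this message.

```shell
#!/bin/sh
# has_journal: hypothetical helper (not the actual e2scrub fix).
# Reads `dumpe2fs -h`-style text on stdin and succeeds only if the
# "Filesystem features:" line lists the has_journal feature.
has_journal() {
    # Extract the feature list and squeeze all whitespace to single spaces.
    features="$(sed -n 's/^Filesystem features:[[:space:]]*//p' | tr -s '[:space:]' ' ')"
    # Pad with spaces so only a whole-word match counts.
    case " ${features} " in
        *" has_journal "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller might use it as `dumpe2fs -h "${dev}" 2>/dev/null | has_journal || echo "${dev}: no journal, skipping scrub"`, where `${dev}` stands for the device being scrubbed.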
* Re: e2scrub finds corruption immediately after mounting
From: Brian J. Murrell @ 2024-01-10 13:44 UTC
To: linux-ext4

On Tue, 2024-01-09 at 21:31 -0800, Darrick J. Wong wrote:
>
> AHA!  This is an ext2 filesystem, since it doesn't have the
> "has_journal" or "extents" features turned on:

This is very odd.  I haven't (intentionally) created a ext2 filesystem
since ext3 became available.  :-)

Moreover /proc/mounts says it's an ext4 filesystem:

/dev/mapper/rootvol_tmp-almalinux8_opt /opt ext4 rw,seclabel,relatime 0 0

Do ext2 filesystems actually mount successfully and quietly when
mounted as ext4?

Surely if one asks to mount an ext2 filesystem as ext4 mount should
fail and complain, yes?

Is https://ext4.wiki.kernel.org/index.php/UpgradeToExt4 still
considered accurate, in terms of an in-place upgrade of ext2 to ext4
being sub-optimal?

Is metadata locality the only thing you don't get with an in-place
upgrade?  If so, how important is that, really?

> Thanks for the
> metadump, it was very useful for root cause analysis.

NPAA.  Thank-you very much for your time and analysis on this issue.

Cheers,
b.
* Re: e2scrub finds corruption immediately after mounting
From: Darrick J. Wong @ 2024-01-10 18:06 UTC
To: Brian J. Murrell; +Cc: linux-ext4

On Wed, Jan 10, 2024 at 08:44:31AM -0500, Brian J. Murrell wrote:
> On Tue, 2024-01-09 at 21:31 -0800, Darrick J. Wong wrote:
> >
> > AHA!  This is an ext2 filesystem, since it doesn't have the
> > "has_journal" or "extents" features turned on:
>
> This is very odd.  I haven't (intentionally) created a ext2 filesystem
> since ext3 became available.  :-)

Huh.  Do you remember the exact command that was used to format this
filesystem?  "mke2fs" still formats ext2 filesystems unless you pass
-T ext4 or call its cousin mkfs.ext4.

> Moreover /proc/mounts says it's an ext4 filesystem:
>
> /dev/mapper/rootvol_tmp-almalinux8_opt /opt ext4 rw,seclabel,relatime 0 0

Check /etc/fstab -- if the type is specified as ext4, then that's what
ends up in /proc/mounts, even if it's an ext2 filesystem.

> Do ext2 filesystems actually mount successfully and quietly when
> mounted as ext4?

Yes.  Most distros enable ext4.ko and do not enable ext2.ko, and the
ext4 driver is happy to mount ext2 filesystems but report them as ext4.

> Surely if one asks to mount an ext2 filesystem as ext4 mount should
> fail and complain, yes?

Nope.  ext4 is really just ext2 plus a bunch of new features (journal,
extents, uninit_bg, dir_index).  Or another way to look at it is that
ext2 is really just ext4 minus a bunch of features.

Muddying the water here is the fact that you're allowed to turn /off/
all these new features from the past 20 years, which means that the
integer after "ext" is not actually a gestalt id.

> Is https://ext4.wiki.kernel.org/index.php/UpgradeToExt4 still
> considered accurate, in terms of an in-place upgrade of ext2 to ext4
> being sub-optimal?

Yes, that's accurate.  It's suboptimal in the sense that you ought to
back up the directory tree before running any of those commands in case
something goes wrong (program bug, power outage, etc) but if you have a
backup, you might as well format fresh and restore the backup.

> Is metadata locality the only thing you don't get with an in-place
> upgrade?  If so, how important is that, really?

IIRC I think you don't get flex_bg, which means that the bitmaps are
every 128M instead of every 1G or so, which leads to more seeking.

> > Thanks for the
> > metadump, it was very useful for root cause analysis.
>
> NPAA.  Thank-you very much for your time and analysis on this issue.

No problem.  It's always fun to do a bit of Why, Tho? ;)

--D

>
> Cheers,
> b.
>
* Re: e2scrub finds corruption immediately after mounting
  2024-01-10 18:06 ` Darrick J. Wong
@ 2024-01-10 23:43 ` Andreas Dilger
  2024-01-16 13:29 ` Brian J. Murrell
  2024-01-16 13:22 ` Brian J. Murrell
  1 sibling, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2024-01-10 23:43 UTC
To: Darrick J. Wong; +Cc: Brian J. Murrell, linux-ext4

On Jan 10, 2024, at 11:06 AM, Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jan 10, 2024 at 08:44:31AM -0500, Brian J. Murrell wrote:
>> On Tue, 2024-01-09 at 21:31 -0800, Darrick J. Wong wrote:
>>>
>>> AHA! This is an ext2 filesystem, since it doesn't have the
>>> "has_journal" or "extents" features turned on:
>>
>> This is very odd. I haven't (intentionally) created a ext2 filesystem
>> since ext3 became available. :-)
>
> Huh. Do you remember the exact command that was used to format this
> filesystem? "mke2fs" still formats ext2 filesystems unless you pass
> -T ext4 or call its cousin mkfs.ext4.
>
>> Moreover /proc/mounts says it's an ext4 filesystem:
>>
>> /dev/mapper/rootvol_tmp-almalinux8_opt /opt ext4 rw,seclabel,relatime 0 0
>
> Check /etc/fstab -- if the type is specified as ext4, then that's what
> ends up in /proc/mounts, even if it's an ext2 filesystem.
>
>> Do ext2 filesystems actually mount successfully and quietly when
>> mounted as ext4?
>
> Yes. Most distros enable ext4.ko and do not enable ext2.ko, and the
> ext4 driver is happy to mount ext2 filesystems but report them as ext4.
>
>> Surely if one asks to mount an ext2 filesystem as ext4 mount should
>> fail and complain, yes?
>
> Nope. ext4 is really just ext2 plus a bunch of new features (journal,
> extents, uninit_bg, dir_index). Or another way to look at it is that
> ext2 is really just ext4 minus a bunch of features.
>
> Muddying the water here is the fact that you're allowed to turn /off/
> all these new features from the past 20 years, which means that the
> integer after "ext" is not actually a gestalt id.
>
>> Is https://ext4.wiki.kernel.org/index.php/UpgradeToExt4 still
>> considered accurate, in terms of an in-place upgrade of ext2 to ext4
>> being sub-optimal?
>
> Yes, that's accurate. It's suboptimal in the sense that you ought to
> back up the directory tree before running any of those commands in case
> something goes wrong (program bug, power outage, etc) but if you have a
> backup, you might as well format fresh and restore the backup.
>
>> Is metadata locality the only thing you don't get with an in-place
>> upgrade? If so, how important is that, really?
>
> IIRC I think you don't get flex_bg, which means that the bitmaps are
> every 128M instead of every 1G or so, which leads to more seeking.
>
>>> Thanks for the
>>> metadump, it was very useful for root cause analysis.
>>
>> NPAA. Thank-you very much for your time and analysis on this issue.
>
> No problem. It's always fun to do a bit of Why, Tho? ;)

Hello Brian, long time no see!

I was wondering if this might be a case where e2fsck removed the journal
on an ext4 filesystem, and then it wasn't recreated (e.g. if e2fsck was
killed before it finished cleanly).

However, looking at the features enabled on the filesystem, it definitely
looks like this was originally formatted as ext4. Like Darrick mentioned,
it is missing flex_bg, along with a whole slew of newer features.

On one of my local ext4 filesystems it has:

Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

compared to your filesystem:

Filesystem features: ext_attr resize_inode dir_index filetype sparse_super large_file

Many of these features can be enabled on an existing filesystem, like
has_journal (ext3/4 journal), extents (improved large file allocation),
huge_file (> 2TB files), dir_nlink (> 32000 subdirs) if you want them.
I _think_ uninit_bg (e2fsck skip unused metadata may) is included here.

Some cannot be enabled on an existing filesystem like flex_bg (localized
metadata), and extra_isize (fast xattrs).

Whether that is worthwhile for you to enable, or just backup/reformat/sync
is up to you.

Cheers, Andreas
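As a rough illustration of the comparison in the message above, the following sketch takes the two "Filesystem features:" lists quoted there and prints which features the reported filesystem lacks. The helper names ext_flavour and missing_features are made up for this example, and the has_journal/extent rule of thumb is an assumption drawn from the discussion in this thread, not an official definition of ext2, ext3 or ext4.

```shell
#!/bin/sh
# Hypothetical helper: guess which "ext" generation a feature list
# resembles. Rule of thumb only (assumption): extent => ext4-like,
# has_journal without extent => ext3-like, neither => ext2-like.
ext_flavour() {
    case " $1 " in
        *" extent "*|*" extents "*) echo ext4 ;;
        *" has_journal "*)          echo ext3 ;;
        *)                          echo ext2 ;;
    esac
}

# Hypothetical helper: print every feature that appears in the first
# list but not in the second, one per line.
missing_features() {
    for f in $1; do
        case " $2 " in
            *" $f "*) ;;
            *) printf '%s\n' "$f" ;;
        esac
    done
}

# The two feature lists quoted in the message above.
reference_features="has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize"
reported_features="ext_attr resize_inode dir_index filetype sparse_super large_file"

echo "reference list looks like: $(ext_flavour "$reference_features")"
echo "reported list looks like:  $(ext_flavour "$reported_features")"
echo "features missing from the reported filesystem:"
missing_features "$reference_features" "$reported_features"
```

On a real system the feature list would come from the "Filesystem features:" line that dumpe2fs -h or tune2fs -l prints for the device.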
* Re: e2scrub finds corruption immediately after mounting
  2024-01-10 23:43 ` Andreas Dilger
@ 2024-01-16 13:29 ` Brian J. Murrell
  0 siblings, 0 replies; 16+ messages in thread
From: Brian J. Murrell @ 2024-01-16 13:29 UTC
To: linux-ext4

On Wed, 2024-01-10 at 16:43 -0700, Andreas Dilger wrote:
>
> Hello Brian, long time no see!

Hi Andreas. Indeed, it has been a while. :-)

> I was wondering if this might be a case where e2fsck removed the journal
> on an ext4 filesystem, and then it wasn't recreated (e.g. if e2fsck was
> killed before it finished cleanly).
>
> However, looking at the features enabled on the filesystem, it definitely
> looks like this was originally formatted as ext4.

I suspect you mean s/4/2/ above?

> Many of these features can be enabled on an existing filesystem, like
> has_journal (ext3/4 journal), extents (improved large file allocation),
> huge_file (> 2TB files), dir_nlink (> 32000 subdirs) if you want them.
> I _think_ uninit_bg (e2fsck skip unused metadata may) is included here.
>
> Some cannot be enabled on an existing filesystem like flex_bg (localized
> metadata), and extra_isize (fast xattrs).
>
> Whether that is worthwhile for you to enable, or just backup/reformat/sync
> is up to you.

Indeed. I did simply re-create and copy as suggested by the wiki entry.

Cheers, b.
* Re: e2scrub finds corruption immediately after mounting
  2024-01-10 18:06 ` Darrick J. Wong
  2024-01-10 23:43 ` Andreas Dilger
@ 2024-01-16 13:22 ` Brian J. Murrell
  2024-01-17 19:42 ` Andreas Dilger
  1 sibling, 1 reply; 16+ messages in thread
From: Brian J. Murrell @ 2024-01-16 13:22 UTC
To: linux-ext4

On Wed, 2024-01-10 at 10:06 -0800, Darrick J. Wong wrote:
>
> Huh. Do you remember the exact command that was used to format this
> filesystem?

I do not. It was created quite a while ago.

> "mke2fs" still formats ext2 filesystems unless you pass
> -T ext4 or call its cousin mkfs.ext4.

I wonder if that's what I did perhaps.

> Nope. ext4 is really just ext2 plus a bunch of new features (journal,
> extents, uninit_bg, dir_index).

Yes, that's completely understood. I would have thought it an
interesting "safety" measure to flag that when a user requests an ext4
mount and the file system is actually only ext2 that a refusal to mount
would indicate to the user that their ext* file system does not have
the required features to be called ext4.

Cheers, b.
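The "safety" measure described in the message above does not exist in the kernel or in mount(8), as the replies in this thread make clear. Purely as an illustration of the idea, here is a minimal userspace sketch. The function name mount_type_matches_features and the choice of has_journal and extent as the required features are assumptions made for this example (they are the two features named earlier in the thread), not an official definition.

```shell
#!/bin/sh
# Hypothetical userspace check, NOT existing kernel or mount(8) behaviour.
# $1 = requested filesystem type (e.g. from /etc/fstab)
# $2 = space-separated feature list of the filesystem
# Returns non-zero when ext4 was requested but the feature list lacks
# has_journal or extent (assumed here to be the minimum for "ext4").
mount_type_matches_features() {
    [ "$1" = ext4 ] || return 0
    for needed in has_journal extent; do
        case " $2 " in
            *" $needed "*) ;;
            *)
                echo "warning: requested ext4 but '$needed' is not enabled" >&2
                return 1
                ;;
        esac
    done
    return 0
}

# Feature list of the filesystem discussed in this thread.
reported="ext_attr resize_inode dir_index filetype sparse_super large_file"
if ! mount_type_matches_features ext4 "$reported"; then
    echo "this filesystem would not pass the hypothetical ext4 check"
fi
```

A wrapper like this could only warn. Whether the mount succeeds is still up to the ext4 driver, which accepts such filesystems.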
* Re: e2scrub finds corruption immediately after mounting
  2024-01-16 13:22 ` Brian J. Murrell
@ 2024-01-17 19:42 ` Andreas Dilger
  2024-01-17 22:20 ` Brian J. Murrell
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2024-01-17 19:42 UTC
To: Brian J. Murrell; +Cc: Ext4 Developers List

On Jan 16, 2024, at 6:29 AM, Brian J. Murrell <brian@interlinx.bc.ca> wrote:
>
> On Wed, 2024-01-10 at 10:06 -0800, Darrick J. Wong wrote:
>>
>> Huh. Do you remember the exact command that was used to format this
>> filesystem?
>
> I do not. It was created quite a while ago.
>
>> "mke2fs" still formats ext2 filesystems unless you pass
>> -T ext4 or call its cousin mkfs.ext4.
>
> I wonder if that's what I did perhaps.
>
>
>> Nope. ext4 is really just ext2 plus a bunch of new features (journal,
>> extents, uninit_bg, dir_index).
>
> Yes, that's completely understood. I would have thought it an
> interesting "safety" measure to flag that when a user requests an ext4
> mount and the file system is actually only ext2 that a refusal to mount
> would indicate to the user that their ext* file system does not have
> the required features to be called ext4.

At this stage in the game, it _probably_ makes sense that bare "mke2fs"
default to ext4 instead of ext2 to avoid this issue?

Cheers, Andreas
* Re: e2scrub finds corruption immediately after mounting
  2024-01-17 19:42 ` Andreas Dilger
@ 2024-01-17 22:20 ` Brian J. Murrell
  0 siblings, 0 replies; 16+ messages in thread
From: Brian J. Murrell @ 2024-01-17 22:20 UTC
To: Ext4 Developers List

On Wed, 2024-01-17 at 12:42 -0700, Andreas Dilger wrote:
>
> At this stage in the game, it _probably_ makes sense that bare "mke2fs"
> default to ext4 instead of ext2 to avoid this issue?

Seems reasonable to me. :-)

Cheers, b.
* Re: e2scrub finds corruption immediately after mounting
  2024-01-04 4:55 ` Darrick J. Wong
  2024-01-04 14:13 ` Brian J. Murrell
@ 2024-01-04 14:37 ` Brian J. Murrell
  1 sibling, 0 replies; 16+ messages in thread
From: Brian J. Murrell @ 2024-01-04 14:37 UTC
To: linux-ext4

As a point of reference, the aforementioned lvcheck doesn't seem to find
any corruption on the same device and here is what it's doing:

…
+ lvcreate -s -L 256M -n almalinux8_opt-lvcheck-temp-20240104 rootvol_tmp/almalinux8_opt
  Logical volume "almalinux8_opt-lvcheck-temp-20240104" created.
+ perform_check /dev/rootvol_tmp/almalinux8_opt-lvcheck-temp-20240104 ext2 /tmp/lvcheck.log.e0Xq523Wio
+ local dev=/dev/rootvol_tmp/almalinux8_opt-lvcheck-temp-20240104
+ local fstype=ext2
+ local tmpfile=/tmp/lvcheck.log.e0Xq523Wio
+ case "$fstype" in
+ nice logsave -as /tmp/lvcheck.log.e0Xq523Wio e2fsck -p -C 0 /dev/rootvol_tmp/almalinux8_opt-lvcheck-temp-20240104
/dev/rootvol_tmp/almalinux8_opt-lvcheck-temp-20240104 contains a file system with errors, check forced.
/dev/rootvol_tmp/almalinux8_opt-lvcheck-temp-20240104: 1698/178816 files (87.0% non-contiguous), 482473/716800 blocks
e2fsck exited with status code 1
+ nice logsave -as /tmp/lvcheck.log.e0Xq523Wio e2fsck -fy -C 0 /dev/rootvol_tmp/almalinux8_opt-lvcheck-temp-20240104
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/rootvol_tmp/almalinux8_opt-lvcheck-temp-20240104: 1698/178816 files (87.0% non-contiguous), 482473/716800 blocks
+ return 0
+ log info 'Background scrubbing of /dev/rootvol_tmp/almalinux8_opt succeeded.'
+ local sev=info
+ local 'msg=Background scrubbing of /dev/rootvol_tmp/almalinux8_opt succeeded.'
+ local arg=
+ '[' info == emerg -o info == alert -o info == crit -o info == err -o info == warning ']'
+ logger -t lvcheck -p user.info -- 'Background scrubbing of /dev/rootvol_tmp/almalinux8_opt succeeded.'
+ try_delay_checks /dev/rootvol_tmp/almalinux8_opt ext2
+ local dev=/dev/rootvol_tmp/almalinux8_opt
+ local fstype=ext2
+ case "$fstype" in
+ tune2fs -C 0 -T now /dev/rootvol_tmp/almalinux8_opt
tune2fs 1.47.0 (5-Feb-2023)
Setting current mount count to 0
Setting time filesystem last checked to Thu Jan 4 09:29:25 2024

The significant difference between lvcheck and e2scrub seems to be the
'-E journal_only' option to e2fsck that e2scrub is adding.

Cheers, b.
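The trace above contains the line "e2fsck exited with status code 1" for the first (-p) pass, after which the second (-fy) pass came back clean. As a reading aid, the sketch below decodes an e2fsck exit status using the flag values documented in the e2fsck(8) manual page. The function name describe_e2fsck_status is made up for this example, and how lvcheck or e2scrub act on these values is not shown here.

```shell
#!/bin/sh
# Decode an e2fsck exit status. e2fsck(8) documents the status as the
# sum of the flag values listed below. The function name is hypothetical.
describe_e2fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    for pair in "1:errors corrected" \
                "2:errors corrected, reboot needed" \
                "4:errors left uncorrected" \
                "8:operational error" \
                "16:usage or syntax error" \
                "32:cancelled by user request" \
                "128:shared library error"; do
        bit=${pair%%:*}     # numeric flag before the colon
        text=${pair#*:}     # description after the colon
        if [ $((status & bit)) -ne 0 ]; then
            out="${out:+$out; }$text"
        fi
    done
    echo "$out"
}

# Status reported for the first e2fsck pass in the trace above.
describe_e2fsck_status 1
```

Read this way, status 1 means that pass fixed something on the snapshot, which fits the "contains a file system with errors, check forced" message just before it.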
end of thread, other threads:[~2024-01-17 22:29 UTC | newest]

Thread overview: 16+ messages
2024-01-03 21:14 e2scrub finds corruption immediately after mounting Brian J. Murrell
2024-01-04 4:38 ` Theodore Ts'o
2024-01-04 14:10 ` Brian J. Murrell
2024-01-04 4:55 ` Darrick J. Wong
2024-01-04 14:13 ` Brian J. Murrell
2024-01-08 12:52 ` Brian J. Murrell
2024-01-09 6:06 ` Darrick J. Wong
2024-01-10 5:31 ` Darrick J. Wong
2024-01-10 13:44 ` Brian J. Murrell
2024-01-10 18:06 ` Darrick J. Wong
2024-01-10 23:43 ` Andreas Dilger
2024-01-16 13:29 ` Brian J. Murrell
2024-01-16 13:22 ` Brian J. Murrell
2024-01-17 19:42 ` Andreas Dilger
2024-01-17 22:20 ` Brian J. Murrell
2024-01-04 14:37 ` Brian J. Murrell