* ext4 damage suspected in between 5.15.167 - 5.15.170
From: Nikolai Zhubr @ 2024-12-12 18:31 UTC (permalink / raw)
To: linux-ext4, stable, linux-kernel, jack, Nikolai Zhubr
Hi,
This is to report that after jumping from generic kernel 5.15.167 to
5.15.170 I have apparently observed ext4 damage.
After a few days of regular daily use of 5.15.170, one morning my
ext4 partition refused to mount, complaining about a corrupted system
area (-117).
There were no unusual events preceding this. The device in question is
a laptop with a healthy battery, also permanently connected to AC.
The laptop is privately owned by me, in daily use at home, so I am
100% aware of everything happening with it.
The filesystem in question lives on md raid1 with very asymmetric
members (SSD + HDD), so in the event of an emergency CPU halt or some
other abnormal stop while the filesystem was actively writing data, one
would not expect the raid members to stay in perfect sync.
After the incident, I've run a raid1 check multiple times, run memtest
multiple times from different boot media, and certainly consulted
smartctl.
Nothing. No issues whatsoever except for this spontaneous ext4 damage.
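(For reference, a raid1 check can be triggered via sysfs; this is my
rough sketch of the equivalent commands, assuming the array is md126:)
echo check > /sys/block/md126/md/sync_action   # request a full member-vs-member compare
cat /proc/mdstat                               # watch progress of the check
cat /sys/block/md126/md/mismatch_cnt           # 0 means both members read back identical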
Looking at the git log for ext4 changes between 5.15.167 and 5.15.170
shows a few commits; all landed in 5.15.168.
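(Roughly what I looked at - a sketch, assuming a checkout of the stable
tree with the v5.15.* tags available:)
# list ext4 commits that entered the 5.15 stable series in this range
git log --oneline v5.15.167..v5.15.170 -- fs/ext4/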
Interestingly, one of them is a comeback of the (in)famous
91562895f803 "properly sync file size update after O_SYNC ...", which
caused a blowup a year ago due to a "subtle interaction".
I've no idea whether 91562895f803 is related to the damage this time or
not, but it definitely looks like some problem was introduced between
5.15.167 and 5.15.170 anyway.
And because there are apparently 0 commits to ext4 in 5.15 since
5.15.168 at the moment, I thought I'd report.
Please CC me if you want me to see your reply and/or need more info
(I'm not subscribed to the normal flow).
Take care,
Nick
* Re: ext4 damage suspected in between 5.15.167 - 5.15.170
From: Theodore Ts'o @ 2024-12-12 19:16 UTC (permalink / raw)
To: Nikolai Zhubr; +Cc: linux-ext4, stable, linux-kernel, jack
On Thu, Dec 12, 2024 at 09:31:05PM +0300, Nikolai Zhubr wrote:
> This is to report that after jumping from generic kernel 5.15.167 to
> 5.15.170 I apparently observe ext4 damage.
Hi Nick,
In general this is not something that upstream kernel developers will
pay a lot of attention to try to root cause. If you can come up with
a reliable reproducer, not just a single one-off, it's much more
likely that people will pay attention. If you can demonstrate that
the reliable reproducer shows the issue on the latest development HEAD
of the upstream kernel, they will definitely pay attention.
People will also pay more attention if you give more detail in your
message. Not just some vague "ext4 damage" (where 99% of the time, these
sorts of things happen due to hardware-induced corruption), but the
exact message when mount failed.
Also, when reporting ext4 issues, it's helpful to include information
about the file system configuration using "dumpe2fs -h /dev/XXX".
Extracting kernel log messages that include the string "EXT4-fs", via
commands like "sudo dmesg | grep EXT4-fs", "sudo journalctl | grep
EXT4-fs", or "grep EXT4-fs /var/log/messages", is also helpful, as is
getting a report from fsck via a command like
"fsck.ext4 -fn /dev/XXX >& /tmp/fsck.out".
That way they can take a quick look at the information and do an initial
triage of the most likely cause.
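As a rough sketch (assuming the affected filesystem device is /dev/XXX
and the commands are run as root), something like this collects all of
that in one pass:
DEV=/dev/XXX                                   # substitute the actual device
dumpe2fs -h "$DEV"             > /tmp/dumpe2fs.out   2>&1
dmesg | grep EXT4-fs           > /tmp/dmesg-ext4.out
journalctl -k | grep EXT4-fs   > /tmp/journalctl-ext4.out
grep EXT4-fs /var/log/messages > /tmp/messages-ext4.out
fsck.ext4 -fn "$DEV"           > /tmp/fsck.out       2>&1   # -n keeps it read-only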
> And because there are apparently 0 commits to ext4 in 5.15 since
> 5.15.168 at the moment, I thought I'd report.
Did you check for any changes to the md/dm code, or the block layer?
Also, if you checked for I/O errors in the system logs, or ran
"smartctl" on the block devices, please say so. (And if there are
indications of I/O errors or storage device issues, please do
immediate backups and make plans to replace your hardware before you
suffer more serious data loss.)
Finally, if you want more support than the volunteers in the upstream
Linux kernel community can provide, that is what paid support from
companies like SuSE or Red Hat is for.
Cheers,
- Ted
* Re: ext4 damage suspected in between 5.15.167 - 5.15.170
From: Nikolai Zhubr @ 2024-12-13 10:49 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4, stable, linux-kernel, jack
Hi Ted,
> On Thu, Dec 12, 2024 at 09:31:05PM +0300, Nikolai Zhubr wrote:
>> This is to report that after jumping from generic kernel 5.15.167 to
>> 5.15.170 I apparently observe ext4 damage.
>
> Hi Nick,
>
> In general this is not something that upstream kernel developers will
> pay a lot of attention to try to root cause. If you can come up with
Thanks for a quick and detailed reply. That's really appreciated. I need
to clarify: I'm not a hardcore kernel developer at all, I just touch it
a little bit occasionally, for random reasons. Debugging the situation
thoroughly so as to find and prove the cause is far beyond my capability,
and also not exactly my personal or professional interest. I also don't
need any sort of support (i.e. as a client) - I've already repaired and
validated/restored from backups almost everything now, and I can just
stick with 5.15.167 for basically as long as I like.
On the other hand, having buggy kernels (to the point of ext4 fs
corruption) published as suitable for wide general use is not a good
thing in my book, so I believe that in the case of reasonable suspicion
I must at least raise a warning about it, and if I can somehow
contribute to tracking down the problem I'll do what I'm able to.
Not going to argue, but it would seem that if 5.15 is totally out of
interest already, why keep patching it? And as long as it keeps
receiving patches, supposedly they are backported and applied to
stabilize it, not damage it? Ok, nevermind :-)
> People will also pay more attention if you give more detail in your
> message. Not just some vague "ext4 damage" (where 99% of time, these
> sorts of things happen due to hardware-induced corruption), but the
> exact message when mount failed.
Yes. That is why I spent two days solely testing hardware, booting
from separate media, stressing everything, and making plenty of copies.
As I mentioned in my initial post, this revealed no hardware issues.
And I've been enjoying md raid1 since around 2003 (not on this device
though). I can post all my "smart" values as is, but I can assure you
they are perfectly fine for both raid1 members. I routinely encounter
faulty hdds elsewhere, so it's not something I haven't seen before.
#smartctl -a /dev/nvme0n1 | grep Spare
Available Spare: 100%
Available Spare Threshold: 10%
#smartctl -a /dev/sda | grep Sector
Sector Sizes: 512 bytes logical, 4096 bytes physical
5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always
- 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always
- 0
I have a copy of the entire ext4 partition taken immediately after the
mount first failed; it is ~800 GB and may contain some sensitive data,
so I cannot just hand it to someone else or publish it for examination.
But I can now easily replay the mount failure and fsck processing as
many times as needed. For now, it seems file/dir bodies have not been
damaged, just some system areas. I've not encountered any file that
gives a wrong checksum or otherwise appears definitely damaged; overall
about 95% is verified and definitely fine, and the remaining 5% is hard
to verify reliably, but those are less important files.
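(The replay itself is done against the copy, roughly as sketched below;
the image path is a placeholder and the copy is assumed to be a plain
raw image of the partition:)
# attach the image strictly read-only, so replays never modify it
losetup --find --show --read-only /path/to/partition-copy.img   # prints e.g. /dev/loop0
mount -o ro /dev/loop0 /mnt/test    # attempt the mount (this is where the -117 showed up)
fsck.ext4 -fn /dev/loop0            # read-only fsck of the same broken state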
> Also helpful when reporting ext4 issues, it's helpful to include
> information about the file system configuration using "dumpe2fs -h
This is a dump run on a standalone copy taken before repair (after
successful raid re-check):
#dumpe2fs -h /dev/sdb1
Filesystem volume name: DATA
Last mounted on: /opt
Filesystem UUID: ea823c6c-500f-4bf0-a4a7-a872ed740af3
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype extent 64bit flex_bg sparse_super large_file huge_file
dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 51634176
Block count: 206513920
Reserved block count: 10325696
Overhead clusters: 3292742
Free blocks: 48135978
Free inodes: 50216050
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Reserved GDT blocks: 1024
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue Jul 9 01:51:16 2024
Last mount time: Mon Dec 9 10:08:27 2024
Last write time: Tue Dec 10 04:08:17 2024
Mount count: 273
Maximum mount count: -1
Last checked: Tue Jul 9 01:51:16 2024
Check interval: 0 (<none>)
Lifetime writes: 913 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 60bfa28b-cdd2-4ba6-8261-87961db4ecea
Journal backup: inode blocks
FS Error count: 293
First error time: Tue Dec 10 06:17:23 2024
First error function: ext4_lookup
First error line #: 1437
First error inode #: 20709377
Last error time: Tue Dec 10 21:12:30 2024
Last error function: ext4_lookup
Last error line #: 1437
Last error inode #: 20709377
Journal features: journal_incompat_revoke journal_64bit
Total journal size: 128M
Total journal blocks: 32768
Max transaction length: 32768
Fast commit length: 0
Journal sequence: 0x00064c6e
Journal start: 0
> /dev/XXX". Extracting kernel log messages that include the string
> "EXT4-fs", via commands like "sudo dmesg | grep EXT4-fs", or "sudo
> journalctl | grep EXT4-fs", or "grep EXT4-fs /var/log/messages" are
> also helpful, as is getting a report from fsck via a command like
#grep EXT4-fs messages-20241212 | grep md126
2024-12-06T11:53:09.471317+03:00 lenovo-zh kernel: [ 7.649474][
T1124] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-06T11:53:09.471351+03:00 lenovo-zh kernel: [ 7.899321][
T1124] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts:
noacl. Quota mode: none.
2024-12-07T12:03:18.518047+03:00 lenovo-zh kernel: [ 7.633150][
T1106] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-07T12:03:18.518054+03:00 lenovo-zh kernel: [ 7.951716][
T1106] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts:
noacl. Quota mode: none.
2024-12-08T12:41:33.686145+03:00 lenovo-zh kernel: [ 7.588405][
T1118] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-08T12:41:33.686148+03:00 lenovo-zh kernel: [ 7.679963][
T1118] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts:
noacl. Quota mode: none.
(* normal boot failed and subsequently fsck was run on real data here *)
2024-12-10T18:21:40.356656+03:00 lenovo-zh kernel: [ 483.522025][
T1740] EXT4-fs (md126): failed to initialize system zone (-117)
2024-12-10T18:21:40.356685+03:00 lenovo-zh kernel: [ 483.522050][
T1740] EXT4-fs (md126): mount failed
2024-12-11T02:00:18.382301+03:00 lenovo-zh kernel: [ 490.551080][
T1809] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts:
(null). Quota mode: none.
2024-12-11T12:00:53.249626+03:00 lenovo-zh kernel: [ 7.550823][
T1056] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-11T12:00:53.249629+03:00 lenovo-zh kernel: [ 7.662317][
T1056] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts:
noacl. Quota mode: none.
#grep md126 messages-20241212
2024-12-07T12:03:18.518038+03:00 lenovo-zh kernel: [ 7.154448][ T992]
md126: detected capacity change from 0 to 1652111360
2024-12-07T12:03:18.518047+03:00 lenovo-zh kernel: [ 7.633150][
T1106] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-07T12:03:18.518054+03:00 lenovo-zh kernel: [ 7.951716][
T1106] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts:
noacl. Quota mode: none.
2024-12-08T12:41:33.685280+03:00 lenovo-zh systemd[1]: Started Timer to
wait for more drives before activating degraded array md126..
2024-12-08T12:41:33.685325+03:00 lenovo-zh systemd[1]:
mdadm-last-resort@md126.timer: Deactivated successfully.
2024-12-08T12:41:33.685327+03:00 lenovo-zh systemd[1]: Stopped Timer to
wait for more drives before activating degraded array md126..
2024-12-08T12:41:33.686136+03:00 lenovo-zh kernel: [ 7.346744][
T1107] md/raid1:md126: active with 2 out of 2 mirrors
2024-12-08T12:41:33.686137+03:00 lenovo-zh kernel: [ 7.357218][
T1107] md126: detected capacity change from 0 to 1652111360
2024-12-08T12:41:33.686145+03:00 lenovo-zh kernel: [ 7.588405][
T1118] EXT4-fs (md126): Mount option "noacl" will be removed by 3.5
2024-12-08T12:41:33.686148+03:00 lenovo-zh kernel: [ 7.679963][
T1118] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts:
noacl. Quota mode: none.
(* on 2024-12-09 system refused to boot and no normal log was written *)
2024-12-10T18:13:44.862091+03:00 lenovo-zh systemd[1]: Started Timer to
wait for more drives before activating degraded array md126..
2024-12-10T18:13:45.164589+03:00 lenovo-zh kernel: [ 8.332616][
T1248] md/raid1:md126: active with 2 out of 2 mirrors
2024-12-10T18:13:45.196580+03:00 lenovo-zh kernel: [ 8.363066][
T1248] md126: detected capacity change from 0 to 1652111360
2024-12-10T18:13:45.469396+03:00 lenovo-zh systemd[1]:
mdadm-last-resort@md126.timer: Deactivated successfully.
2024-12-10T18:13:45.469584+03:00 lenovo-zh systemd[1]: Stopped Timer to
wait for more drives before activating degraded array md126..
2024-12-10T18:18:51.652575+03:00 lenovo-zh kernel: [ 314.821429][
T1657] md: data-check of RAID array md126
2024-12-10T18:21:40.356656+03:00 lenovo-zh kernel: [ 483.522025][
T1740] EXT4-fs (md126): failed to initialize system zone (-117)
2024-12-10T18:21:40.356685+03:00 lenovo-zh kernel: [ 483.522050][
T1740] EXT4-fs (md126): mount failed
2024-12-10T20:07:29.116652+03:00 lenovo-zh kernel: [ 6832.284366][
T1657] md: md126: data-check done.
(fsck was run on real data here)
2024-12-11T01:52:15.839052+03:00 lenovo-zh systemd[1]: Started Timer to
wait for more drives before activating degraded array md126..
2024-12-11T01:52:15.840396+03:00 lenovo-zh kernel: [ 7.832271][
T1170] md/raid1:md126: active with 2 out of 2 mirrors
2024-12-11T01:52:15.840397+03:00 lenovo-zh kernel: [ 7.845385][
T1170] md126: detected capacity change from 0 to 1652111360
2024-12-11T01:52:16.255454+03:00 lenovo-zh systemd[1]:
mdadm-last-resort@md126.timer: Deactivated successfully.
2024-12-11T01:52:16.255573+03:00 lenovo-zh systemd[1]: Stopped Timer to
wait for more drives before activating degraded array md126..
2024-12-11T02:00:18.382301+03:00 lenovo-zh kernel: [ 490.551080][
T1809] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts:
(null). Quota mode: none.
> "fsck.ext4 -fn /dev/XXX >& /tmp/fsck.out"
This is a fsck run on a standalone copy taken before repair (after
successful raid re-check):
#fsck.ext4 -fn /dev/sdb1
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
fsck.ext4: Group descriptors look bad... trying backup blocks...
Pass 1: Checking inodes, blocks, and sizes
Inode 9185447 extent tree (at level 1) could be narrower. Optimize? no
Inode 9189969 extent tree (at level 1) could be narrower. Optimize? no
Inode 22054610 extent tree (at level 1) could be shorter. Optimize? no
Inode 22959998 extent tree (at level 1) could be shorter. Optimize? no
Inode 23351116 extent tree (at level 1) could be shorter. Optimize? no
Inode 23354700 extent tree (at level 1) could be shorter. Optimize? no
Inode 23363083 extent tree (at level 1) could be shorter. Optimize? no
Inode 25197205 extent tree (at level 1) could be narrower. Optimize? no
Inode 25197271 extent tree (at level 1) could be narrower. Optimize? no
Inode 47710225 extent tree (at level 1) could be narrower. Optimize? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (23414, counted=22437).
Fix? no
Free blocks count wrong for group #1 (31644, counted=7).
Fix? no
Free blocks count wrong for group #2 (32768, counted=0).
Fix? no
Free blocks count wrong for group #3 (31644, counted=4).
Fix? no
[repeated tons of times]
Free inodes count wrong for group #4895 (8192, counted=8044).
Fix? no
Directories count wrong for group #4895 (0, counted=148).
Fix? no
Free inodes count wrong for group #4896 (8192, counted=8114).
Fix? no
Directories count wrong for group #4896 (0, counted=13).
Fix? no
Free inodes count wrong for group #5824 (8192, counted=8008).
Fix? no
Directories count wrong for group #5824 (0, counted=31).
Fix? no
Free inodes count wrong (51634165, counted=50157635).
Fix? no
DATA: ********** WARNING: Filesystem still has errors **********
DATA: 11/51634176 files (73845.5% non-contiguous), 3292748/206513920 blocks
>> And because there are apparently 0 commits to ext4 in 5.15 since
>> 5.15.168 at the moment, I thought I'd report.
>
> Did you check for any changes to the md/dm code, or the block layer?
No. Generally, it could be just about anything, so I see no point in
even starting without good background knowledge. That is why I'm trying
to draw the attention of those who are more knowledgeable instead. :-)
> Also, if you checked for I/O errors in the system logs, or run
> "smartctl" on the block devices, please say so. (And if there are
> indications of I/O errors or storage device issues, please do
> immediate backups and make plans to replace your hardware before you
I have not found any indication of hardware errors at this point.
#grep -i err messages-20241212 | grep sda
(nothing)
#grep -i err messages-20241212 | grep nvme
(nothing)
Some "smart" values are posted above. Nothing suspicious whatsoever.
Thank you!
Regards,
Nick
> suffer more serious data loss.)
>
> Finally, if you want more support than what volunteers in the upstream
> linux kernel community can provide, this is what paid support from
> companies like SuSE, or Red Hat, can provide.
>
> Cheers,
>
> - Ted
* Re: ext4 damage suspected in between 5.15.167 - 5.15.170
From: Theodore Ts'o @ 2024-12-13 16:12 UTC (permalink / raw)
To: Nikolai Zhubr; +Cc: linux-ext4, stable, linux-kernel, jack
On Fri, Dec 13, 2024 at 01:49:59PM +0300, Nikolai Zhubr wrote:
>
> Not going to argue, but it'd seem if 5.15 is totally out of interest
> already, why keep patching it? And as long as it keeps receiving patches,
> supposedly they are backported and applied to stabilize, not damage it? Ok,
> nevermind :-)
The Long-Term Stable (LTS) kernels are maintained by the LTS team. A
description of how it works can be found here[1].
[1] https://docs.kernel.org/process/2.Process.html#the-big-picture
Subsystems can tag patches sent to the development head by adding "Cc:
stable@kernel.org" to the commit description. However, they are not
obligated to do that, so there is an auxiliary system which uses AI to
intuit which patches might be a bug fix. There are also automated
systems that try to automatically figure out which patches might be
prerequisites that are needed. This system is very automated, and after
the LTS team uses their automated scripts to generate the LTS kernel,
it gets published as a release candidate for 48 hours before it gets
pushed out.
Kernel developers are not obligated to support LTS kernels. The fact
that they tag commits as "you might want to consider it for
backporting" might be all they do; and in some cases, not even that.
Most kernel maintainers don't even bother testing the LTS candidate
releases. (I only started adding automated tests earlier this year to
test the LTS release candidates.)
The primary use for LTS kernels are for companies that really don't
want to update to newer kernels, and have kernel teams who can provide
support for the LTS kernels and their customers. So if Amazon,
Google, and some Android manufacturers want to keep using 5.15, or
6.1, or 6.6, it's provided as a starting point to make life easier for
them, especially in terms of getting security bugs backported.
If the kernel teams for the companies which use the LTS kernels find
problems, they can let the LTS team know if there is some regression,
or they can manually backport some patch that couldn't be handled by
the automated scripts. But it's all on a best-efforts basis.
For hobbyists and indeed most users, what I generally recommend is
that they switch to the latest LTS kernel once a year. So for
example, the last LTS kernel released in 2023 was 6.6. It looks very
much like the last kernel released in 2024 will be 6.12, so that will
likely be the next LTS kernel. In general, there is more attention
paid to the newer LTS kernels, and although *technically* there are
LTS kernels going back to 5.4, pretty much no one pays attention to
them other than the companies stubbornly hanging on because they don't
have the engineering bandwidth to go to a newer kernel, despite the
fact that many security bug fixes never make it all the way back to
those ancient kernels.
> Yes. That is why I spent 2 days for solely testing hardware, booting from
> separate media, stressing everything, and making plenty of copies. As I
> mentioned in my initial post, this had revealed no hardware issues. And I'm
> enjoying md raid-1 since around 2003 already (Not on this device though). I
> can post all my "smart" values as is, but I can assure they are perfectly
> fine for both raid-1 members. I encounter faulty hdds elsewhere routinely so
> its not something unseen too.
Note that some hardware errors can be caused by one-off events, such
as cosmic rays causing a bit-flip in a memory DIMM. If that happens,
RAID won't save you, since the error was introduced before an updated
block group descriptor (for example) gets written. ECC will help;
unfortunately, most consumer grade systems don't use ECC. (And by the
way, there are systems used in hyperscaler cloud companies which look
for CPU-level failures, which can start with silent bit flips leading
to crashes or rep-invariant failures, and correlate them with
specific CPU cores. For example, see [2].)
[2] https://research.google/pubs/detection-and-prevention-of-silent-data-corruption-in-an-exabyte-scale-database-system/
> This is a fsck run on a standalone copy taken before repair (after
> successful raid re-check):
>
> #fsck.ext4 -fn /dev/sdb1
> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
> fsck.ext4: Group descriptors look bad... trying backup blocks...
What this means is that the block group descriptor for one of ext4's
block groups has the location of its block allocation bitmap set to an
invalid value. For example, if one of the high bits in the block
allocation bitmap's location gets flipped, the block number will be
wildly out of range, and so it's something that can be noticed very
quickly at mount time. This is a lucky failure, because (a) it can get
detected right away, and (b) it can be very easily fixed by consulting
one of the backup copies of the block group descriptors. This is what
happened in this case, and the rest of the fsck transcript is consistent
with that.
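If you want to see what e2fsck saw, one thing you can do is compare the
primary group descriptors against the copies read via a backup
superblock. Roughly (a sketch; with 4k blocks the first backup
superblock lives at block 32768):
# where group 0 thinks its block/inode bitmaps live, per the primary descriptors
dumpe2fs /dev/sdb1 | grep -A5 '^Group 0:'
# the same group, as described via the first backup superblock
dumpe2fs -o superblock=32768 -o blocksize=4096 /dev/sdb1 | grep -A5 '^Group 0:'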
The location of block allocation bitmaps never gets changed, so this
sort of thing only happens due to hardware-induced corruption.
Looking at the dumpe2fs output, it looks like the filesystem was created
relatively recently (July 2024) but it doesn't have the metadata
checksum feature enabled, which has been enabled by default for quite a
long time. I'm going to guess that this means you're using a fairly old
version of e2fsprogs (the feature was enabled by default in e2fsprogs
1.43, released in May 2016 [3]).
[3] https://e2fsprogs.sourceforge.net/e2fsprogs-release.html#1.43
You got lucky because the block allocation bitmap location was
corrupted to an obviously invalid value. But if it had been a
low-order bit that had gotten flipped, this could have led to data
corruption before the data and metadata corruption became obvious
enough that ext4 would flag it. Metadata checksums would catch that
kind of error much more quickly --- and this is an example of why RAID
arrays shouldn't be treated as a magic bullet.
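For what it's worth, with a new enough e2fsprogs the feature can also be
turned on after the fact, on an unmounted filesystem. Roughly (a sketch
only; please double-check the tune2fs man page of your e2fsprogs version
before trying it on real data):
umount /dev/md126                     # the filesystem must not be mounted
tune2fs -O metadata_csum /dev/md126   # turn on metadata checksums
e2fsck -f /dev/md126                  # full pass so existing metadata gets checksummed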
> > Did you check for any changes to the md/dm code, or the block layer?
>
> No. Generally, it could be just anything, therefore I see no point even
> starting without good background knowledge. That is why I'm trying to draw
> attention of those who are more aware instead. :-)
The problem is that there are millions and millions of Linux users.
If everyone were to do that, it just wouldn't scale. For companies who
don't want to bother with upgrading to newer versions of software,
that's why they pay the big bucks to companies like Red Hat or SuSE or
Canonical. Or if you are a platinum level customer of Amazon or
Google, you can use Amazon Linux or Google's Container-Optimized OS,
and the cloud company's tech support teams will help you out. :-)
Otherwise, I strongly encourage you to learn, and to take
responsibility for the health of your own system. And ideally, you
can also use that knowledge to help other users out, which is the only
way the free-as-in-beer ecosystem can flourish: by having everybody
helping each other. Who knows, maybe you could even get a job doing
it for a living. :-) :-) :-)
Cheers,
* Re: ext4 damage suspected in between 5.15.167 - 5.15.170
From: Nikolai Zhubr @ 2024-12-14 19:58 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4, stable, linux-kernel, jack
Hi Ted,
On 12/13/24 19:12, Theodore Ts'o wrote:
> stable@kernel.org" to the commit description. However, they are not
> obligated to do that, so there is an auxillary system which uses AI to
> intuit which patches might be a bug fix. There is also automated
> systems that try to automatically figure out which patches might be
Oh, so meanwhile it got even worse than I used to imagine :-) Thanks for
pointing that out.
> Note that some hardware errors can be caused by one-off errors, such
> as cosmic rays causing a bit-flip in memory DIMM. If that happens,
> RAID won't save you, since the error was introduced before an updated
Certainly cosmic rays are a possibility, but based on previous episodes
I'd still rather bet on a more usual "subtle interaction" problem,
either exactly the same as or similar to [1].
I even tried to run an existing test for this particular case as
described in [2] (a rough sketch of that attempt follows the references
below), but it is not too user-friendly and somehow exits abnormally
without actually doing any interesting work. I'll get back to it later
when I have some time.
[1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/
[2] https://lwn.net/Articles/954364/
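(The rough shape of that attempt, for the record - a sketch only,
assuming an xfstests checkout and two spare devices that may be wiped;
device names and the test ID are placeholders:)
# build xfstests (build dependencies are distro-specific)
git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git xfstests
cd xfstests && make
# point it at two disposable devices - both get overwritten!
cat > local.config <<'EOF'
export TEST_DEV=/dev/vdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdc
export SCRATCH_MNT=/mnt/scratch
EOF
sudo ./check generic/NNN    # NNN = the test from [2]; "./check -g quick" runs the quick group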
> The location of block allocation bitmaps never gets changed, so this
> sort of thing only happens due to hardware-induced corruption.
Well, unless e.g. some modified sectors start being flushed to random
wrong offsets, like in [1] above, or something similar.
> Looking at the dumpe2fs output, it looks like it was created
> relatively recently (July 2024) but it doesn't have the metadata
> checksum feature enabled, which has been enabled for quite a long
Yes. That was intentional - for better compatibility with even more
ancient stuff. Maybe time has come to reconsider the approach though.
> You got lucky because it block allocation bitmap location was
> corrupted to an obviously invalid value. But if it had been a
Absolutely. I was really amazed when I realized that :-)
It saved me days or even weeks of unnecessary verification work.
> Otherwise, I strongly encourage you to learn, and to take
> responsibility for the health of your own system. And ideally, you
> can also use that knowledge to help other users out, which is the only
> way the free-as-in-beer ecosystem can flurish; by having everybody
True. Generally I try to follow that, as much as appears possible.
It is sad that direct end-user-to-developer communication for solving
issues is becoming increasingly problematic here.
Anyway, thank you for the friendly words, useful hints and good references!
Regards,
Nick
> helping each other. Who knows, maybe you could even get a job doing
> it for a living. :-) :-) :-)
>
> Cheers,
>
* Re: ext4 damage suspected in between 5.15.167 - 5.15.170
From: Jan Kara @ 2024-12-16 12:59 UTC (permalink / raw)
To: Nikolai Zhubr; +Cc: Theodore Ts'o, linux-ext4, stable, linux-kernel, jack
Hi Nikolai!
On Sat 14-12-24 22:58:24, Nikolai Zhubr wrote:
> On 12/13/24 19:12, Theodore Ts'o wrote:
> > Note that some hardware errors can be caused by one-off errors, such
> > as cosmic rays causing a bit-flip in memory DIMM. If that happens,
> > RAID won't save you, since the error was introduced before an updated
>
> Certainly cosmic rays is a possibility, but based on previous episodes I'd
> still rather bet on a more usual "subtle interaction" problem, either exact
> same or some similar to [1].
> I even tried to run an existing test for this particular case as described
> in [2] but it is not too user-friendly and somehow exits abnormally without
> actually doing any interesting work. I'll get back to it later when I have
> some time.
>
> [1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/
> [2] https://lwn.net/Articles/954364/
>
> > The location of block allocation bitmaps never gets changed, so this
> > sort of thing only happens due to hardware-induced corruption.
>
> Well, unless e.g. some modified sectors start being flushed to random wrong
> offsets, like in [1] above, or something similar.
Note that the above bug led to writing file data to another position in
that file. As such it cannot really lead to metadata corruption.
Corrupting data in a file is a relatively frequent event (given the wide
variety of manipulations we do with file data). OTOH I've never seen
metadata corrupted like this (in particular because ext4 has additional
sanity checks that newly allocated blocks don't overlap with critical fs
metadata). In theory, there could be a software bug leading to writing a
sector to a wrong position, but frankly, in all the cases I've
investigated so far such bugs ended up being HW related.
> > Otherwise, I strongly encourage you to learn, and to take
> > responsibility for the health of your own system. And ideally, you
> > can also use that knowledge to help other users out, which is the only
> > way the free-as-in-beer ecosystem can flurish; by having everybody
>
> True. Generally I try to follow that, as much as appears possible.
> It is sad a direct communication end-user-to-developer for solving issues is
> becoming increasingly problematic here.
On one hand I understand you, on the other hand back in the good old days
(and I remember those as well ;) you wouldn't get much help when running
a kernel more than three years old either. And I understand you're
running a stable kernel that gets at least some updates, but that's meant
more for companies that build their products on top of it and have teams
available for debugging issues. For an end user I find some distribution
kernels (Debian, Ubuntu, openSUSE, Fedora) more suitable, as they get
much more scrutiny before being released than -stable, and also people
there are more willing to look at issues with older kernels (that are
still supported by the distro).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* RE: ext4 damage suspected in between 5.15.167 - 5.15.170
From: David Laight @ 2024-12-16 15:16 UTC (permalink / raw)
To: 'Nikolai Zhubr', Theodore Ts'o
Cc: linux-ext4@vger.kernel.org, stable@vger.kernel.org,
linux-kernel@vger.kernel.org, jack@suse.cz
....
> > The location of block allocation bitmaps never gets changed, so this
> > sort of thing only happens due to hardware-induced corruption.
>
> Well, unless e.g. some modified sectors start being flushed to random
> wrong offsets, like in [1] above, or something similar.
Or cutting the power in the middle of SSD 'wear levelling'.
I've seen a completely trashed disk (sectors in completely the
wrong places) after an unexpected power cut.
David
* Re: ext4 damage suspected in between 5.15.167 - 5.15.170
From: Theodore Ts'o @ 2024-12-16 19:31 UTC (permalink / raw)
To: David Laight
Cc: 'Nikolai Zhubr', linux-ext4@vger.kernel.org,
stable@vger.kernel.org, linux-kernel@vger.kernel.org,
jack@suse.cz
On Mon, Dec 16, 2024 at 03:16:00PM +0000, David Laight wrote:
> ....
> > > The location of block allocation bitmaps never gets changed, so this
> > > sort of thing only happens due to hardware-induced corruption.
> >
> > Well, unless e.g. some modified sectors start being flushed to random
> > wrong offsets, like in [1] above, or something similar.
Well, in the bug that you referenced in [1], what was happening was
that data could get written to the wrong offset in the file under
certain race conditions. This would not be a case of a data block
getting written over some metadata block like the block group
descriptors.
Sectors getting written to the wrong LBAs does happen; there's a reason
why enterprise databases include a checksum in every 4k database
block. But the root cause of that generally tends to be a bit getting
flipped in the LBA number while it is being sent from the CPU to the
controller to the storage device. It's rare, but when it does happen,
it is more often than not hardware-induced --- and again, one of those
things where RAID won't necessarily save you.
> Or cutting the power in the middle of SSD 'wear levelling'.
>
> I've seen a completely trashed disk (sectors in completely the
> wrong places) after an unexpected power cut.
Sure, but that falls into the category of hardware-induced corruption.
There have been non-power-fail-certified SSDs which have had their flash
translation metadata so badly corrupted that you lose everything
(there's a reason why professional photographers use dual SD card
slots, and some may use duct tape to make sure the battery access door
won't fly open if their camera gets dropped).
- Ted