public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* RAID1 two chunks of the same data on the same physical disk, one file keeps being corrupted
@ 2024-06-10 14:56 ein
  2024-07-29  8:43 ` ein
  0 siblings, 1 reply; 6+ messages in thread
From: ein @ 2024-06-10 14:56 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

Dear devs and users,

I have been using BTRFS for a few months in RAID1 mode on Debian 12 (6.1.0-21-amd64 #1 SMP PREEMPT_DYNAMIC Debian 
6.1.90-1 (2024-05-03) x86_64 GNU/Linux, btrfs-progs v6.2).
I created my filesystem by issuing:
mkfs.btrfs -d raid1 -m raid1 /dev/sda1 /dev/sde1 /dev/sdf1

The drives are 2TB WD Reds, a mix of CMR and SMR models with good S.M.A.R.T. stats.
I am using on-die-ECC RAM modules.
I never balanced or replaced any device.
I had a couple of unexpected hangs caused by NVMe power management which made my root fs unavailable, 
but hopefully that has been fixed by installing new firmware for the WD Black NVMe.

How is it possible that btrfs kept both copies of the same chunk of data on the same physical device?

Jun 02 23:27:54 node0 kernel: BTRFS warning (device sdf1): csum failed root 256 ino 259 off 
140290392064 csum 0x1315675d expected csum 0x49271c1b mirror 2
Jun 02 23:27:54 node0 kernel: BTRFS warning (device sdf1): csum failed root 256 ino 259 off 
140290392064 csum 0x1315675d expected csum 0x49271c1b mirror 1
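For reference, the csum values in these log lines are 32-bit because btrfs's default data checksum is CRC-32C, computed per sectorsize block (typically 4 KiB); newer kernels also offer xxhash64, sha256 and blake2b. A minimal sketch of the computation (pure Python, not the kernel's implementation):

```python
# CRC-32C (Castagnoli), reflected, table-driven - one checksum per data block.
CRC32C_POLY = 0x82F63B78  # reversed Castagnoli polynomial

_TABLE = []
for i in range(256):
    crc = i
    for _ in range(8):
        crc = (crc >> 1) ^ CRC32C_POLY if crc & 1 else crc >> 1
    _TABLE.append(crc)

def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF          # initial value
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF   # final XOR

# Standard check value for CRC-32C:
print(hex(crc32c(b"123456789")))  # 0xe3069283
```

A "csum failed" warning simply means this value, stored in the checksum tree at write time, does not match the value recomputed over the block read back from disk.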

The corrupted file is a qcow2 image with Windows 7 on it, and I think I am able to corrupt this file, 
and only this file, on a daily basis.
I restored my filesystem from backup by:
1. wiping any signatures on my HDDs (wipefs),
2. recreating the fs from scratch,
3. copying over new data. A few days later I saw the same issues:

Jun 10 07:22:14 node0 kernel: BTRFS info (device dm-10): read error corrected: ino 259 off 
39193079808 (dev /dev/mapper/vg1-lv1 sector 670864280)
Jun 10 07:22:14 node0 kernel: BTRFS info (device dm-10): read error corrected: ino 259 off 
33579532288 (dev /dev/mapper/vg1-lv1 sector 199031056)

I don't think it's RAM related because:
- the HW is new, the RAM is good quality, and I ran a memory check a couple of months ago,
- it affects only one file; I have other, much busier VMs, while that one mostly stays idle,
- other OS operations seem to have been working perfectly for months.

Sincerely,

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RAID1 two chunks of the same data on the same physical disk, one file keeps being corrupted
  2024-06-10 14:56 RAID1 two chunks of the same data on the same physical disk, one file keeps being corrupted ein
@ 2024-07-29  8:43 ` ein
  2024-07-29 10:05   ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: ein @ 2024-07-29  8:43 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

On 10.06.2024 16:56, ein wrote:
> [...]
>
> I don't think it's RAM related because:
> - the HW is new, the RAM is good quality, and I ran a memory check a couple of months ago,
> - it affects only one file; I have other, much busier VMs, while that one mostly stays idle,
> - other OS operations seem to have been working perfectly for months.
>
> Sincerely,

Hi,

after spotting this:
https://www.reddit.com/r/GlobalOffensive/comments/1eb00pg/intel_processors_are_causing_significant/

I decided to move from:
cpupower frequency-set -g performance
to:
cpupower frequency-set -g powersave

I have got:

~# lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel(R) Corporation
  Model name:             13th Gen Intel(R) Core(TM) i9-13900K
    BIOS Model name:      13th Gen Intel(R) Core(TM) i9-13900K To Be Filled By O.E.M. CPU @ 5.3GHz

One week without corruptions.

Sincerely,


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RAID1 two chunks of the same data on the same physical disk, one file keeps being corrupted
  2024-07-29  8:43 ` ein
@ 2024-07-29 10:05   ` Qu Wenruo
  2025-01-13 15:54     ` ein
  0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2024-07-29 10:05 UTC (permalink / raw)
  To: ein, linux-btrfs@vger.kernel.org



On 2024/7/29 18:13, ein wrote:
> On 10.06.2024 16:56, ein wrote:
>> [...]
>>
>> I don't think it's RAM related because:
>> - the HW is new, the RAM is good quality, and I ran a memory check a couple of months ago,
>> - it affects only one file; I have other, much busier VMs, while that
>> one mostly stays idle,
>> - other OS operations seem to have been working perfectly for months.
>>
>> Sincerely,
>
> Hi,
>
> after spotting this:
> https://www.reddit.com/r/GlobalOffensive/comments/1eb00pg/intel_processors_are_causing_significant/
>
> I decided to move from:
> cpupower frequency-set -g performance
> to:
> cpupower frequency-set -g powersave
>
> I have got:
>
> ~# lscpu
> Architecture:             x86_64
>   CPU op-mode(s):         32-bit, 64-bit
>   Address sizes:          46 bits physical, 48 bits virtual
>   Byte Order:             Little Endian
> CPU(s):                   32
>   On-line CPU(s) list:    0-31
> Vendor ID:                GenuineIntel
>   BIOS Vendor ID:         Intel(R) Corporation
>   Model name:             13th Gen Intel(R) Core(TM) i9-13900K
>     BIOS Model name:      13th Gen Intel(R) Core(TM) i9-13900K To Be
> Filled By O.E.M. CPU @ 5.3GHz
>
> One week without corruptions.

Normally we only suspect the hardware when we have enough evidence.
(e.g. proof of bitflip etc)
Even if the hardware is known to have problems.

In your case, I still do not believe it's hardware problem.

 > - it affects only one file, I have other much busier VMs, that one
mostly stays idle,

Due to btrfs' datacsum behavior, it's very sensitive to page content
change during writeback.

Normally this should not happen for buffered writes as btrfs has locked
the page cache.

But for direct IO it is still quite possible that one process submitted a
direct IO and, while the IO was still under way, userspace changed the
contents of that page.

In that case, the btrfs csum is calculated from the old contents, but the
on-disk data holds the new contents, causing the csum mismatch.
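The race Qu describes can be simulated sequentially (a sketch only: Python's zlib.crc32 stands in for btrfs's crc32c, and no actual O_DIRECT submission is made):

```python
import os
import tempfile
import zlib  # zlib.crc32 as a stand-in for btrfs's crc32c

# A 4 KiB "page" that the VM guest keeps rewriting.
page = bytearray(os.urandom(4096))

# 1. The filesystem computes the data checksum at IO-submit time...
csum_at_submit = zlib.crc32(bytes(page))

# 2. ...but userspace modifies the page while the direct IO is in flight.
page[0] ^= 0xFF

# 3. The device then transfers the *current* page contents to disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(page)
    path = f.name

# 4. On read-back, the stored csum no longer matches the on-disk data,
#    which is what "csum failed ... mirror 1/2" reports.
with open(path, "rb") as f:
    csum_on_read = zlib.crc32(f.read())
os.unlink(path)

print(csum_at_submit != csum_on_read)  # True
```

Since both RAID1 mirrors are written from the same (modified) buffer, both copies fail verification against the same stale checksum, which matches the "mirror 1" and "mirror 2" warnings in the original report.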

So I'm wondering what's the workload inside the VM?

Thanks,
Qu
>
> Sincerely,
>
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RAID1 two chunks of the same data on the same physical disk, one file keeps being corrupted
  2024-07-29 10:05   ` Qu Wenruo
@ 2025-01-13 15:54     ` ein
  2025-01-13 20:39       ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: ein @ 2025-01-13 15:54 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Linux fs Btrfs

On 29.07.2024 12:05, Qu Wenruo wrote:
> On 10.06.2024 16:56, ein wrote:
>>> [...]
>>> I don't think it's RAM related because:
>>> - the HW is new, the RAM is good quality, and I ran a memory check a couple of months ago,
>>> - it affects only one file; I have other, much busier VMs, while that
>>> one mostly stays idle,
>>> - other OS operations seem to have been working perfectly for months.
>>
>> [...]
>>
>> after spotting this:
>> https://www.reddit.com/r/GlobalOffensive/comments/1eb00pg/intel_processors_are_causing_significant/
>>
>> I decided to move from:
>> cpupower frequency-set -g performance
>> to:
>> cpupower frequency-set -g powersave
>>
>> I have got:
>>
>> ~# lscpu
>> Architecture:             x86_64
>>   CPU op-mode(s):         32-bit, 64-bit
>>   Address sizes:          46 bits physical, 48 bits virtual
>>   Byte Order:             Little Endian
>> CPU(s):                   32
>>   On-line CPU(s) list:    0-31
>> Vendor ID:                GenuineIntel
>>   BIOS Vendor ID:         Intel(R) Corporation
>>   Model name:             13th Gen Intel(R) Core(TM) i9-13900K
>>     BIOS Model name:      13th Gen Intel(R) Core(TM) i9-13900K To Be
>> Filled By O.E.M. CPU @ 5.3GHz
>>
>> One week without corruptions.
Hi Qu, thanks for the answer.
> Normally we only suspect the hardware when we have enough evidence.
> (e.g. proof of bitflip etc)
> Even if the hardware is known to have problems.
I think I have such proof - see (1) below.
> In your case, I still do not believe it's hardware problem.
>
> > - it affects only one file, I have other much busier VMs, that one
> mostly stays idle,
>
> Due to btrfs' datacsum behavior, it's very sensitive to page content
> change during writeback.
>
> Normally this should not happen for buffered writes as btrfs has locked
> the page cache.
>
> But for Direct IO it's still very possible that one process submitted a
> direct IO, and when the IO was still under way, the user space changed
> the contents of that page.
>
> In that case, btrfs csum is calculated using that old contents, but the
> on-disk data is the new contents, causing the csum mismatch.
>
> So I'm wondering what's the workload inside the VM?

As far as I know, in such a configuration there is no writeback:

<disk type="file" device="disk">
   <driver name="qemu" type="qcow2" cache="none" discard="unmap"/>
   <source file="/var/lib/libvirt/images-red-btrfs/dell.qcow2" index="2"/>
   <backingStore/>
   <target dev="vda" bus="virtio"/>
   <alias name="virtio-disk0"/>
   <address type="pci" domain="0x0000" bus="0x00" slot="0x04" function="0x0"/>
</disk>
[...]
<controller type="pci" index="0" model="pci-root">
   <alias name="pci.0"/>
</controller>

This is a mostly empty Win7 virtual machine with a very small SQLite database (100-500 MiB) used by 
some network monitoring tool.

(1)
It took almost a year; I spent hundreds of hours and thousands of dollars chasing this issue:
- tried 4 different new SATA controllers, from the cheap ASM106X series to DC-grade HBAs like LSI,
- replaced all SATA cables multiple times,
- replaced the WD Red HDDs (mix of CMR/SMR) with WD Red SA500 SSDs.
That part changed nothing. I experienced a lot of PCIe link issues: disappearing SATA drives, 
disappearing NVMe drives (sometimes both), USB link problems, etc.
But I don't think the link issues were related - the corruption happens without them (no indication 
of a link reset in dmesg).

- RMAed the CPU from an i9-13900K to an i9-14900K,
- tried every available Intel CPU microcode update packaged as a BIOS update by the mainboard vendor.
This part made the situation better, but I could still recreate the corruption errors. As time went 
on, when running in "performance" mode the issues appeared more often and were more severe. Every 
time, switching from performance mode to powersave (lower voltage) made the CPU more stable.

The reproduction process looked as follows:
- shut the VM off,
- defragment the filesystem (btrfs filesystem defragment),
- turn the VM on,
- run defrag/chkdsk inside the VM.
The errors appeared almost immediately. There was a correlation in how often it happened:
if the VM image was very fragmented in btrfs, the probability of corruption was lower.

Three months after the RMA, the i9-14900K started to have threading issues and to leave zombie 
processes in performance mode. Powersave mode fixed that as well, and it ran stable.

Finally, I replaced my mainboard (it was an X13SAE-F) with an Intel Z890 mobo and the latest CPU 
generation, leaving the whole IO stack intact (same chassis, cables, controllers and disks).
I ran scrub and balance; this VM had one small 4096-byte unrecoverable error on a bluescreen memory 
dump file, and everything has worked fine for a couple of days now. I can't reproduce the corruption 
with the above method anymore.
I used ddrescue to reread everything I could from btrfs (the one file used by the mentioned VM) and 
just replaced the file after ddrescue was done.

On Friday last week I asked Intel for a refund.

I am positively surprised by how much pain this btrfs filesystem (RAID10 for data and metadata) 
handled over the last year. Great job, devs, keep it up!

Sincerely,
e.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RAID1 two chunks of the same data on the same physical disk, one file keeps being corrupted
  2025-01-13 15:54     ` ein
@ 2025-01-13 20:39       ` Qu Wenruo
  2025-01-16 14:55         ` ein
  0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2025-01-13 20:39 UTC (permalink / raw)
  To: ein; +Cc: Linux fs Btrfs



On 2025/1/14 02:24, ein wrote:
> On 29.07.2024 12:05, Qu Wenruo wrote:
>> On 10.06.2024 16:56, ein wrote:
>>>> [...]
>>>> I don't think it's RAM related because:
>>>> - the HW is new, the RAM is good quality, and I ran a memory check a
>>>> couple of months ago,
>>>> - it affects only one file; I have other, much busier VMs, while that
>>>> one mostly stays idle,
>>>> - other OS operations seem to have been working perfectly for months.
>>>
>>> [...]
>>>
>>> after spotting this:
>>> https://www.reddit.com/r/GlobalOffensive/comments/1eb00pg/
>>> intel_processors_are_causing_significant/
>>>
>>> I decided to move from:
>>> cpupower frequency-set -g performance
>>> to:
>>> cpupower frequency-set -g powersave
>>>
>>> I have got:
>>>
>>> ~# lscpu
>>> Architecture:             x86_64
>>>   CPU op-mode(s):         32-bit, 64-bit
>>>   Address sizes:          46 bits physical, 48 bits virtual
>>>   Byte Order:             Little Endian
>>> CPU(s):                   32
>>>   On-line CPU(s) list:    0-31
>>> Vendor ID:                GenuineIntel
>>>   BIOS Vendor ID:         Intel(R) Corporation
>>>   Model name:             13th Gen Intel(R) Core(TM) i9-13900K
>>>     BIOS Model name:      13th Gen Intel(R) Core(TM) i9-13900K To Be
>>> Filled By O.E.M. CPU @ 5.3GHz
>>>
>>> One week without corruptions.
> Hi Qu,  thank for the answer.
>> Normally we only suspect the hardware when we have enough evidence.
>> (e.g. proof of bitflip etc)
>> Even if the hardware is known to have problems.
> I think I have those - proofs. (1)
>> In your case, I still do not believe it's hardware problem.
>>
>> > - it affects only one file, I have other much busier VMs, that one
>> mostly stays idle,
>>
>> Due to btrfs' datacsum behavior, it's very sensitive to page content
>> change during writeback.
>>
>> Normally this should not happen for buffered writes as btrfs has locked
>> the page cache.
>>
>> But for Direct IO it's still very possible that one process submitted a
>> direct IO, and when the IO was still under way, the user space changed
>> the contents of that page.
>>
>> In that case, btrfs csum is calculated using that old contents, but the
>> on-disk data is the new contents, causing the csum mismatch.
>>
>> So I'm wondering what's the workload inside the VM?
>
> As far as I know in such configuration there's no writeback:
>
> <disk type="file" device="disk">
>    <driver name="qemu" type="qcow2" cache="none" discard="unmap"/>

cache="none" means direct IO.

Exactly the problem I mentioned: direct IO with the data changed during
writeback.

You can change it to cache="writeback", which should resolve the
false-alert mismatch.
(Or simply set NODATASUM on the disk image file.)
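Applied to the domain XML quoted earlier, the first suggestion amounts to changing only the cache attribute (a sketch; the other attributes are copied from the poster's config):

```xml
<disk type="file" device="disk">
  <!-- cache="writeback" routes guest writes through the host page cache,
       avoiding the direct-IO page-modification race -->
  <driver name="qemu" type="qcow2" cache="writeback" discard="unmap"/>
  <source file="/var/lib/libvirt/images-red-btrfs/dell.qcow2"/>
  <target dev="vda" bus="virtio"/>
</disk>
```

The NODATASUM alternative is usually achieved by setting the No_COW attribute (`chattr +C`) on a freshly created, empty image file, since nodatacow implies no data checksums on btrfs; the attribute does not reliably take effect on a file that already contains data.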

Thanks,
Qu

>    <source file="/var/lib/libvirt/images-red-btrfs/dell.qcow2" index="2"/>
>    <backingStore/>
>    <target dev="vda" bus="virtio"/>
>    <alias name="virtio-disk0"/>
>    <address type="pci" domain="0x0000" bus="0x00" slot="0x04"
> function="0x0"/>
> </disk>
> [...]
> <controller type="pci" index="0" model="pci-root">
>    <alias name="pci.0"/>
> </controller>
>
> This is mostly empty Win7 virtual machine with very small SQLite
> database (100-500MiB) with some network monitoring tool.
>
> (1)
> It took almost a year; I spent hundreds of hours and thousands of $
> chasing this issue:
> - tried 4 different new SATA controllers, from cheap ASM106X series to
> DC grade HBA like LSI,
> - multiple times replaced all SATA cables,
> - replacing HDDs WD Red drives (mix of CMR/SMR) to WD Red SSDs SA500,
> That part changed nothing. I experienced a lot of PCI-E link issues
> like, disappearing SATA drives, disappearing NVME drives - sometimes
> both of them, USB link problems etc.
> But I don't think the link issues were related - the corruption happens
> without them (no indication of link reset in dmesg).
>
> - RMA the CPU from i9-13900k to i9-14900k,
> - try every available Intel CPU microcode update packaged as BIOS update
> by mainboard vendor.
> This part made the situation better, but I still could recreate
> corruption errors. As time went on when running in the "performance"
> mode, the issues appeared often and were more severe. Every time
> switching from performance mode to powersave (lower voltage) made the
> CPU more stable.
>
> The process of recreation looked as follows.
> - shut the VM off,
> - defrag the filesystem (btrfs filesystem defragment),
> - turn the VM on,
> - defrag/chkdsk on VM.
> The errors appeared almost immediately. There was correlation how often
> it happens.
> If the VM image was very fragmented in btrfs, then the probability of
> corruption was lower.
>
> Three months after the RMA, the i9-14900K started to have threading
> issues and to leave zombie processes in performance mode. Powersave mode
> fixed that as well, and it ran stable.
>
> Finally, I replaced my mainboard (it was X13SAE-F) with Intel Z890 mobo
> and the latest CPU generation leaving whole IO stack intact (same:
> chassis, cables, controllers and disks).
> I ran scrub, balance, this VM had one small 4096b unrecoverable error on
> bluescreen memory dump file and everything works fine from couple of
> days. I can't reproduce it with above method anymore.
> I used ddrescue to reread everything I could from btrfs (this one file
> used by mentioned VM) and just replaced the file after ddrescue was done.
>
> On Friday last week I asked Intel for refund.
>
> I am positively surprised how much pain this btrfs filesystem (RAID10
> for data and metadata) handled over last year. Great job devs, keep it up!
>
> Sincerely,
> e.
>
>


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RAID1 two chunks of the same data on the same physical disk, one file keeps being corrupted
  2025-01-13 20:39       ` Qu Wenruo
@ 2025-01-16 14:55         ` ein
  0 siblings, 0 replies; 6+ messages in thread
From: ein @ 2025-01-16 14:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Linux fs Btrfs

On 13.01.2025 21:39, Qu Wenruo wrote:
> On 2025/1/14 02:24, ein wrote:
>> On 29.07.2024 12:05, Qu Wenruo wrote:
>>> On 10.06.2024 16:56, ein wrote:
>>> In your case, I still do not believe it's hardware problem.
>>>
>>> > - it affects only one file, I have other much busier VMs, that one
>>> mostly stays idle,
>>>
>>> Due to btrfs' datacsum behavior, it's very sensitive to page content
>>> change during writeback.
>>>
>>> Normally this should not happen for buffered writes as btrfs has locked
>>> the page cache.
>>>
>>> But for Direct IO it's still very possible that one process submitted a
>>> direct IO, and when the IO was still under way, the user space changed
>>> the contents of that page.
>>>
>>> In that case, btrfs csum is calculated using that old contents, but the
>>> on-disk data is the new contents, causing the csum mismatch.
>>>
>>> So I'm wondering what's the workload inside the VM?
>>
>> As far as I know in such configuration there's no writeback:
>>
>> <disk type="file" device="disk">
>>    <driver name="qemu" type="qcow2" cache="none" discard="unmap"/>
>
> cache="none" means direct IO.
>
> Exactly the problem I mentioned, direct IO with data changed during
> writeback.
>
> You can change it to "cache=writeback" then it should resolve the false
> alert mismatch.
> (Or just simply change the disk image file to NODATASUM)
Hi Qu.
You were right - those errors still happened.
Switching to cache=writeback seems to have helped for now.
Thank you.
>>    <source file="/var/lib/libvirt/images-red-btrfs/dell.qcow2" index="2"/>
>>    <backingStore/>
>>    <target dev="vda" bus="virtio"/>
>>    <alias name="virtio-disk0"/>
>>    <address type="pci" domain="0x0000" bus="0x00" slot="0x04"
>> function="0x0"/>
>> </disk>
>> [...]
>> <controller type="pci" index="0" model="pci-root">
>>    <alias name="pci.0"/>
>> </controller>
>>
>> This is mostly empty Win7 virtual machine with very small SQLite
>> database (100-500MiB) with some network monitoring tool.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-01-16 14:55 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-10 14:56 RAID1 two chunks of the same data on the same physical disk, one file keeps being corrupted ein
2024-07-29  8:43 ` ein
2024-07-29 10:05   ` Qu Wenruo
2025-01-13 15:54     ` ein
2025-01-13 20:39       ` Qu Wenruo
2025-01-16 14:55         ` ein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox